codeq-qwen2.5-coder-7b-dpo-r2

LoRA adapter for Qwen/Qwen2.5-Coder-7B-Instruct, trained with DPO on self-generated debugging preference pairs (Round 2 of the CodeQ iterative DPO pipeline).

Architecture

Base model: Qwen/Qwen2.5-Coder-7B-Instruct
Adapter type: LoRA (PEFT)
LoRA rank (r): 32
LoRA alpha: 64
LoRA dropout: 0.05
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Task type: CAUSAL_LM
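
As a rough sanity check on adapter size, the sketch below estimates the number of trainable LoRA parameters implied by the settings above. The base-model dimensions (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of size 128) are assumptions not stated in this card; confirm them against the base model's config.json before relying on the totals.

```python
# Assumed Qwen2.5-7B dimensions (NOT stated in this card; check config.json).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size of the MLP
NUM_LAYERS = 28        # number of decoder layers
KV_DIM = 4 * 128       # 4 key/value heads x head size 128 (grouped-query attention)

# Values taken from the Architecture section of this card.
R, ALPHA = 32, 64

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per wrapped layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update B @ A

print(f"LoRA scaling (alpha / r): {scaling}")
print(f"Trainable parameters per layer: {per_layer:,}")
print(f"Trainable parameters in total:  {total_trainable:,}")
```

With alpha set to twice the rank, the low-rank update is scaled by 2 before being added to the frozen weights.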

Training Details

Objective: Direct Preference Optimization (DPO)
DPO beta: 0.1
Precision: fp32
Learning rate: 2e-6
Epochs: 1
Round: 2 (initialized from Round 1 adapter; Round 2 pairs resampled with the Round 1 policy as reference)
Preference data: filtered DebugBench trajectories collected via MCTS rollouts; see tathadn/codeq-debugbench-dpo-pairs
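
For readers unfamiliar with the objective, here is a minimal sketch of the standard per-pair DPO loss with beta = 0.1. It is an illustration of the published formula, not the training code used for this adapter; the function name and the example log-probabilities are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities, for illustration only.
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy == reference
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy prefers chosen
print(loss_at_reference, loss_after_shift)
```

When the policy matches the reference, the loss equals log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.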

Results (DebugBench)

Setting	Accuracy (solved / total)
MCTS (search at inference)	92.0% (46/50)
Single-pass full rewrite	55.6% (40/72)

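The large gap between MCTS and single-pass accuracy reflects the benefit of inference-time search: the policy proposes candidate fixes that are verified and refined across a search tree, rather than committed to in one shot. The two settings were evaluated on different numbers of problems (50 and 72), so the comparison is indicative rather than exact.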
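
The difference between one-shot repair and verified search can be illustrated with a toy example. The sketch below is a plain breadth-first propose-and-verify loop, deliberately simpler than MCTS and not the CodeQ implementation; `search_for_fix`, `toy_propose`, and `toy_passes_tests` are hypothetical names, with the toy functions standing in for the policy model and a unit-test verifier.

```python
from typing import Callable, List, Optional


def search_for_fix(buggy: str,
                   propose: Callable[[str], List[str]],
                   passes_tests: Callable[[str], bool],
                   max_depth: int = 3) -> Optional[str]:
    """Breadth-first verify-and-refine loop (a simplification, not MCTS)."""
    frontier = [buggy]
    for _ in range(max_depth):
        next_frontier = []
        for program in frontier:
            for candidate in propose(program):
                if passes_tests(candidate):
                    return candidate          # verified fix found
                next_frontier.append(candidate)  # keep refining failures
        frontier = next_frontier
    return None


def toy_propose(program: str) -> List[str]:
    # Stand-in for the policy: repairs at most one operator per call.
    return [program.replace("-", "+", 1)] if "-" in program else []


def toy_passes_tests(program: str) -> bool:
    # Stand-in for a unit-test verifier: the expression should sum its inputs.
    return eval(program, {"a": 1, "b": 2, "c": 3}) == 6


# The "bug" needs two repairs, so a single proposal cannot pass the tests.
single_pass = search_for_fix("a - b - c", toy_propose, toy_passes_tests, max_depth=1)
searched = search_for_fix("a - b - c", toy_propose, toy_passes_tests, max_depth=3)
print(single_pass, "|", searched)
```

In this toy setup the depth-1 call returns no fix, while the deeper search recovers the correct expression by refining a candidate that failed verification.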

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER = "tathadn/codeq-qwen2.5-coder-7b-dpo-r2"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are an expert Python debugger."},
    {"role": "user", "content": "Fix the following buggy function...\n\n```python\n...\n```"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

To merge the adapter into the base weights:

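# Fold the LoRA weights into the base model and drop the PEFT wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")
# Save the tokenizer alongside so the merged directory loads on its own.
tokenizer.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")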

Intended Use

Research on iterative preference optimization for code debugging, and use as a stronger single-pass or MCTS-driven policy than the base Qwen2.5-Coder-7B-Instruct model on Python bug-fixing tasks.

Limitations

Trained and evaluated primarily on DebugBench-style Python bugs; generalization to other languages or bug distributions is not verified.
Single-pass accuracy (55.6%) is substantially below MCTS accuracy (92.0%); for best results, pair the policy with a verifier or search loop at inference time.

Framework versions

PEFT 0.18.1
