How to use tathadn/codeq-qwen2.5-coder-7b-dpo-r2 with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "tathadn/codeq-qwen2.5-coder-7b-dpo-r2")

LoRA adapter for Qwen/Qwen2.5-Coder-7B-Instruct, trained with DPO on self-generated debugging preference pairs (Round 2 of the CodeQ iterative DPO pipeline).
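
For readers unfamiliar with DPO, the per-pair objective can be sketched as below. This is the standard DPO loss as generally published, not code from the CodeQ pipeline. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions; this card does not state the training hyperparameters.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from this model card
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen fix and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

In a debugging setting, the chosen response would be a fix that passes verification and the rejected response one that does not; how the pairs were actually selected is defined by the dataset, not by this sketch.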
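| Field | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Task type | CAUSAL_LM |
| Training data | tathadn/codeq-debugbench-dpo-pairs |

| Setting | Accuracy |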
|---|---|
| MCTS (search at inference) | 92.0% (46/50) |
| Single-pass full rewrite | 55.6% (40/72) |
The large gap between MCTS and single-pass accuracy reflects the benefit of inference-time search: the policy proposes candidate fixes that are verified and refined across a search tree, rather than committed to in one shot.
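
The propose-and-verify idea can be illustrated with a much simpler loop than MCTS. The sketch below is not the CodeQ search procedure: it tries candidate fixes in order and returns the first one that passes a set of test cases. The helper names (`passes_tests`, `first_verified_fix`) and the hard-coded candidates, which stand in for model proposals, are hypothetical.

```python
from typing import Any, Optional, Sequence, Tuple

TestCase = Tuple[tuple, Any]  # (positional args, expected return value)


def passes_tests(source: str, func_name: str, cases: Sequence[TestCase]) -> bool:
    """Run candidate source and check the named function against all cases."""
    namespace: dict = {}
    try:
        # exec is acceptable for this toy example only; real candidates
        # should run in a sandboxed subprocess with a timeout.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def first_verified_fix(
    candidates: Sequence[str], func_name: str, cases: Sequence[TestCase]
) -> Optional[str]:
    """Return the first candidate that passes every test, or None."""
    for source in candidates:
        if passes_tests(source, func_name, cases):
            return source
    return None


# Hypothetical candidates standing in for fixes proposed by the policy.
candidates = [
    "def add(a, b):\n    return a - b\n",  # still buggy
    "def add(a, b):\n    return a + b\n",  # correct
]
cases = [((1, 2), 3), ((0, 0), 0)]
fix = first_verified_fix(candidates, "add", cases)
```

A single-pass policy corresponds to committing to the first candidate without verification; a tree search additionally refines failed candidates instead of only moving on to the next one.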
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER = "tathadn/codeq-qwen2.5-coder-7b-dpo-r2"

# Load the tokenizer and base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

# Build a chat-formatted prompt; replace the elided parts with the buggy code.
messages = [
    {"role": "system", "content": "You are an expert Python debugger."},
    {"role": "user", "content": "Fix the following buggy function...\n\n```python\n...\n```"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Decode greedily and print only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
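
The decoded reply is free-form text, so a downstream harness usually needs to pull the code out of it. A minimal sketch follows, assuming the model answers with a fenced Python block as the prompt above does; this card does not guarantee that output format. The helper `extract_python_block` and the sample reply are hypothetical.

```python
import re
from typing import Optional

# The fence marker is built programmatically so this example does not
# contain a literal nested code fence.
FENCE = "`" * 3
BLOCK_RE = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_block(response: str) -> Optional[str]:
    """Return the body of the first fenced code block, or None if absent."""
    match = BLOCK_RE.search(response)
    return match.group(1).rstrip() if match else None


# Hypothetical model reply used only to exercise the helper.
reply = (
    "Here is the fix:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nDone."
)
code = extract_python_block(reply)
```

Returning `None` when no block is found lets the caller decide whether to retry generation or treat the whole reply as code.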
To merge the adapter into the base weights:
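# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")
tokenizer.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")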
Intended for research on iterative preference optimization for code debugging, and for use as a stronger single-pass or MCTS-driven policy than the base Qwen2.5-Coder-7B-Instruct model on Python bug-fixing tasks.