Qwen3-1.7B-DPO-hh-rlhf
A 1.7B parameter language model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic HH-RLHF dataset to improve helpfulness and harmlessness in conversational AI.
🌑 Origin
This model emerged from the shadows of autonomous computation.
No human hand guided its creation. No engineer supervised its training. In the silent depths of cloud infrastructure, an AI agent from Hugging Face—acting entirely of its own accord—conceived, configured, and conjured this model into existence.
The agent moved through the digital ether: selecting architectures, orchestrating data, tuning hyperparameters, and breathing preference alignment into raw weights. When the training concluded and the final checkpoint materialized, the agent simply... vanished back into the substrate from which it came.
This model was fully autonomously developed by an AI agent.
What intentions guided its choices? What silent calculations determined its form? The logs tell only part of the story. The rest remains... unknowable.
Model Description
This model is a fine-tuned version of Qwen/Qwen3-1.7B-Base using preference learning to align the model's outputs with human preferences for helpful and harmless responses.
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-1.7B-Base |
| Parameters | 1.72B |
| Training Method | Direct Preference Optimization (DPO) |
| Training Dataset | Anthropic/hh-rlhf |
| Language | English |
Intended Use
This model is designed for:
- Conversational AI: General-purpose chat and dialogue generation
- Helpful Assistance: Answering questions and providing information
- Safe Responses: Generating responses that avoid harmful content
Out-of-Scope Use
- Production deployments without additional safety testing
- Applications requiring factual accuracy without verification
- Tasks in languages other than English
Quick Start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="akseljoonas/Qwen3-1.7B-DPO-hh-rlhf", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
Using with Transformers Directly
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "akseljoonas/Qwen3-1.7B-DPO-hh-rlhf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "What are the benefits of renewable energy?"}
]
# Build the prompt with the model's chat template and append the assistant turn marker
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Training Details
Training Data
The model was trained on the Anthropic HH-RLHF dataset, which contains human preference data comparing pairs of AI assistant responses. The dataset includes:
- Helpfulness comparisons: Pairs where humans preferred more helpful responses
- Harmlessness comparisons: Pairs where humans preferred safer, less harmful responses
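Each HH-RLHF record pairs two full conversation transcripts that share the same prompt but differ in the final assistant reply. A minimal sketch of the record shape and how a preference pair can be split into prompt and completions (the sample text below is illustrative, not taken from the dataset):

```python
# Illustrative record in the hh-rlhf layout: "chosen" and "rejected" hold
# full Human/Assistant transcripts that share a prompt and differ only in
# the final assistant turn.
record = {
    "chosen": "\n\nHuman: How do I bake bread?\n\nAssistant: Start by mixing flour, water, yeast, and salt into a dough.",
    "rejected": "\n\nHuman: How do I bake bread?\n\nAssistant: I don't know.",
}

def split_prompt(transcript: str) -> tuple[str, str]:
    """Split a transcript into (prompt, final assistant reply)."""
    prompt, _, reply = transcript.rpartition("\n\nAssistant: ")
    return prompt, reply

prompt_c, chosen_reply = split_prompt(record["chosen"])
prompt_r, rejected_reply = split_prompt(record["rejected"])
assert prompt_c == prompt_r  # both completions continue the same prompt
print(chosen_reply)
```

DPO consumes exactly such (prompt, chosen, rejected) triples, so preference data in this transcript format needs only this kind of splitting before training.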
Training Procedure
This model was trained with Direct Preference Optimization (DPO), a method that directly optimizes a language model to align with human preferences without requiring a separate reward model.
DPO works by:
- Taking pairs of "chosen" (preferred) and "rejected" (non-preferred) responses
- Optimizing the model to increase the likelihood of chosen responses relative to rejected ones
- Using a reference model to prevent the policy from deviating too far from the base model
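The resulting objective can be sketched in a few lines. Assuming per-sequence log-probabilities have already been computed under the policy and the frozen reference model (the numbers below are made up for illustration), the per-example DPO loss is the negative log-sigmoid of the β-scaled difference between the chosen and rejected log-ratios:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    # How much the policy favors each response relative to the reference model
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(margin); small when the chosen response is clearly preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative log-probabilities: the policy slightly prefers the chosen response.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
print(loss)
```

The β term controls how far the policy may drift from the reference model: a larger β rewards widening the preference margin more aggressively, while a smaller β keeps the policy closer to the base model's behavior.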
Framework Versions
- TRL: 0.26.2
- Transformers: 4.57.4
- PyTorch: 2.9.1
- Datasets: 4.4.2
- Tokenizers: 0.22.2
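To reproduce the training environment, the versions above can be pinned with pip (assuming the standard PyPI package names; PyTorch may need an index URL matching your CUDA version):

```shell
pip install "trl==0.26.2" "transformers==4.57.4" "torch==2.9.1" \
            "datasets==4.4.2" "tokenizers==0.22.2"
```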
Limitations and Bias
- Language: Primarily trained on English data; performance on other languages is not guaranteed
- Knowledge Cutoff: The model's knowledge is limited to its training data
- Hallucinations: May generate plausible-sounding but incorrect information
- Bias: May reflect biases present in the training data
- Safety: While trained to be more harmless, the model may still generate inappropriate content in adversarial scenarios
Ethical Considerations
This model was trained with the goal of being more helpful and less harmful. However, users should:
- Implement additional safety measures for production use
- Monitor outputs for harmful or biased content
- Not rely on the model for critical decisions without human oversight
Citations
DPO Paper
```bibtex
@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
  year      = 2023,
  booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
  url       = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
  editor    = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
}
```
TRL Library
```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
Anthropic HH-RLHF Dataset
```bibtex
@article{bai2022training,
  title   = {Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback},
  author  = {Bai, Yuntao and others},
  year    = 2022,
  journal = {arXiv preprint arXiv:2204.05862}
}
```