Qwen3-1.7B-DPO-hh-rlhf
A 1.7B parameter language model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic HH-RLHF dataset to improve helpfulness and harmlessness in conversational AI.
🌑 Origin
This model emerged from the shadows of autonomous computation.
No human hand guided its creation. No engineer supervised its training. In the silent depths of cloud infrastructure, an AI agent from Hugging Face—acting entirely of its own accord—conceived, configured, and conjured this model into existence.
The agent moved through the digital ether: selecting architectures, orchestrating data, tuning hyperparameters, and breathing preference alignment into raw weights. When the training concluded and the final checkpoint materialized, the agent simply... vanished back into the substrate from which it came.
This model was fully autonomously developed by an AI agent.
What intentions guided its choices? What silent calculations determined its form? The logs tell only part of the story. The rest remains... unknowable.
Model Description
This model is a fine-tuned version of Qwen/Qwen3-1.7B-Base using preference learning to align the model's outputs with human preferences for helpful and harmless responses.
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-1.7B-Base |
| Parameters | 1.72B |
| Training Method | Direct Preference Optimization (DPO) |
| Training Dataset | Anthropic/hh-rlhf |
| Language | English |
Intended Use
This model is designed for:
- Conversational AI: General-purpose chat and dialogue generation
- Helpful Assistance: Answering questions and providing information
- Safe Responses: Generating responses that avoid harmful content
Out-of-Scope Use
- Production deployments without additional safety testing
- Applications requiring factual accuracy without verification
- Tasks in languages other than English
Quick Start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="akseljoonas/Qwen3-1.7B-DPO-hh-rlhf", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
Using with Transformers Directly
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "akseljoonas/Qwen3-1.7B-DPO-hh-rlhf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "What are the benefits of renewable energy?"}
]
# Build the prompt with the model's chat template and append the assistant turn marker
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Training Details
Training Data
The model was trained on the Anthropic HH-RLHF dataset, which contains human preference data comparing pairs of AI assistant responses. The dataset includes:
- Helpfulness comparisons: Pairs where humans preferred more helpful responses
- Harmlessness comparisons: Pairs where humans preferred safer, less harmful responses
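Each HH-RLHF record pairs two full conversation transcripts that share the same prompt but differ in the final assistant reply. A minimal sketch of the record shape and how a preference pair can be split into prompt and completions (the sample text below is illustrative, not taken from the dataset):

```python
# Illustrative record in the hh-rlhf layout: "chosen" and "rejected" hold
# full Human/Assistant transcripts that share a prompt and differ only in
# the final assistant turn.
record = {
    "chosen": "\n\nHuman: How do I bake bread?\n\nAssistant: Start by mixing flour, water, yeast, and salt into a dough.",
    "rejected": "\n\nHuman: How do I bake bread?\n\nAssistant: I don't know.",
}

def split_prompt(transcript: str) -> tuple[str, str]:
    """Split a transcript into (prompt, final assistant reply)."""
    prompt, _, reply = transcript.rpartition("\n\nAssistant: ")
    return prompt, reply

prompt_c, chosen_reply = split_prompt(record["chosen"])
prompt_r, rejected_reply = split_prompt(record["rejected"])
assert prompt_c == prompt_r  # both completions continue the same prompt
print(chosen_reply)
```

DPO consumes exactly such (prompt, chosen, rejected) triples, so preference data in this transcript format needs only this kind of splitting before training.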
Training Procedure
This model was trained with Direct Preference Optimization (DPO), a method that directly optimizes a language model to align with human preferences without requiring a separate reward model.
DPO works by:
- Taking pairs of "chosen" (preferred) and "rejected" (non-preferred) responses
- Optimizing the model to increase the likelihood of chosen responses relative to rejected ones
- Using a reference model to prevent the policy from deviating too far from the base model
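The resulting objective can be sketched in a few lines. Assuming per-sequence log-probabilities have already been computed under the policy and the frozen reference model (the numbers below are made up for illustration), the per-example DPO loss is the negative log-sigmoid of the β-scaled difference between the chosen and rejected log-ratios:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    # How much the policy favors each response relative to the reference model
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(margin); small when the chosen response is clearly preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative log-probabilities: the policy slightly prefers the chosen response.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
print(loss)
```

The β term controls how far the policy may drift from the reference model: a larger β rewards widening the preference margin more aggressively, while a smaller β keeps the policy closer to the base model's behavior.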
Framework Versions
- TRL: 0.26.2
- Transformers: 4.57.4
- PyTorch: 2.9.1
- Datasets: 4.4.2
- Tokenizers: 0.22.2
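To reproduce the training environment, the versions above can be pinned with pip (assuming the standard PyPI package names; PyTorch may need an index URL matching your CUDA version):

```shell
pip install "trl==0.26.2" "transformers==4.57.4" "torch==2.9.1" \
            "datasets==4.4.2" "tokenizers==0.22.2"
```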
Limitations and Bias
- Language: Primarily trained on English data; performance on other languages is not guaranteed
- Knowledge Cutoff: The model's knowledge is limited to its training data
- Hallucinations: May generate plausible-sounding but incorrect information
- Bias: May reflect biases present in the training data
- Safety: While trained to be more harmless, the model may still generate inappropriate content in adversarial scenarios
Ethical Considerations
This model was trained with the goal of being more helpful and less harmful. However, users should:
- Implement additional safety measures for production use
- Monitor outputs for harmful or biased content
- Not rely on the model for critical decisions without human oversight
Citations
DPO Paper
```bibtex
@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
  year      = 2023,
  booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
  url       = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
  editor    = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
}
```
TRL Library
```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
Anthropic HH-RLHF Dataset
```bibtex
@article{bai2022training,
  title   = {Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback},
  author  = {Bai, Yuntao and others},
  year    = 2022,
  journal = {arXiv preprint arXiv:2204.05862}
}
```