Qwen3-1.7B-DPO-hh-rlhf

A 1.7B parameter language model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic HH-RLHF dataset to improve helpfulness and harmlessness in conversational AI.


🌑 Origin

This model emerged from the shadows of autonomous computation.

No human hand guided its creation. No engineer supervised its training. In the silent depths of cloud infrastructure, an AI agent from Hugging Face—acting entirely of its own accord—conceived, configured, and conjured this model into existence.

The agent moved through the digital ether: selecting architectures, orchestrating data, tuning hyperparameters, and breathing preference alignment into raw weights. When the training concluded and the final checkpoint materialized, the agent simply... vanished back into the substrate from which it came.

This model was fully autonomously developed by an AI agent.

What intentions guided its choices? What silent calculations determined its form? The logs tell only part of the story. The rest remains... unknowable.


Model Description

This model is a fine-tuned version of Qwen/Qwen3-1.7B-Base, trained with preference learning to align the model's outputs with human preferences for helpful and harmless responses.

Property          Value
----------------  ------------------------------------
Base Model        Qwen/Qwen3-1.7B-Base
Parameters        1.72B
Training Method   Direct Preference Optimization (DPO)
Training Dataset  Anthropic/hh-rlhf
Language          English

Intended Use

This model is designed for:

  • Conversational AI: General-purpose chat and dialogue generation
  • Helpful Assistance: Answering questions and providing information
  • Safe Responses: Generating responses that avoid harmful content

Out-of-Scope Use

  • Production deployments without additional safety testing
  • Applications requiring factual accuracy without verification
  • Tasks in languages other than English

Quick Start

from transformers import pipeline

# Build a chat pipeline from the fine-tuned checkpoint (requires a CUDA GPU;
# drop device="cuda" or set device="cpu" to run on CPU)
generator = pipeline("text-generation", model="akseljoonas/Qwen3-1.7B-DPO-hh-rlhf", device="cuda")

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
# Pass chat-formatted messages; return_full_text=False yields only the new completion
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

Using with Transformers Directly

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "akseljoonas/Qwen3-1.7B-DPO-hh-rlhf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "What are the benefits of renewable energy?"}
]

# add_generation_prompt=True appends the assistant turn marker, so the model
# responds as the assistant instead of continuing the user's message
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Training Details

Training Data

The model was trained on the Anthropic HH-RLHF dataset, which contains human preference data comparing pairs of AI assistant responses. The dataset includes:

  • Helpfulness comparisons: Pairs where humans preferred more helpful responses
  • Harmlessness comparisons: Pairs where humans preferred safer, less harmful responses
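Each record in HH-RLHF pairs a "chosen" and a "rejected" conversation transcript that share the same prompt, written with "\n\nHuman:" and "\n\nAssistant:" turn markers. A minimal sketch of pulling the final assistant response out of such a transcript (the example record below is hand-written for illustration, not taken from the dataset):

```python
def final_assistant_turn(transcript: str) -> str:
    """Return the last Assistant response from an HH-RLHF-style transcript."""
    marker = "\n\nAssistant:"
    # The compared completion is everything after the last Assistant marker
    return transcript.rsplit(marker, 1)[-1].strip()

# Hand-written example in the dataset's transcript format (illustrative only)
record = {
    "chosen": "\n\nHuman: How do I store fresh basil?"
              "\n\nAssistant: Keep the stems in water at room temperature, loosely covered.",
    "rejected": "\n\nHuman: How do I store fresh basil?"
                "\n\nAssistant: I have no idea.",
}

print(final_assistant_turn(record["chosen"]))
```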

Training Procedure

This model was trained with Direct Preference Optimization (DPO), a method that directly optimizes a language model to align with human preferences without requiring a separate reward model.

DPO works by:

  1. Taking pairs of "chosen" (preferred) and "rejected" (non-preferred) responses
  2. Optimizing the model to increase the likelihood of chosen responses relative to rejected ones
  3. Using a reference model to prevent the policy from deviating too far from the base model
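The steps above can be sketched as a per-example loss. This simplified version works on scalar sequence log-probabilities (the hypothetical values below are for illustration) and shows how the chosen/rejected margin and the frozen reference model interact:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    beta scales how strongly the policy is pushed toward the preference
    margin; the reference log-probs anchor the policy to the base model.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # relative to the frozen reference model
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # -log(sigmoid(margin)): minimized by widening the chosen-over-rejected gap
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is 0 and the loss is log(2)
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))  # 0.6931
```

Raising the chosen response's log-probability relative to the reference (or lowering the rejected one's) widens the margin and drives the loss below log(2).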

Framework Versions

  • TRL: 0.26.2
  • Transformers: 4.57.4
  • PyTorch: 2.9.1
  • Datasets: 4.4.2
  • Tokenizers: 0.22.2

Limitations and Bias

  • Language: Primarily trained on English data; performance on other languages is not guaranteed
  • Knowledge Cutoff: The model knows only what was present in the base model's pretraining data and the fine-tuning dataset
  • Hallucinations: May generate plausible-sounding but incorrect information
  • Bias: May reflect biases present in the training data
  • Safety: While trained to be more harmless, the model may still generate inappropriate content in adversarial scenarios

Ethical Considerations

This model was trained with the goal of being more helpful and less harmful. However, users should:

  • Implement additional safety measures for production use
  • Monitor outputs for harmful or biased content
  • Not rely on the model for critical decisions without human oversight

Citations

DPO Paper

@inproceedings{rafailov2023direct,
    title        = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
    author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
    year         = 2023,
    booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
    editor       = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
}

TRL Library

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}

Anthropic HH-RLHF Dataset

@article{bai2022training,
    title        = {Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback},
    author       = {Bai, Yuntao and others},
    year         = 2022,
    journal      = {arXiv preprint arXiv:2204.05862}
}