# SevZero GRPO-primary adapter
LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.
## Training recipe

- Initialization: PhaseOfCode/sevzero-llama3-8b-sft-primary
- Base model: unsloth/Meta-Llama-3.1-8B-Instruct
- RL method: GRPO via TRL, against the live SevZero FastAPI/OpenEnv surface
- Steps: 120
- Learning rate: 7e-6
- Group size: 4 generations
- Temperature: 0.85
- Beta (KL coefficient): 0.04
- LR scheduler: cosine
- vLLM: colocate mode, GPU memory utilization 0.55
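The recipe above maps onto TRL's `GRPOTrainer` roughly as follows. This is a hedged sketch of the configuration, not the actual training script: `sevzero_reward` and `train_dataset` are placeholders for the SevZero harness (which scores rollouts against the live FastAPI/OpenEnv surface), and the LoRA settings beyond the adapter itself are assumptions.

```python
# Sketch of the GRPO run described above, using TRL + PEFT.
# NOTE: sevzero_reward and train_dataset are placeholders; the real
# reward comes from the live SevZero environment.
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="sevzero-grpo-primary",
    max_steps=120,                      # Steps: 120
    learning_rate=7e-6,                 # Learning rate: 7e-6
    lr_scheduler_type="cosine",         # Scheduler: cosine
    num_generations=4,                  # Group size: 4 generations
    temperature=0.85,                   # Sampling temperature: 0.85
    beta=0.04,                          # KL coefficient: 0.04
    use_vllm=True,
    vllm_mode="colocate",               # vLLM colocated with training
    vllm_gpu_memory_utilization=0.55,   # GPU memory utilization: 0.55
)

trainer = GRPOTrainer(
    model="PhaseOfCode/sevzero-llama3-8b-sft-primary",  # SFT checkpoint
    args=config,
    reward_funcs=sevzero_reward,    # placeholder: env-backed scorer
    train_dataset=train_dataset,    # placeholder: prompt dataset
    peft_config=LoraConfig(task_type="CAUSAL_LM"),
)
trainer.train()
```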
The training loop showed healthy optimization signals (nonzero reward variance, nonzero gradients, and KL movement), but the held-out eval showed no score lift.
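Reward variance matters here because GRPO's learning signal is the group-relative advantage: each of the 4 generations in a group is scored, and its advantage is its reward minus the group mean, normalized by the group standard deviation. A minimal sketch (illustrative rewards, not SevZero outputs):

```python
# Minimal sketch of GRPO's group-relative advantage computation
# (group size 4, as in the run above).
def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # A group with zero reward variance yields zero advantages,
    # and therefore no policy-gradient signal for that group.
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

print(group_advantages([0.5, 0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0, 0.0]
print(group_advantages([0.0, 1.0, 0.0, 1.0]))  # [-1.0, 1.0, -1.0, 1.0]
```

This is why "nonzero reward variance" is worth reporting: without it, every group would contribute zero gradient.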
## Eval summary
Held-out seeds: 13, 99, 777. Tasks: Easy, Medium, Hard.
| Model | Easy | Medium | Hard | Mean |
|---|---|---|---|---|
| Untrained Llama-3.1-8B-Instruct | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
| GRPO-primary | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
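Assuming the Mean column is the unweighted average over the three difficulty tiers (it matches the reported value to four decimal places), it can be reproduced directly:

```python
# Reproduce the Mean column from the per-tier scores above,
# assuming an unweighted average over Easy/Medium/Hard.
tiers = {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369}
mean = sum(tiers.values()) / len(tiers)
print(round(mean, 4))  # 0.7996
```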
The honest conclusion: 120 GRPO steps were not enough to change deterministic held-out outcomes. SevZero's contribution is the environment, training harness, and reproducible failure surface.
## Links
- Final mirrored adapter: https://huggingface.co/Mist-ic/sevzero-llama3-8b-grpo
- Environment Space: https://huggingface.co/spaces/Mist-ic/sevzero-env
- Blog: https://huggingface.co/spaces/Mist-ic/sevzero-env/blob/main/BLOG.md
- Eval dataset: https://huggingface.co/datasets/Mist-ic/sevzero-eval-results