SevZero GRPO-primary adapter

LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.
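
Since this repository ships only the LoRA adapter, inference requires attaching it to the base model. A minimal sketch using transformers and peft (repo ids are taken from this card; dtype and device placement are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "unsloth/Meta-Llama-3.1-8B-Instruct"
adapter_id = "PhaseOfCode/sevzero-llama3-8b-grpo-primary"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the GRPO-primary LoRA weights
```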

Training recipe

  • Initialization: PhaseOfCode/sevzero-llama3-8b-sft-primary
  • Base model: unsloth/Meta-Llama-3.1-8B-Instruct
  • RL method: GRPO through TRL against the live SevZero FastAPI/OpenEnv surface (see the config sketch after this list)
  • Steps: 120
  • Learning rate: 7e-6
  • Group size: 4 generations
  • Temperature: 0.85
  • Beta: 0.04
  • Scheduler: cosine
  • vLLM: colocate mode, GPU memory utilization 0.55

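As a rough illustration, the settings above might map onto TRL's GRPOConfig/GRPOTrainer as in the sketch below. This is not the actual training script: the sevzero_reward function and the prompt dataset are placeholders, and parameter names assume a recent TRL release (vllm_mode in particular only exists in newer versions).

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: the real function queries the live SevZero
# FastAPI/OpenEnv service and returns one scalar score per completion.
def sevzero_reward(prompts, completions, **kwargs):
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="sevzero-llama3-8b-grpo-primary",
    max_steps=120,                       # Steps: 120
    learning_rate=7e-6,                  # Learning rate: 7e-6
    lr_scheduler_type="cosine",          # Scheduler: cosine
    num_generations=4,                   # Group size: 4 generations
    temperature=0.85,                    # Sampling temperature: 0.85
    beta=0.04,                           # KL coefficient: 0.04
    use_vllm=True,
    vllm_mode="colocate",                # vLLM colocate mode
    vllm_gpu_memory_utilization=0.55,    # GPU memory utilization 0.55
)

trainer = GRPOTrainer(
    model="PhaseOfCode/sevzero-llama3-8b-sft-primary",  # SFT initialization
    reward_funcs=sevzero_reward,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder prompt dataset
)
trainer.train()
```
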
The training loop produced nonzero reward variance, nonzero gradients, and visible KL movement, but the held-out eval showed no score lift.
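
Continuing from the training sketch above, those signals can be inspected from the trainer's log history. The key names here assume recent TRL GRPO logging and should be checked against the installed version:

```python
# Inspect per-step GRPO diagnostics after trainer.train() has run.
for entry in trainer.state.log_history:
    if "reward_std" in entry:  # reward variance within each generation group
        print(entry["step"], entry["reward_std"], entry.get("kl"))
```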

Eval summary

Held-out seeds: 13, 99, 777. Tasks: Easy, Medium, Hard.

| Model | Easy | Medium | Hard | Mean |
|---|---|---|---|---|
| Untrained Llama-3.1-8B-Instruct | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
| GRPO-primary | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
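
The Mean column is the plain average of the three tier scores (each tier score presumably being aggregated over the three held-out seeds); a quick check:

```python
# Per-tier scores for either row of the table above.
tier_scores = {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369}

mean_score = sum(tier_scores.values()) / len(tier_scores)
print(round(mean_score, 4))  # 0.7996
```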

The honest conclusion: 120 GRPO steps were not enough to change deterministic held-out outcomes. SevZero's contribution is the environment, training harness, and reproducible failure surface.
