SevZero GRPO-primary adapter

LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.
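
Since this repository ships only the LoRA adapter, inference requires attaching it to the base model. A minimal sketch using transformers and peft (repo ids are taken from this card; dtype and device placement are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "unsloth/Meta-Llama-3.1-8B-Instruct"
adapter_id = "PhaseOfCode/sevzero-llama3-8b-grpo-primary"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the GRPO-primary LoRA weights
```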

Training recipe

  • Initialization: PhaseOfCode/sevzero-llama3-8b-sft-primary
  • Base model: unsloth/Meta-Llama-3.1-8B-Instruct
  • RL method: GRPO through TRL against the live SevZero FastAPI/OpenEnv surface (see the config sketch after this list)
  • Steps: 120
  • Learning rate: 7e-6
  • Group size: 4 generations
  • Temperature: 0.85
  • Beta: 0.04
  • Scheduler: cosine
  • vLLM: colocate mode, GPU memory utilization 0.55

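As a rough illustration, the settings above might map onto TRL's GRPOConfig/GRPOTrainer as in the sketch below. This is not the actual training script: the sevzero_reward function and the prompt dataset are placeholders, and parameter names assume a recent TRL release (vllm_mode in particular only exists in newer versions).

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: the real function queries the live SevZero
# FastAPI/OpenEnv service and returns one scalar score per completion.
def sevzero_reward(prompts, completions, **kwargs):
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="sevzero-llama3-8b-grpo-primary",
    max_steps=120,                       # Steps: 120
    learning_rate=7e-6,                  # Learning rate: 7e-6
    lr_scheduler_type="cosine",          # Scheduler: cosine
    num_generations=4,                   # Group size: 4 generations
    temperature=0.85,                    # Sampling temperature: 0.85
    beta=0.04,                           # KL coefficient: 0.04
    use_vllm=True,
    vllm_mode="colocate",                # vLLM colocate mode
    vllm_gpu_memory_utilization=0.55,    # GPU memory utilization 0.55
)

trainer = GRPOTrainer(
    model="PhaseOfCode/sevzero-llama3-8b-sft-primary",  # SFT initialization
    reward_funcs=sevzero_reward,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder prompt dataset
)
trainer.train()
```
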
The training loop produced nonzero reward variance, nonzero gradients, and visible KL movement, but the held-out eval showed no score lift.
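
Continuing from the training sketch above, those signals can be inspected from the trainer's log history. The key names here assume recent TRL GRPO logging and should be checked against the installed version:

```python
# Inspect per-step GRPO diagnostics after trainer.train() has run.
for entry in trainer.state.log_history:
    if "reward_std" in entry:  # reward variance within each generation group
        print(entry["step"], entry["reward_std"], entry.get("kl"))
```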

Eval summary

Held-out seeds: 13, 99, 777. Tasks: Easy, Medium, Hard.

| Model | Easy | Medium | Hard | Mean |
|---|---|---|---|---|
| Untrained Llama-3.1-8B-Instruct | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
| GRPO-primary | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
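
The Mean column is the plain average of the three tier scores (each tier score presumably being aggregated over the three held-out seeds); a quick check:

```python
# Per-tier scores for either row of the table above.
tier_scores = {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369}

mean_score = sum(tier_scores.values()) / len(tier_scores)
print(round(mean_score, 4))  # 0.7996
```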

The honest conclusion: 120 GRPO steps were not enough to change deterministic held-out outcomes. SevZero's contribution is the environment, training harness, and reproducible failure surface.
