neo-3-3B-A400M-Thinking
This is the chain-of-thought–tuned version of neo-3-3B-A400M-Base. For a lighter, 8K-context instruction model see neo-3-1B-A90M-Instruct.
The neo-3-3B-A400M-Thinking model is a decoder-only sparse MoE model focused on deliberate reasoning, long-context explanations, and structured intermediate thoughts. It is trained on top of the neo-3-3B-A400M base checkpoint with chain-of-thought–style supervision and instruction data, while keeping an active-parameter profile of ~400M parameters per token.
Core properties:
- 3B total parameters, ~400M active parameters (top-2-of-8 experts per token; see the routing sketch after this list).
- 32K context window for multi-document reasoning, long conversations, and extended worked examples.
- Mixtral-style MoE FFNs with grouped-query attention and RoPE.
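The top-2-of-8 routing listed above is straightforward to sketch in isolation. The snippet below is a minimal, self-contained illustration of Mixtral-style expert selection in PyTorch; the hidden and FFN dimensions are placeholders, not the actual neo-3 configuration.

```python
import torch
import torch.nn.functional as F

# Placeholder dimensions for illustration only; the real neo-3 config may differ.
hidden_size, ffn_size, num_experts, top_k = 1024, 2816, 8, 2

router = torch.nn.Linear(hidden_size, num_experts, bias=False)
experts = torch.nn.ModuleList(
    [
        torch.nn.Sequential(
            torch.nn.Linear(hidden_size, ffn_size),
            torch.nn.SiLU(),
            torch.nn.Linear(ffn_size, hidden_size),
        )
        for _ in range(num_experts)
    ]
)

def moe_ffn(x: torch.Tensor) -> torch.Tensor:
    """Route each token through its top-2 experts and mix the outputs."""
    logits = router(x)                                # (tokens, num_experts)
    weights, idx = torch.topk(logits, top_k, dim=-1)  # keep the 2 highest-scoring experts
    weights = F.softmax(weights, dim=-1)              # renormalize over the selected experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out

tokens = torch.randn(4, hidden_size)
print(moe_ffn(tokens).shape)  # torch.Size([4, 1024])
```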
The model is released under the MIT license and is intended as an open, reasoning-focused model that still fits on a single modern consumer GPU with efficient quantization.
Intended use
- Long-form reasoning: multi-step math word problems, decomposition of complex tasks, detailed planning, and careful pros/cons analyses.
- Explanatory tasks: structured explanations, teaching-style walkthroughs, and guided derivations for technical topics.
- Code and debugging: stepwise reasoning about code, refactor plans, and “explain this function” style tasks.
- Research workflows: summarizing and cross-referencing multiple passages inside the same 32K window (a prompt-packing sketch follows this list).
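As a rough sketch of that research-workflow pattern, the snippet below packs several passages into one prompt and checks the token count against the 32K window before generation. The passages, labels, and answer reserve are placeholder assumptions.

```python
from transformers import AutoTokenizer

model_id = "aquiffoo/neo-3-3B-A400M-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder passages; in practice these would be retrieved documents or paper sections.
passages = {
    "doc_1": "First source text ...",
    "doc_2": "Second source text ...",
    "doc_3": "Third source text ...",
}
instruction = "Compare the documents above and summarize where they agree and disagree."

prompt = "\n\n".join(f"[{name}]\n{text}" for name, text in passages.items())
prompt = f"{prompt}\n\n{instruction}"

# Leave headroom inside the 32K window for the generated answer.
max_context, reserve_for_answer = 32_768, 1_024
n_tokens = len(tokenizer(prompt)["input_ids"])
assert n_tokens <= max_context - reserve_for_answer, f"Prompt too long: {n_tokens} tokens"
```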
The model is not designed for:
- High-stakes uses (medical, legal, financial, safety-critical decisions).
- Exact formal proofs or symbolic math without external verification.
- Real-time, ultra-low-latency applications where a smaller model is preferable.
Evaluations
Below are performance figures for the neo-3-3B-A400M-Thinking model compared with related base and instruct models in the neo-3 family and nearby open models.
Reasoning and instruction performance
| Model | MMLU | HellaSwag | PIQA | ARC avg | GSM8K | BBH | IFEval |
|---|---|---|---|---|---|---|---|
| neo-3-3B-A400M-Thinking | 54.1 | 65.7 | 76.7 | 56.0 | 16.5 | 41.2 | 54.1 |
| Qwen3-0.6B-Thinking | 44.9 | 37.5 | 66.5 | 46.0 | 36.5 | 30.7 | 64.2 |
| Qwen3-1.7B-Thinking | 59.1 | 48.1 | 67.2 | 49.8 | 51.4 | 48.6 | 70.9 |
Tool-calling style performance (TinyTask)
TinyTask is a benchmark that evaluates a model's ability to generate structured outputs, which we use as a proxy for tool-calling performance. Our subset of TinyTask contains 300 rows: 150 travel problems and 150 math problems. We verified that TinyTask outputs were not present in the training data of any of our models.
| Model | TinyTask Accuracy |
|---|---|
| neo-3-3B-A400M-Thinking | 45.3 |
| LFM2.5 1.2B Instruct | 40.0 |
| Gemma 3 IT 1B | 37.0 |
| neo-3-1B-A90M-Instruct | 30.0 |
| Qwen3-1.7B-Thinking | 27.5 |
| MainCoder-1B | 22.0 |
| Qwen3-0.6B-Thinking | 10.0 |
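For context on how a structured-output benchmark of this kind is typically scored, the sketch below checks a completion against an expected JSON tool call. The schema, example row, and exact-match rule are illustrative assumptions, not the actual TinyTask format.

```python
import json

def score_structured_output(completion: str, expected: dict) -> bool:
    """Return True if the completion contains JSON that exactly matches the expected call."""
    try:
        # Keep only the outermost {...} span in case the model adds surrounding prose.
        start, end = completion.index("{"), completion.rindex("}") + 1
        predicted = json.loads(completion[start:end])
    except ValueError:  # covers missing braces and JSON decode errors
        return False
    return predicted == expected

# Hypothetical travel-planning row in a tool-calling style.
expected_call = {
    "tool": "book_train",
    "arguments": {"from": "Lisbon", "to": "Porto", "departure": "14:20"},
}
completion = 'Sure: {"tool": "book_train", "arguments": {"from": "Lisbon", "to": "Porto", "departure": "14:20"}}'
print(score_structured_output(completion, expected_call))  # True
```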
Behavior in practice
- Produces more structured, multi-step answers than neo-3-1B-A90M-Instruct on complex tasks, especially when explicitly prompted to “think step by step”.
- Closes much of the gap to larger 3B–7B reasoning models on MMLU, ARC, GSM8K, and BBH while remaining small enough for single-GPU and Colab Pro–class setups.
- Can be steered between concise answers and detailed chains-of-thought by adjusting prompts and decoding settings (temperature, max_new_tokens); see the decoding comparison at the end of the Inference example below.
Usage
Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aquiffoo/neo-3-3B-A400M-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

system_prompt = (
    "You are a careful reasoning assistant. "
    "Always think step by step before answering."
)
question = "A train leaves at 14:20, travels 120 km at 80 km/h. When does it arrive?"
prompt = f"{system_prompt}\n\nQuestion: {question}\n\nAnswer with your reasoning, then the final answer."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # temperature and top_p only take effect when sampling is enabled
    temperature=0.4,
    top_p=0.9,
)

# generate() returns a batch of sequences; decode the first one
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
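As noted under "Behavior in practice", the same prompt can be pushed toward a short answer or a longer thinking trace largely through decoding settings. The values below are illustrative starting points rather than tuned recommendations, and they reuse `model`, `tokenizer`, and `inputs` from the script above.

```python
# Concise answer: small token budget, greedy decoding.
concise = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Detailed chain-of-thought: larger budget, mild sampling.
detailed = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

print(tokenizer.decode(concise[0], skip_special_tokens=True))
print(tokenizer.decode(detailed[0], skip_special_tokens=True))
```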
Prompting for “thinking” traces
The thinking model works as a regular chat model, but you can explicitly separate reasoning from the final answer:
```text
You are a deliberate assistant. First think through the problem in detail inside <scratchpad> tags, then give a short final answer outside the tags.

Problem: ...

<scratchpad>
```
This pattern makes it easier to log, analyze, or filter intermediate reasoning without exposing it directly to end users.
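A minimal way to separate the two parts on the application side, assuming the model follows the `<scratchpad>` convention above and closes the tag itself:

```python
import re

def split_scratchpad(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, final_answer) using <scratchpad> tags."""
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", text, flags=re.DOTALL)
    if match is None:
        # Model skipped the tags; treat the whole completion as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    final_answer = text[match.end():].strip()
    return reasoning, final_answer

reasoning, answer = split_scratchpad(
    "<scratchpad>120 km at 80 km/h takes 1.5 h, so 14:20 + 1:30 = 15:50.</scratchpad> "
    "The train arrives at 15:50."
)
print(answer)  # The train arrives at 15:50.
```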
Training and data overview
- Base model: neo-3-3B-A400M-Base trained on a mixture of Wikipedia, synthetic web-scale corpora, code (The Stack, GitHub), math, and dialogue sources.
- Post-training:
  - Supervised fine-tuning on instruction and dialogue datasets, with emphasis on tasks that benefit from multi-step reasoning (math word problems, logic puzzles, explanation-heavy prompts).
  - Chain-of-thought–style supervision for selected problems, paired with concise final-answer formats.
- Tokenization and positions: SentencePiece/BPE tokenizer with a 32k vocabulary and RoPE positional encoding, shared across the neo-3 family.
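A quick sanity check of the shared tokenizer from Python, assuming both checkpoints ship the same tokenizer files:

```python
from transformers import AutoTokenizer

tok_thinking = AutoTokenizer.from_pretrained("aquiffoo/neo-3-3B-A400M-Thinking")
tok_base = AutoTokenizer.from_pretrained("aquiffoo/neo-3-3B-A400M-Base")

print(tok_thinking.vocab_size)                           # expected to be ~32k for the neo-3 family
print(tok_thinking.get_vocab() == tok_base.get_vocab())  # True if the vocabularies match
```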
Limitations and risks
- Despite stronger reasoning, it still makes mistakes and can hallucinate plausible but incorrect steps or facts.
- Longer chains-of-thought increase latency and computational cost per query compared with 1B-scale models.
- Outputs may contain biases inherited from pretraining and post-training data and are not suitable for sensitive or high-stakes domains without additional filtering and oversight.
- Users should independently verify important answers and avoid delegating critical decisions directly to the model.