neo-3-3B-A400M-Thinking
This is the chain-of-thought–tuned version of neo-3-3B-A400M-Base. For a lighter, 8K-context instruction model see neo-3-1B-A90M-Instruct.
The neo-3-3B-A400M-Thinking model is a decoder-only sparse MoE model focused on deliberate reasoning, long-context explanations, and structured intermediate thoughts. It is trained on top of the neo-3-3B-A400M base checkpoint with chain-of-thought–style supervision and instruction data, while keeping an active-parameter profile of ~400M parameters per token.
Core properties:
- 3B total parameters, ~400M active parameters (top-2-of-8 experts per token; see the routing sketch after this list).
- 32K context window for multi-document reasoning, long conversations, and extended worked examples.
- Mixtral-style MoE FFNs with grouped-query attention and RoPE.
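The top-2-of-8 routing listed above is straightforward to sketch in isolation. The snippet below is a minimal, self-contained illustration of Mixtral-style expert selection in PyTorch; the hidden and FFN dimensions are placeholders, not the actual neo-3 configuration.

```python
import torch
import torch.nn.functional as F

# Placeholder dimensions for illustration only; the real neo-3 config may differ.
hidden_size, ffn_size, num_experts, top_k = 1024, 2816, 8, 2

router = torch.nn.Linear(hidden_size, num_experts, bias=False)
experts = torch.nn.ModuleList(
    [
        torch.nn.Sequential(
            torch.nn.Linear(hidden_size, ffn_size),
            torch.nn.SiLU(),
            torch.nn.Linear(ffn_size, hidden_size),
        )
        for _ in range(num_experts)
    ]
)

def moe_ffn(x: torch.Tensor) -> torch.Tensor:
    """Route each token through its top-2 experts and mix the outputs."""
    logits = router(x)                                # (tokens, num_experts)
    weights, idx = torch.topk(logits, top_k, dim=-1)  # keep the 2 highest-scoring experts
    weights = F.softmax(weights, dim=-1)              # renormalize over the selected experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out

tokens = torch.randn(4, hidden_size)
print(moe_ffn(tokens).shape)  # torch.Size([4, 1024])
```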
The model is released under the MIT license and is intended as an open, reasoning-focused model that still fits on a single modern consumer GPU with efficient quantization.
Intended use
- Long-form reasoning: multi-step math word problems, decomposition of complex tasks, detailed planning, and careful pros/cons analyses.
- Explanatory tasks: structured explanations, teaching-style walkthroughs, and guided derivations for technical topics.
- Code and debugging: stepwise reasoning about code, refactor plans, and “explain this function” style tasks.
- Research workflows: summarizing and cross-referencing multiple passages inside the same 32K window (a prompt-packing sketch follows this list).
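As a rough sketch of that research-workflow pattern, the snippet below packs several passages into one prompt and checks the token count against the 32K window before generation. The passages, labels, and answer reserve are placeholder assumptions.

```python
from transformers import AutoTokenizer

model_id = "aquiffoo/neo-3-3B-A400M-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder passages; in practice these would be retrieved documents or paper sections.
passages = {
    "doc_1": "First source text ...",
    "doc_2": "Second source text ...",
    "doc_3": "Third source text ...",
}
instruction = "Compare the documents above and summarize where they agree and disagree."

prompt = "\n\n".join(f"[{name}]\n{text}" for name, text in passages.items())
prompt = f"{prompt}\n\n{instruction}"

# Leave headroom inside the 32K window for the generated answer.
max_context, reserve_for_answer = 32_768, 1_024
n_tokens = len(tokenizer(prompt)["input_ids"])
assert n_tokens <= max_context - reserve_for_answer, f"Prompt too long: {n_tokens} tokens"
```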
The model is not designed for:
- High-stakes uses (medical, legal, financial, safety-critical decisions).
- Exact formal proofs or symbolic math without external verification.
- Real-time, ultra-low-latency applications where a smaller model is preferable.
Evaluations
Below are performance figures for the neo-3-3B-A400M-Thinking model compared with related base and instruct models in the neo-3 family and nearby open models.
Reasoning and instruction performance
| Model | MMLU | HellaSwag | PIQA | ARC avg | GSM8K | BBH | IFEval |
|---|---|---|---|---|---|---|---|
| neo-3-3B-A400M-Thinking | 54.1 | 65.7 | 76.7 | 56.0 | 16.5 | 41.2 | 54.1 |
| Qwen3-0.6B-Thinking | 44.9 | 37.5 | 66.5 | 46.0 | 36.5 | 30.7 | 64.2 |
| Qwen3-1.7B-Thinking | 59.1 | 48.1 | 67.2 | 49.8 | 51.4 | 48.6 | 70.9 |
Tool-calling style performance (TinyTask)
TinyTask is a benchmark that evaluates a model's ability to generate structured outputs, which we use as a proxy for tool-calling performance. Our subset of TinyTask contains 300 rows: 150 travel problems and 150 math problems. We verified that TinyTask outputs were not present in the training data of any of our models.
| Model | TinyTask Accuracy |
|---|---|
| neo-3-3B-A400M-Thinking | 45.3 |
| LFM2.5 1.2B Instruct | 40.0 |
| Gemma 3 IT 1B | 37.0 |
| neo-3-1B-A90M-Instruct | 30.0 |
| Qwen3-1.7B-Thinking | 27.5 |
| MainCoder-1B | 22.0 |
| Qwen3-0.6B-Thinking | 10.0 |
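For context on how a structured-output benchmark of this kind is typically scored, the sketch below checks a completion against an expected JSON tool call. The schema, example row, and exact-match rule are illustrative assumptions, not the actual TinyTask format.

```python
import json

def score_structured_output(completion: str, expected: dict) -> bool:
    """Return True if the completion contains JSON that exactly matches the expected call."""
    try:
        # Keep only the outermost {...} span in case the model adds surrounding prose.
        start, end = completion.index("{"), completion.rindex("}") + 1
        predicted = json.loads(completion[start:end])
    except ValueError:  # covers missing braces and JSON decode errors
        return False
    return predicted == expected

# Hypothetical travel-planning row in a tool-calling style.
expected_call = {
    "tool": "book_train",
    "arguments": {"from": "Lisbon", "to": "Porto", "departure": "14:20"},
}
completion = 'Sure: {"tool": "book_train", "arguments": {"from": "Lisbon", "to": "Porto", "departure": "14:20"}}'
print(score_structured_output(completion, expected_call))  # True
```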
Behavior in practice
- Produces more structured, multi-step answers than neo-3-1B-A90M-Instruct on complex tasks, especially when explicitly prompted to “think step by step”.
- Closes much of the gap to larger 3B–7B reasoning models on MMLU, ARC, GSM8K, and BBH while remaining small enough for single-GPU and Colab Pro–class setups.
- Can be steered between concise answers and detailed chains-of-thought by adjusting prompts and decoding settings (temperature, max_new_tokens); see the decoding comparison at the end of the Inference example below.
Usage
Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aquiffoo/neo-3-3B-A400M-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

system_prompt = (
    "You are a careful reasoning assistant. "
    "Always think step by step before answering."
)
question = "A train leaves at 14:20, travels 120 km at 80 km/h. When does it arrive?"
prompt = f"{system_prompt}\n\nQuestion: {question}\n\nAnswer with your reasoning, then the final answer."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # temperature and top_p only take effect when sampling is enabled
    temperature=0.4,
    top_p=0.9,
)

# generate() returns a batch of sequences; decode the first one
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
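As noted under "Behavior in practice", the same prompt can be pushed toward a short answer or a longer thinking trace largely through decoding settings. The values below are illustrative starting points rather than tuned recommendations, and they reuse `model`, `tokenizer`, and `inputs` from the script above.

```python
# Concise answer: small token budget, greedy decoding.
concise = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Detailed chain-of-thought: larger budget, mild sampling.
detailed = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

print(tokenizer.decode(concise[0], skip_special_tokens=True))
print(tokenizer.decode(detailed[0], skip_special_tokens=True))
```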
Prompting for “thinking” traces
The thinking model works as a regular chat model, but you can explicitly separate reasoning from the final answer:
```text
You are a deliberate assistant. First think through the problem in detail inside <scratchpad> tags, then give a short final answer outside the tags.

Problem: ...

<scratchpad>
```
This pattern makes it easier to log, analyze, or filter intermediate reasoning without exposing it directly to end users.
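A minimal way to separate the two parts on the application side, assuming the model follows the `<scratchpad>` convention above and closes the tag itself:

```python
import re

def split_scratchpad(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, final_answer) using <scratchpad> tags."""
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", text, flags=re.DOTALL)
    if match is None:
        # Model skipped the tags; treat the whole completion as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    final_answer = text[match.end():].strip()
    return reasoning, final_answer

reasoning, answer = split_scratchpad(
    "<scratchpad>120 km at 80 km/h takes 1.5 h, so 14:20 + 1:30 = 15:50.</scratchpad> "
    "The train arrives at 15:50."
)
print(answer)  # The train arrives at 15:50.
```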
Training and data overview
- Base model: neo-3-3B-A400M-Base trained on a mixture of Wikipedia, synthetic web-scale corpora, code (The Stack, GitHub), math, and dialogue sources.
- Post-training:
  - Supervised fine-tuning on instruction and dialogue datasets, with emphasis on tasks that benefit from multi-step reasoning (math word problems, logic puzzles, explanation-heavy prompts).
  - Chain-of-thought–style supervision for selected problems, paired with concise final-answer formats.
- Tokenization and positions: SentencePiece/BPE tokenizer with a 32k vocabulary and RoPE positional encoding, shared across the neo-3 family.
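A quick sanity check of the shared tokenizer from Python, assuming both checkpoints ship the same tokenizer files:

```python
from transformers import AutoTokenizer

tok_thinking = AutoTokenizer.from_pretrained("aquiffoo/neo-3-3B-A400M-Thinking")
tok_base = AutoTokenizer.from_pretrained("aquiffoo/neo-3-3B-A400M-Base")

print(tok_thinking.vocab_size)                           # expected to be ~32k for the neo-3 family
print(tok_thinking.get_vocab() == tok_base.get_vocab())  # True if the vocabularies match
```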
Limitations and risks
- Despite stronger reasoning, it still makes mistakes and can hallucinate plausible but incorrect steps or facts.
- Longer chains-of-thought increase latency and computational cost per query compared with 1B-scale models.
- Outputs may contain biases inherited from pretraining and post-training data and are not suitable for sensitive or high-stakes domains without additional filtering and oversight.
- Users should independently verify important answers and avoid delegating critical decisions directly to the model.