neo-3-3B-A400M-Thinking

This is the chain-of-thought–tuned version of neo-3-3B-A400M-Base. For a lighter, 8K-context instruction model, see neo-3-1B-A90M-Instruct.

The neo-3-3B-A400M-Thinking model is a decoder-only sparse MoE model focused on deliberate reasoning, long-context explanations, and structured intermediate thoughts. It is trained on top of the neo-3-3B-A400M base checkpoint with chain-of-thought–style supervision and instruction data, while keeping an active-parameter profile of ~400M parameters per token.

Core properties:

  • 3B total parameters, ~400M active parameters per token (top-2-of-8 expert routing; a routing sketch follows this list).
  • 32K context window for multi-document reasoning, long conversations, and extended worked examples.
  • Mixtral-style MoE FFNs with grouped-query attention and RoPE.
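
As a rough illustration of how top-2-of-8 routing keeps the active parameter count low, here is a minimal sketch of a Mixtral-style sparse MoE feed-forward layer in PyTorch. The dimensions and module structure are illustrative, not the model's actual implementation:

import torch
import torch.nn.functional as F
from torch import nn

class SparseMoEFFN(nn.Module):
    """Illustrative top-2-of-8 MoE FFN; sizes are made up for the sketch."""
    def __init__(self, d_model=1024, d_ff=2816, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):             # only 2 of 8 experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

Because only two experts run per token, the per-token compute tracks the ~400M active parameters rather than the 3B total.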

The model is released under the MIT license and is intended as an open, reasoning-focused model that still fits on a single modern consumer GPU with efficient quantization.
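
As an example of such a setup, the checkpoint can be loaded in 4-bit through transformers (a minimal sketch; it assumes a CUDA GPU and that the bitsandbytes package is installed):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization keeps the full 3B-parameter checkpoint within
# a consumer GPU's memory budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

model = AutoModelForCausalLM.from_pretrained(
    "aquiffoo/neo-3-3B-A400M-Thinking",
    quantization_config=bnb_config,
    device_map="auto",
)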

Intended use

  • Long-form reasoning: multi-step math word problems, decomposition of complex tasks, detailed planning, and careful pros/cons analyses.
  • Explanatory tasks: structured explanations, teaching-style walkthroughs, and guided derivations for technical topics.
  • Code and debugging: stepwise reasoning about code, refactor plans, and “explain this function” style tasks.
  • Research workflows: summarizing and cross-referencing multiple passages inside the same 32K window.

The model is not designed for:

  • High-stakes uses (medical, legal, financial, safety-critical decisions).
  • Exact formal proofs or symbolic math without external verification.
  • Real-time, ultra-low-latency applications where a smaller model is preferable.

Evaluations

Below are performance figures for the neo-3-3B-A400M-Thinking model compared with related base and instruct models in the neo-3 family and nearby open models.

Reasoning and instruction performance

Model                     MMLU   HellaSwag   PIQA   ARC avg   GSM8K   BBH    IFEval
neo-3-3B-A400M-Thinking   54.1   65.7        76.7   56.0      16.5    41.2   54.1
Qwen3-0.6B-Thinking       44.9   37.5        66.5   46.0      36.5    30.7   64.2
Qwen3-1.7B-Thinking       59.1   48.1        67.2   49.8      51.4    48.6   70.9

Tool-calling style performance (TinyTask)

TinyTask is a benchmark that evaluates a model's ability to generate structured outputs, which makes it a useful proxy for tool-calling performance. Our subset of TinyTask contains 300 rows: 150 travel problems and 150 math problems. We verified that TinyTask outputs do not appear in any of our models' training data. A minimal scoring sketch follows the results table.

Model                     TinyTask accuracy (%)
neo-3-3B-A400M-Thinking   45.3
LFM2.5 1.2B Instruct      40.0
Gemma 3 IT 1B             37.0
neo-3-1B-A90M-Instruct    30.0
Qwen3-1.7B-Thinking       27.5
MainCoder-1B              22.0
Qwen3-0.6B-Thinking       10.0
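
Accuracy is a simple per-row score over the 300-example subset. A minimal sketch of how such a structured-output check can be computed (the rows format, JSON targets, and normalization are hypothetical, not the benchmark's exact harness):

import json

def tinytask_accuracy(rows, generate):
    """rows: list of {"prompt": str, "target": dict}; generate: model callable."""
    correct = 0
    for row in rows:
        raw = generate(row["prompt"])
        try:
            pred = json.loads(raw)        # structured output must parse
        except json.JSONDecodeError:
            continue                      # unparseable output counts as wrong
        if pred == row["target"]:
            correct += 1
    return 100.0 * correct / len(rows)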

Behavior in practice

  • Produces more structured, multi-step answers than neo-3-1B-A90M-Instruct on complex tasks, especially when explicitly prompted to “think step by step”.
  • Closes much of the gap to larger 3B–7B reasoning models on MMLU, ARC, and BBH while remaining small enough for single-GPU and Colab Pro–class setups; GSM8K remains a clear weak spot (see the table above).
  • Can be steered between concise answers and detailed chains-of-thought by adjusting prompts and decoding settings such as temperature and max_new_tokens (see the presets sketched below).
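
As an illustration, two decoding presets that lean concise or deliberate; the values are starting points, not tuned recommendations. Pass one to model.generate via generation_config:

from transformers import GenerationConfig

# Short, direct answers: low temperature and a tight token budget.
concise = GenerationConfig(
    do_sample=True, temperature=0.2, top_p=0.9, max_new_tokens=128
)

# Detailed chains-of-thought: more sampling freedom and room to think.
deliberate = GenerationConfig(
    do_sample=True, temperature=0.7, top_p=0.95, max_new_tokens=1024
)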

Usage

Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aquiffoo/neo-3-3B-A400M-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

system_prompt = (
    "You are a careful reasoning assistant. "
    "Always think step by step before answering."
)
question = "A train leaves at 14:20, travels 120 km at 80 km/h. When does it arrive?"
prompt = f"{system_prompt}\n\nQuestion: {question}\n\nAnswer with your reasoning, then the final answer."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # sampling must be enabled for temperature/top_p to take effect
    temperature=0.4,
    top_p=0.9
)
print(tokenizer.decode(output[0], skip_special_tokens=True))  # decode the first (only) sequence
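
If the tokenizer ships a chat template, the same exchange can be passed as messages instead of a hand-built prompt (a sketch that assumes the checkpoint defines a chat template; otherwise use the raw prompt above):

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant turn marker
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.4, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))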

Prompting for “thinking” traces

The thinking model works as a regular chat model, but you can explicitly separate reasoning from the final answer:

You are a deliberate assistant. First think through the problem in detail inside <scratchpad> tags, then give a short final answer outside the tags.

Problem: ...

<scratchpad>

This pattern makes it easier to log, analyze, or filter intermediate reasoning without exposing it directly to end users.
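
For example, a minimal sketch of splitting the trace from the final answer (the tag name matches the prompt pattern above; the parsing is illustrative):

import re

def split_scratchpad(text):
    """Separate <scratchpad>...</scratchpad> reasoning from the final answer."""
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()           # no trace found; treat everything as the answer
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()     # whatever follows the closing tag
    return reasoning, answer

reasoning, answer = split_scratchpad(
    "<scratchpad>120 km / 80 km/h = 1.5 h; 14:20 + 1:30 = 15:50.</scratchpad> The train arrives at 15:50."
)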

Training and data overview

  • Base model: neo-3-3B-A400M-Base trained on a mixture of Wikipedia, synthetic web-scale corpora, code (The Stack, GitHub), math, and dialogue sources.
  • Post-training:
    • Supervised fine-tuning on instruction and dialogue datasets, with emphasis on tasks that benefit from multi-step reasoning (math word problems, logic puzzles, explanation-heavy prompts).
    • Chain-of-thought–style supervision for selected problems, paired with concise final-answer formats.
  • Tokenization: SentencePiece/BPE with a 32k vocabulary, shared across the neo-3 family; positions are encoded with RoPE.

Limitations and risks

  • Despite stronger reasoning, it still makes mistakes and can hallucinate plausible but incorrect steps or facts.
  • Longer chains-of-thought increase latency and computational cost per query compared with 1B-scale models.
  • Outputs may contain biases inherited from pretraining and post-training data and are not suitable for sensitive or high-stakes domains without additional filtering and oversight.
  • Users should independently verify important answers and avoid delegating critical decisions directly to the model.
