MiniMax-M2.7-NVFP4

A high-calibration-quality NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 for NVIDIA Blackwell GPUs.

5,000 calibration samples across 5 diverse datasets. Pre-calibrated FP8 affine KV cache. 100% expert coverage. Attention preserved in BF16. Seeded random sampling. Fully reproducible recipe. Validated at 90 tok/s single-user, 631 tok/s aggregate at N=64 on 2× RTX PRO 6000 Blackwell.

Model Description

MiniMax-M2.7-NVFP4 is an expert-only NVFP4-quantized version of MiniMax-M2.7, a 230B-parameter Mixture-of-Experts language model with 10B active parameters and 256 experts (top-8 routing).

The original FP8 checkpoint was loaded, dequantized to BF16 during calibration, then quantized to NVFP4 (4-bit with blockwise FP8 E4M3 scales per 16 elements) using NVIDIA Model Optimizer (v0.43.0).

What's quantized

Only the MoE expert MLP layers (gate, up, and down projections across all 256 experts) are quantized to NVFP4. Everything else stays in BF16:

  • ✅ Expert MLPs (256 × 3 projections × 62 layers) → NVFP4
  • ❌ Self-attention (Q/K/V/O projections) → BF16 (preserved for coherence and instruction-following quality)
  • ❌ Router/gate weights → BF16
  • ❌ Layer norms → BF16
  • ❌ Embedding / LM head → BF16

This follows NVIDIA's recommendation for MoE models and aligns with research (Egiazarian et al., 2025) demonstrating that attention layers are quality-sensitive under FP4 quantization.

Pre-calibrated FP8 KV Cache

This checkpoint includes pre-calibrated FP8 affine KV cache scales (k_scale and v_scale tensors). This means:

  • Serving frameworks (SGLang, vLLM) can use FP8 KV cache out of the box
  • FP8 KV cache halves memory vs BF16 KV, effectively doubling your context budget
  • No runtime KV scale computation needed — scales were computed during calibration
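The affine KV-cache scheme above can be sketched numerically. This is an illustrative reconstruction, not ModelOpt's implementation: `calibrate_kv_scale` and the E4M3 range constant are assumptions for the sketch.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def calibrate_kv_scale(activations: np.ndarray) -> float:
    """Max calibration: pick the scale that maps the observed |max| of the
    K (or V) activations onto the top of the E4M3 range."""
    return float(np.abs(activations).max()) / E4M3_MAX

rng = np.random.default_rng(0)
k_acts = rng.normal(0.0, 2.0, size=(1024, 128)).astype(np.float32)  # stand-in K activations

k_scale = calibrate_kv_scale(k_acts)
scaled = k_acts / k_scale          # everything now fits within [-448, 448]

# Memory arithmetic behind the "doubled context budget" claim:
# FP8 stores 1 byte per element vs 2 bytes for BF16 at the same pool size.
tokens_bf16, tokens_fp8 = 99_000, 2 * 99_000
```

At serving time the stored `k_scale`/`v_scale` tensors let the framework skip exactly this calibration step.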

Motivation and Design Rationale

When quantising a 256-expert MoE model, several properties of the standard quantisation workflow raised concerns that I wanted to address empirically:

1. Calibration sample diversity

Hypothesis: With top-8/256 routing, the majority of experts activate only for specific token distributions. A small, domain-narrow calibration set (e.g., 512 samples of CNN news) may leave many experts with poorly representative activation statistics, resulting in suboptimal FP4 scale factors.

Approach: I calibrated with 5,000 samples drawn from 5 datasets spanning competitive coding, mathematical reasoning, multi-turn instruction following, STEM/chat, and function calling/tool use. At ~20.5M tokens with top-8 routing, each expert sees approximately 640K tokens on average — well past the stability threshold for max calibration scale estimation (see Calibration Sample Count Analysis below).
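The ~640K figure follows from simple routing arithmetic, using the card's own numbers:

```python
samples, seq_len = 5_000, 4_096     # calibration set from the recipe below
top_k, n_experts = 8, 256           # M2.7 routing

total_tokens = samples * seq_len                  # 20,480,000 (~20.5M) tokens
expert_events = total_tokens * top_k              # each token activates 8 experts
avg_tokens_per_expert = expert_events // n_experts

print(total_tokens, avg_tokens_per_expert)        # 20480000 640000
```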

2. Sequential vs. random sampling

Hypothesis: NVIDIA's default hf_ptq.py takes the first N samples sequentially from each dataset. For datasets sorted by source, difficulty, or topic, this could bias calibration toward a narrow sub-population (e.g., only AIZU problems from OpenCodeReasoning).

Approach: I patched the dataset sampling to use dataset.shuffle(seed=42, buffer_size=10000), drawing samples randomly from a 10K-entry buffer while maintaining full reproducibility via the fixed seed.
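The behaviour of `shuffle(seed=…, buffer_size=…)` on a streaming dataset can be approximated in pure Python. The sketch below mimics the buffer-and-replace strategy; `buffered_shuffle` is an illustrative stand-in, not the `datasets` library implementation:

```python
import random
from typing import Iterable, Iterator

def buffered_shuffle(stream: Iterable, buffer_size: int, seed: int) -> Iterator:
    """Fill a fixed-size buffer, then repeatedly emit a random slot and
    refill it from the stream; flush the buffer at the end. A fixed seed
    makes the emitted order fully reproducible."""
    rng = random.Random(seed)
    it = iter(stream)
    buf = []
    for item in it:
        buf.append(item)
        if len(buf) == buffer_size:
            break
    for item in it:
        idx = rng.randrange(len(buf))
        yield buf[idx]       # emit a random buffered sample...
        buf[idx] = item      # ...and replace it with the next stream item
    rng.shuffle(buf)
    yield from buf

first_run = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
second_run = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
```

Two runs with the same seed produce identical sample orders, which is what keeps the patched calibration reproducible.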

3. Expert calibration completeness

Hypothesis: Under natural routing with limited calibration data, rarely-activated experts may not accumulate sufficient activation statistics for accurate scale computation. The moe_calib_experts_ratio parameter controls whether all experts participate in calibration.

Approach: I set moe_calib_experts_ratio=1.0 to ensure all 256 experts are included in FP4 scale computation, regardless of activation frequency during calibration.

4. Attention layer preservation

Hypothesis: Quantising attention layers to FP4 risks degrading coherence, instruction following, and long-range dependency modelling — especially given research showing Hadamard rotation is detrimental for NVFP4 at block size 16 (Egiazarian et al., 2025).

Approach: I used nvfp4_experts_only, which preserves all attention layers in BF16 while quantising only expert MLPs. This mirrors NVIDIA's own strategy for their DeepSeek-R1-NVFP4 checkpoint.

5. Agentic/tool-calling coverage

Hypothesis: M2.7 is designed for agentic tool-calling workflows. Calibrating without function-calling-formatted data means the experts responsible for tool-use token patterns may receive unrepresentative activation statistics.

Approach: I included nvidia/Llama-Nemotron-Post-Training-Dataset which contains function calling, tool use, and reasoning on/off mode switching data — directly relevant to M2.7's primary use case.

Calibration Dataset Composition

| Dataset | Samples | Domain | Chat template |
|---|---|---|---|
| nvidia/OpenCodeReasoning | 1,000 | Competitive coding, reasoning chains | ✅ |
| nvidia/OpenMathReasoning | 1,000 | Mathematical reasoning | ✅ |
| Magpie-Align/Magpie-Pro-MT-300K-v0.1 | 1,000 | Multi-turn instruction following | ✅ |
| nvidia/Nemotron-Post-Training-Dataset-v2 | 1,000 | STEM, chat, math, code | ✅ |
| nvidia/Llama-Nemotron-Post-Training-Dataset | 1,000 | Function calling, tool use, reasoning on/off | ✅ |

All 5 datasets use the messages format, meaning the tokenizer's apply_chat_template is automatically invoked during preprocessing. This ensures calibration activations match real inference patterns.

Summary of Design Choices vs. Standard Defaults

The table below summarises where this quantisation departs from the typical ModelOpt PTQ defaults and why. The standard defaults are sensible for quick iteration; the choices here prioritise calibration thoroughness for a one-time quantisation intended for long-term use.

| Dimension | Standard default | This release | Rationale |
|---|---|---|---|
| Calibration samples | 128–512 | 5,000 | 10× standard; well past diminishing-returns threshold (see analysis below) |
| Calibration datasets | 1–2 generic (e.g., CNN/DailyMail) | 5 domain-specific | Covers code, math, instruction, STEM, and tool use |
| Sampling strategy | Sequential (first N entries) | Seeded random (seed=42, buffer=10K) | Avoids sub-population bias in sorted datasets |
| Expert coverage | Natural routing only | 100% (moe_calib_experts_ratio=1.0) | Guarantees rarely-activated experts receive calibration |
| KV cache | Uncalibrated (BF16 at runtime) | FP8 affine (pre-calibrated scales) | Halves KV memory; doubles effective context budget |
| Attention precision | Varies by recipe | BF16 (preserved) | Protects coherence and instruction-following quality |
| Sequence length | ~512 | 4096 | Captures longer-range activation distributions |
| Reproducibility | Partial or undocumented | Full recipe, patches, and seed published | Enables independent verification |

Hardware Requirements

| Configuration | VRAM (weights) | KV cache tokens | Context | Notes |
|---|---|---|---|---|
| 2× RTX PRO 6000 Blackwell (96GB each) | ~70GB/GPU | ~198K (FP8 KV) | 131K | Tested and verified |
| 2× RTX PRO 6000 Blackwell (BF16 KV) | ~70GB/GPU | ~99K (BF16 KV) | 65K | Conservative, no FP8 KV |
| 4× RTX 5090 (32GB each) | ~33GB/GPU | ~2K | ~2K | Tight, short context only |
| 2× B200 (192GB each) | ~65GB/GPU | Very large | Very large | Datacenter config |

NVFP4 requires NVIDIA Blackwell GPUs (SM100/SM120). This model will not run on Hopper (H100/H200), Ada (RTX 4090), or older architectures.

How to Run

SGLang (recommended)

Tested and verified on 2× RTX PRO 6000 Blackwell (96GB GDDR7 each). The configuration below was used for all quality and throughput testing.

```bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=PHB
export SGLANG_DISABLE_CUDNN_CHECK=1

python -m sglang.launch_server \
    --model-path <path-or-repo>/MiniMax-M2.7-NVFP4 \
    --served-model-name minimax-m2.7 \
    --trust-remote-code \
    --tp 2 --ep 2 \
    --quantization modelopt_fp4 \
    --mem-fraction-static 0.90 \
    --context-length 131072 \
    --max-running-requests 16 \
    --chunked-prefill-size 8192 \
    --kv-cache-dtype fp8_e5m2 \
    --attention-backend fa3 \
    --moe-runner-backend flashinfer_cutlass \
    --disable-custom-all-reduce \
    --enable-flashinfer-allreduce-fusion \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --port 8000
```

Notes on the configuration:

  • --kv-cache-dtype fp8_e5m2: This checkpoint includes pre-calibrated FP8 affine KV cache scales. However, SGLang does not currently load them due to a naming mismatch (qkv_proj.k_scale vs qkv_proj.attn.k_scale). SGLang falls back to dynamic FP8 scales at runtime, which still halves KV memory vs BF16 and doubles the token pool. Once SGLang updates their M2.7 model definition, the pre-calibrated scales will be used automatically.
  • --attention-backend fa3: Flash Attention 3, optimised for Blackwell (SM120). Benchmarked at ~90 tok/s vs ~88 tok/s with flashinfer. Use flashinfer as a fallback if fa3 is unavailable on your hardware.
  • --context-length 131072: With FP8 KV cache, the server allocates ~198K tokens in the KV pool. Setting context-length to 131K leaves headroom for concurrent requests.
  • --disable-custom-all-reduce and --enable-flashinfer-allreduce-fusion: Required for TP2 stability on Blackwell without NVLink.
  • --chunked-prefill-size 8192: Prevents long prompts from blocking the scheduler when multiple agents send large contexts.
  • You may see "DeepGemm scale_fmt not ue8m0" warnings — these are cosmetic and do not affect output quality.

Verified Performance (2× RTX PRO 6000 Blackwell)

| Metric | Value |
|---|---|
| Model size on disk | 131GB (vs 215GB FP8 original) |
| VRAM per GPU (weights) | ~70GB |
| KV cache dtype | FP8 E5M2 (dynamic scales) |
| KV cache token pool | ~198K tokens |
| Context length | 131,072 tokens |
| Decode throughput (single user) | ~90 tok/s |
| Model load time | ~27 seconds |
| CUDA graph capture | ~34 seconds |

Concurrency Scaling (2× RTX PRO 6000 Blackwell)

Tested with 512-token generation requests at each concurrency level:

| Concurrent requests | Aggregate tok/s | Per-agent tok/s | VRAM peak |
|---|---|---|---|
| 1 | 90 | 90 | 182.8 GB |
| 4 | 228 | 57 | 182.9 GB |
| 8 | 354 | 44 | 182.8 GB |
| 16 | 561 | 35 | 182.9 GB |
| 32 | 590 | 28 | 182.8 GB |
| 64 | 631 | 21 | 182.7 GB |

Peak aggregate throughput: 631 tok/s at N=64. Practical limit (>20 tok/s per agent): N=64. VRAM usage is essentially flat — only ~117 MiB KV growth from N=1 to N=64, meaning the 198K-token KV pool easily accommodates short-to-medium context concurrent requests.
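The aggregate vs. per-agent relationship in the table is worth making explicit: aggregate tok/s is measured against wall-clock time across all streams, and per-agent is simply aggregate divided by N. A minimal async harness in that shape might look like the sketch below, where `fake_complete` is a stand-in for a real OpenAI-compatible client call against the server:

```python
import asyncio
import time

async def bench(concurrency: int, gen_tokens: int, complete) -> tuple[float, float]:
    """Fire `concurrency` generation requests at once; return
    (aggregate tok/s, per-agent tok/s) from wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(complete(gen_tokens) for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    aggregate = concurrency * gen_tokens / elapsed
    return aggregate, aggregate / concurrency

async def fake_complete(n_tokens: int, tok_per_s: float = 90.0) -> None:
    # Stand-in for a chat-completions call: sleep for the decode time a
    # single ~90 tok/s stream would need.
    await asyncio.sleep(n_tokens / tok_per_s)

aggregate, per_agent = asyncio.run(bench(4, 32, fake_complete))
```

With real network calls the same harness reproduces the table's shape: aggregate rises with N while per-agent falls as streams share the GPUs.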

vLLM (untested — theoretical config)

Note: This configuration has not been tested by me, as I use SGLang primarily. It is provided as a starting point based on vLLM's documented NVFP4/ModelOpt support. If you test it, please report results.

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_NVFP4_GEMM_BACKEND=cutlass

python -m vllm.entrypoints.openai.api_server \
    --model <path-or-repo>/MiniMax-M2.7-NVFP4 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --quantization modelopt \
    --gpu-memory-utilization 0.95 \
    --max-model-len 131072 \
    --kv-cache-dtype fp8_e5m2 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --host 0.0.0.0 \
    --port 8000
```

Recommended inference parameters

Per MiniMax's guidance:

temperature=1.0
top_p=0.95
top_k=40
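As a request against the SGLang server above, those parameters look like this (a sketch; `top_k` is outside the core OpenAI schema, so it is passed as an extra field that SGLang-style servers accept):

```python
import json

payload = {
    "model": "minimax-m2.7",    # --served-model-name from the launch command
    "messages": [
        {"role": "user", "content": "Explain NVFP4 block scaling in two sentences."}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,                # extension field, not part of the core OpenAI schema
    "max_tokens": 512,
}
body = json.dumps(payload)
# e.g. POST http://localhost:8000/v1/chat/completions with this body
```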

Backend Optimization Notes

The following backends were benchmarked on 2× RTX PRO 6000 Blackwell (SM120). The selected configuration represents the optimal combination found so far, but this was surface-level testing only — I tested the most obvious backend combinations and moved on. There are likely further performance unlocks available through deeper kernel tuning, MoE runner experimentation, or serving configuration changes that I haven't explored here. Contributions and findings welcome.

| Backend configuration | Avg tok/s | Notes |
|---|---|---|
| flashinfer attn + flashinfer_cutlass MoE + flashinfer_cudnn FP4 | 88.4 | Stable baseline |
| fa3 attn + flashinfer_cutlass MoE + flashinfer_cudnn FP4 | 90.0 | Selected (+1.8%) |
| flashinfer attn + flashinfer_mxfp4 MoE | OOM | Incompatible with this model |
| Baseline + --num-continuous-decode-steps 2 | 88.8 | Negligible gain |

  • flashinfer_cudnn is auto-selected by SGLang on Blackwell for FP4 GEMM operations — this is the optimal path designed by NVIDIA for their own hardware.
  • fa3 (Flash Attention 3) provides a small but consistent improvement on Blackwell. Fall back to flashinfer on non-Blackwell GPUs.
  • MoE kernel tuning (triton) does not apply here — flashinfer_cutlass bypasses the triton MoE path entirely.

Baseline SGLang Config (conservative)

If you encounter issues with FA3 or FP8 KV on your hardware, use this proven-stable baseline:

```bash
python -m sglang.launch_server \
    --model-path <path-or-repo>/MiniMax-M2.7-NVFP4 \
    --served-model-name minimax-m2.7 \
    --trust-remote-code \
    --tp 2 --ep 2 \
    --quantization modelopt_fp4 \
    --mem-fraction-static 0.90 \
    --context-length 65536 \
    --max-running-requests 16 \
    --chunked-prefill-size 8192 \
    --kv-cache-dtype bf16 \
    --attention-backend flashinfer \
    --moe-runner-backend flashinfer_cutlass \
    --disable-custom-all-reduce \
    --enable-flashinfer-allreduce-fusion \
    --host 0.0.0.0 \
    --port 8000
```

This uses BF16 KV cache (halves the token pool to ~99K but avoids any FP8 KV compatibility issues) and flashinfer attention (universally supported).

Architecture Quick Reference

| Property | Value |
|---|---|
| Total parameters | 230B |
| Active parameters | 10B |
| Experts | 256 (top-8 routing) |
| Layers | 62 |
| Hidden size | 3,072 |
| Expert intermediate | 1,536 |
| Context window | 204,800 (native) |
| Quantized layers | Expert MLPs only (NVFP4) |
| Preserved layers | Attention, router, norms, embeddings (BF16) |
| Quantizers inserted | 96,165 |

Full Quantization Recipe (Reproducible)

I performed this on 2× NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 each) with an AMD Threadripper PRO 9955WX and 256GB DDR5 RAM, but the recipe is not tied to this hardware. You can run it on any setup with enough combined GPU + CPU memory to hold the model during calibration — ModelOpt automatically offloads layers to CPU when they don't fit in VRAM. More VRAM means faster calibration (less offloading), but even a single 24GB GPU with sufficient system RAM should work, just slower. The key requirement is enough total memory (GPU + CPU) to hold the FP8/BF16 weights (~215GB) plus calibration activations.

Environment setup

```bash
conda create -n quantize python=3.12 -y
conda activate quantize
pip install "nvidia-modelopt[all]" torch transformers accelerate sentencepiece protobuf hf_transfer
git clone https://github.com/NVIDIA/Model-Optimizer.git
cd Model-Optimizer
pip install -e ".[all]"
```

Patches applied to NVIDIA Model Optimizer

I applied two patches to the Model Optimizer source. Both address practical issues encountered during quantisation of large MoE models on consumer multi-GPU setups.

1. Randomised calibration sampling (dataset_utils.py line 303):

```python
# After: dataset_splits = [load_dataset(streaming=True, **config, split=s) for s in splits]
# Added:
dataset_splits = [ds.shuffle(seed=42, buffer_size=10000) for ds in dataset_splits]
```

The default implementation takes the first N samples sequentially. This patch randomises within a 10K-entry streaming buffer with a fixed seed, ensuring diverse sampling while maintaining reproducibility.

2. CPU export compatibility (huggingface.py lines 1103, 1116):

```python
# Changed:
#     with torch.cuda.device(self.weight.device):
# To:
with (torch.cuda.device(self.weight.device) if self.weight.device.type == "cuda" else torch.no_grad()):
```

When model weights are split across GPU and CPU (common on 2-GPU setups where the full model exceeds total VRAM), the default HuggingFace export function assumes all weights are on CUDA. This patch handles CPU-offloaded layers gracefully.

Quantization command

```bash
export HF_HUB_TRUST_REMOTE_CODE=1

python hf_ptq.py \
    --pyt_ckpt_path /mnt/models/MiniMax-M2.7 \
    --qformat nvfp4_experts_only \
    --export_fmt hf \
    --export_path /mnt/models/MiniMax-M2.7-NVFP4 \
    --trust_remote_code \
    --dataset "open_code_reasoning,open_math_reasoning,magpie,nemotron-post-training-dataset-v2,llama-nemotron-post-training-dataset" \
    --calib_size "1000,1000,1000,1000,1000" \
    --calib_seq 4096 \
    --kv_cache_qformat fp8_affine \
    --moe_calib_experts_ratio 1.0 \
    --inference_tensor_parallel 2 \
    --batch_size 4 \
    --skip_generate \
    --verbose
```

Key quantization parameters

| Parameter | Value | Rationale |
|---|---|---|
| qformat | nvfp4_experts_only | Only expert MLPs quantised; attention, routers, norms stay BF16 |
| calib_size | 1000 × 5 datasets = 5,000 | 10× standard; past diminishing returns for max calibration (see analysis) |
| calib_seq | 4096 | Captures longer-range activation patterns vs. default 512 |
| kv_cache_qformat | fp8_affine | Pre-calibrated per-channel affine FP8 KV scales included in checkpoint |
| moe_calib_experts_ratio | 1.0 | All 256 experts participate in calibration |
| inference_tensor_parallel | 2 | Model split across 2 GPUs for calibration forward passes |
| batch_size | 4 | Parallel calibration; 4× throughput vs. default batch_size=1 |
| Shuffle patch | seed=42, buffer_size=10000 | Randomised sampling with reproducibility |

Architecture details

Model: MiniMaxM2ForCausalLM
Total parameters: 230B
Active parameters per token: 10B
Experts per layer: 256
Experts per token: 8 (top-k routing)
Shared experts: None
Layers: 62
Hidden size: 3072
Context length: 196,608 tokens
Quantizers inserted: 96,165
Attention: BF16 (not quantised)
Expert MLPs: NVFP4 (E2M1 + FP8 E4M3 block scales + FP32 tensor scales)
KV cache: FP8 affine (pre-calibrated k_scale/v_scale tensors)

Calibration Sample Count Analysis

Choosing an appropriate calibration sample count for a 256-expert MoE model requires balancing statistical coverage against compute cost. I arrived at 5,000 samples through the following analysis.

How max calibration works

NVIDIA's ModelOpt uses max calibration by default: for each quantised tensor, it records the maximum absolute activation value observed during calibration, then sets the FP4 scale factor to map that maximum to the largest representable FP4 value. The accuracy of the resulting scale factor therefore depends on how well the observed maximum approximates the true maximum of the activation distribution.
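A numerical sketch of that scheme (illustrative only, not ModelOpt's kernels): per 16-element block, the scale maps the observed |max| onto 6.0, the largest E2M1 magnitude, and each value then snaps to the nearest point on the E2M1 grid. In the real format the per-block scale is itself stored as FP8 E4M3 alongside a tensor-level FP32 scale; the sketch keeps it in full precision.

```python
import numpy as np

E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1_POS[::-1], E2M1_POS])   # signed FP4 E2M1 values

def quantize_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Max-calibrate one 16-element block: the scale maps |max| onto 6.0,
    then each value rounds to the nearest representable E2M1 point."""
    scale = np.abs(block).max() / 6.0
    idx = np.abs(block[:, None] / scale - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale, scale      # dequantized block and its scale

rng = np.random.default_rng(0)
block = rng.normal(size=16).astype(np.float32)
dequant, scale = quantize_block(block)
max_err = float(np.abs(dequant - block).max())
```

Because the widest gap on the E2M1 grid is between 4 and 6, the worst-case rounding error of a single value is one grid half-step, i.e. at most `scale` in absolute terms.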

The sample maximum of a distribution converges logarithmically — doubling the sample count produces smaller and smaller improvements. The critical question is: how many tokens must each expert see before its observed maximum stabilises?

Per-expert token coverage at different sample counts

For M2.7 (256 experts, top-8 routing, 4096 tokens per sample), the average tokens per expert at different calibration sizes:

| Total samples | Total tokens | Avg tokens/expert | Worst-case expert (~10× below avg) |
|---|---|---|---|
| 512 (standard) | 2.1M | 65K | ~6.5K |
| 1,500 | 6.1M | 192K | ~19.2K |
| 5,000 | 20.5M | 640K | ~64K |
| 30,000 | 123M | 3.8M | ~380K |

MoE routing follows a Zipf-like distribution — popular experts may fire 10× more than cold ones. The "worst-case expert" column estimates the token count for an expert at the 10th percentile of activation frequency.
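That hot/cold spread can be made concrete with a rough simulation. The Zipf(1)-shaped routing prior below is an assumption for illustration, not measured M2.7 routing statistics:

```python
import numpy as np

n_experts, top_k = 256, 8
total_tokens = 5_000 * 4_096

# Assumed Zipf(1)-shaped routing prior over experts (illustrative only).
probs = 1.0 / np.arange(1, n_experts + 1)
probs /= probs.sum()

expected_counts = probs * total_tokens * top_k   # expected tokens per expert
avg = expected_counts.mean()                     # 640K by construction
cold = np.quantile(expected_counts, 0.10)        # expert at the 10th percentile
hot = expected_counts.max()
```

Under this prior the mean stays pinned at ~640K tokens per expert while the 10th-percentile expert sees several times fewer, which is exactly the regime the "worst-case expert" column estimates.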

Stability thresholds for max calibration

Based on the properties of order statistics for heavy-tailed distributions (typical of neural network activations):

  • ~100 activations: Scale estimate is noisy; outlier-sensitive. Cold experts at this level risk poor FP4 scales.
  • ~1,000 activations: Scale estimate stabilises for most weight distributions. Adequate for a rough quantisation.
  • ~10,000 activations: The observed maximum has almost certainly captured the true tail of the distribution. Diminishing returns begin here.
  • ~100,000+ activations: Marginal improvement is negligible. Additional samples refine the scale estimate by fractions of a percent.

Why 5,000 and not 512

At the standard 512 samples, the coldest experts see ~6.5K tokens — just past the stability threshold, with no margin. Any routing imbalance or domain mismatch in the calibration data could push individual experts below the threshold, resulting in FP4 scales that clip or underutilise the representable range.

At 5,000 samples, even the coldest experts see ~64K tokens — firmly in the diminishing-returns regime. The scale estimates are stable regardless of routing variance.

Why 5,000 and not 30,000

The improvement from 5,000 to 30,000 samples is marginal for max calibration. At 5,000 samples, the worst-case expert already has ~64K tokens — 6× past the ~10K stability threshold. Increasing to 30,000 moves this to ~380K, but the max statistic has already converged well before that point.

The additional calibration quality from 30,000 samples is estimated at 2–5% improvement in scale accuracy for the coldest experts. However, on the hardware used for this quantisation (2× RTX PRO 6000 with CPU offloading), each calibration sample takes ~11 seconds at batch_size=4. This means:

| Samples | Compute time | Cold-expert tokens | Marginal quality vs. 5K |
|---|---|---|---|
| 5,000 | ~15 hours | ~64K | baseline |
| 10,000 | ~30 hours | ~128K | ~1–2% |
| 30,000 | ~90 hours | ~380K | ~2–5% |

The 5,000-sample configuration provides the best tradeoff: it reaches well into the diminishing-returns regime while remaining feasible as an overnight run. The quality difference between 5,000 and 30,000 samples is smaller than the quality difference between using 1 generic dataset vs. 5 domain-specific datasets — making dataset diversity the more important lever.

Coverage summary

With 5,000 samples × 4,096 tokens = ~20.5M tokens flowing through top-8/256 routing:

  • Average tokens per expert: ~640K
  • Worst-case (cold) expert: ~64K (6× above stability threshold)
  • Probability of any expert seeing zero tokens: statistically negligible
  • Chat template applied to all 5,000 samples (all 5 datasets use messages format)
  • Domains covered: competitive coding, mathematical reasoning, multi-turn instruction following, STEM/chat/safety, function calling/tool use

Known Limitations

  • Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
  • Quality may differ from original FP8 on tasks extremely sensitive to numerical precision
  • Calibration uses natural top-k routing with 100% expert ratio, not forced all-expert activation
  • Pre-calibrated FP8 KV cache scales are included but SGLang does not currently load them due to a tensor naming mismatch (qkv_proj.k_scale vs qkv_proj.attn.k_scale). SGLang uses dynamic FP8 scales instead, which still halve KV memory. This will resolve automatically when SGLang updates their M2.7 model definition.
  • No benchmark comparisons vs FP8 baseline included yet (contributions welcome)
  • SGLang NVFP4 MoE kernel performance is still maturing; expect improvements in future releases

Quality Validation

Beyond compile-checking and basic inference tests, this quantisation was validated through an autonomous multi-agent coding pipeline (13 sequential tasks) that:

  1. Read a 1,145-line architectural specification
  2. Created 11 Python source files across 4 build phases (schema scaffolding → domain stores → retrieval APIs → unified router)
  3. Produced 2,476 lines of working code — all 11 files compile-clean
  4. Each task involved spec reading, existing code comprehension, file creation, compile verification, and scout validation
  5. The model self-diagnosed and fixed its own import errors in follow-up tasks

Qualitative test results:

| Test | Result |
|---|---|
| Math (25×37) | ✅ 925, correct reasoning chain |
| Code (palindrome) | ✅ Correct .isalnum() + [::-1] |
| Code (binary search) | ✅ Correct with overflow-safe midpoint |
| Code (merge sort, 1024 tokens) | ✅ Coherent, well-commented |
| Instruction following (5 countries) | ✅ Exact format compliance |
| Autonomous multi-file build (13 tasks) | ✅ 12/13 complete, 2,476 lines |

License

Same as base model: Modified MIT

Citation

@misc{minimax-m27-nvfp4-2026,
  title={MiniMax-M2.7-NVFP4: High-Calibration-Quality Expert-Only NVFP4 Quantisation},
  author={NinjaBoffin},
  year={2026},
  url={https://huggingface.co/NinjaBoffin/MiniMax-M2.7-NVFP4}
}
