MiniMax-M2.7-NVFP4

A high-calibration-quality NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 for NVIDIA Blackwell GPUs.

5,000 calibration samples across 5 diverse datasets. Pre-calibrated FP8 affine KV cache. 100% expert coverage. Attention preserved in BF16. Seeded random sampling. Fully reproducible recipe. Validated at 90 tok/s single-user, 631 tok/s aggregate at N=64 on 2× RTX PRO 6000 Blackwell.

Model Description

MiniMax-M2.7-NVFP4 is an expert-only NVFP4-quantized version of MiniMax-M2.7, a 230B-parameter Mixture-of-Experts language model with 10B active parameters and 256 experts (top-8 routing).

The original FP8 checkpoint was loaded, dequantized to BF16 during calibration, then quantized to NVFP4 (4-bit with blockwise FP8 E4M3 scales per 16 elements) using NVIDIA Model Optimizer (v0.43.0).

What's quantized

Only the MoE expert MLP layers (gate, up, and down projections across all 256 experts) are quantized to NVFP4. Everything else stays in BF16:

  • ✅ Expert MLPs (256 × 3 projections × 62 layers) → NVFP4
  • ❌ Self-attention (Q/K/V/O projections) → BF16 (preserved for coherence and instruction-following quality)
  • ❌ Router/gate weights → BF16
  • ❌ Layer norms → BF16
  • ❌ Embedding / LM head → BF16

This follows NVIDIA's recommendation for MoE models and aligns with research (Egiazarian et al., 2025) demonstrating that attention layers are quality-sensitive under FP4 quantization.

Pre-calibrated FP8 KV Cache

This checkpoint includes pre-calibrated FP8 affine KV cache scales (k_scale and v_scale tensors). This means:

  • Serving frameworks (SGLang, vLLM) can use FP8 KV cache out of the box
  • FP8 KV cache halves memory vs BF16 KV, effectively doubling your context budget
  • No runtime KV scale computation needed — scales were computed during calibration
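The affine KV-cache scheme above can be sketched numerically. This is an illustrative reconstruction, not ModelOpt's implementation: `calibrate_kv_scale` and the E4M3 range constant are assumptions for the sketch.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def calibrate_kv_scale(activations: np.ndarray) -> float:
    """Max calibration: pick the scale that maps the observed |max| of the
    K (or V) activations onto the top of the E4M3 range."""
    return float(np.abs(activations).max()) / E4M3_MAX

rng = np.random.default_rng(0)
k_acts = rng.normal(0.0, 2.0, size=(1024, 128)).astype(np.float32)  # stand-in K activations

k_scale = calibrate_kv_scale(k_acts)
scaled = k_acts / k_scale          # everything now fits within [-448, 448]

# Memory arithmetic behind the "doubled context budget" claim:
# FP8 stores 1 byte per element vs 2 bytes for BF16 at the same pool size.
tokens_bf16, tokens_fp8 = 99_000, 2 * 99_000
```

At serving time the stored `k_scale`/`v_scale` tensors let the framework skip exactly this calibration step.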

Motivation and Design Rationale

When quantising a 256-expert MoE model, several properties of the standard quantisation workflow raised concerns that I wanted to address empirically:

1. Calibration sample diversity

Hypothesis: With top-8/256 routing, the majority of experts activate only for specific token distributions. A small, domain-narrow calibration set (e.g., 512 samples of CNN news) may leave many experts with poorly representative activation statistics, resulting in suboptimal FP4 scale factors.

Approach: I calibrated with 5,000 samples drawn from 5 datasets spanning competitive coding, mathematical reasoning, multi-turn instruction following, STEM/chat, and function calling/tool use. At ~20.5M tokens with top-8 routing, each expert sees approximately 640K tokens on average — well past the stability threshold for max calibration scale estimation (see Calibration Sample Count Analysis below).
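The ~640K figure follows from simple routing arithmetic, using the card's own numbers:

```python
samples, seq_len = 5_000, 4_096     # calibration set from the recipe below
top_k, n_experts = 8, 256           # M2.7 routing

total_tokens = samples * seq_len                  # 20,480,000 (~20.5M) tokens
expert_events = total_tokens * top_k              # each token activates 8 experts
avg_tokens_per_expert = expert_events // n_experts

print(total_tokens, avg_tokens_per_expert)        # 20480000 640000
```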

2. Sequential vs. random sampling

Hypothesis: NVIDIA's default hf_ptq.py takes the first N samples sequentially from each dataset. For datasets sorted by source, difficulty, or topic, this could bias calibration toward a narrow sub-population (e.g., only AIZU problems from OpenCodeReasoning).

Approach: I patched the dataset sampling to use dataset.shuffle(seed=42, buffer_size=10000), drawing samples randomly from a 10K-entry buffer while maintaining full reproducibility via the fixed seed.
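The behaviour of `shuffle(seed=…, buffer_size=…)` on a streaming dataset can be approximated in pure Python. The sketch below mimics the buffer-and-replace strategy; `buffered_shuffle` is an illustrative stand-in, not the `datasets` library implementation:

```python
import random
from typing import Iterable, Iterator

def buffered_shuffle(stream: Iterable, buffer_size: int, seed: int) -> Iterator:
    """Fill a fixed-size buffer, then repeatedly emit a random slot and
    refill it from the stream; flush the buffer at the end. A fixed seed
    makes the emitted order fully reproducible."""
    rng = random.Random(seed)
    it = iter(stream)
    buf = []
    for item in it:
        buf.append(item)
        if len(buf) == buffer_size:
            break
    for item in it:
        idx = rng.randrange(len(buf))
        yield buf[idx]       # emit a random buffered sample...
        buf[idx] = item      # ...and replace it with the next stream item
    rng.shuffle(buf)
    yield from buf

first_run = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
second_run = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
```

Two runs with the same seed produce identical sample orders, which is what keeps the patched calibration reproducible.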

3. Expert calibration completeness

Hypothesis: Under natural routing with limited calibration data, rarely-activated experts may not accumulate sufficient activation statistics for accurate scale computation. The moe_calib_experts_ratio parameter controls whether all experts participate in calibration.

Approach: I set moe_calib_experts_ratio=1.0 to ensure all 256 experts are included in FP4 scale computation, regardless of activation frequency during calibration.

4. Attention layer preservation

Hypothesis: Quantising attention layers to FP4 risks degrading coherence, instruction following, and long-range dependency modelling — especially given research showing Hadamard rotation is detrimental for NVFP4 at block size 16 (Egiazarian et al., 2025).

Approach: I used nvfp4_experts_only, which preserves all attention layers in BF16 while quantising only expert MLPs. This mirrors NVIDIA's own strategy for their DeepSeek-R1-NVFP4 checkpoint.

5. Agentic/tool-calling coverage

Hypothesis: M2.7 is designed for agentic tool-calling workflows. Calibrating without function-calling-formatted data means the experts responsible for tool-use token patterns may receive unrepresentative activation statistics.

Approach: I included nvidia/Llama-Nemotron-Post-Training-Dataset which contains function calling, tool use, and reasoning on/off mode switching data — directly relevant to M2.7's primary use case.

Calibration Dataset Composition

| Dataset | Samples | Domain | Chat template |
|---|---|---|---|
| nvidia/OpenCodeReasoning | 1,000 | Competitive coding, reasoning chains | ✅ |
| nvidia/OpenMathReasoning | 1,000 | Mathematical reasoning | ✅ |
| Magpie-Align/Magpie-Pro-MT-300K-v0.1 | 1,000 | Multi-turn instruction following | ✅ |
| nvidia/Nemotron-Post-Training-Dataset-v2 | 1,000 | STEM, chat, math, code | ✅ |
| nvidia/Llama-Nemotron-Post-Training-Dataset | 1,000 | Function calling, tool use, reasoning on/off | ✅ |

All 5 datasets use the messages format, meaning the tokenizer's apply_chat_template is automatically invoked during preprocessing. This ensures calibration activations match real inference patterns.

Summary of Design Choices vs. Standard Defaults

The table below summarises where this quantisation departs from the typical ModelOpt PTQ defaults and why. The standard defaults are sensible for quick iteration; the choices here prioritise calibration thoroughness for a one-time quantisation intended for long-term use.

| Dimension | Standard default | This release | Rationale |
|---|---|---|---|
| Calibration samples | 128–512 | 5,000 | 10× standard; well past diminishing-returns threshold (see analysis below) |
| Calibration datasets | 1–2 generic (e.g., CNN/DailyMail) | 5 domain-specific | Covers code, math, instruction, STEM, and tool use |
| Sampling strategy | Sequential (first N entries) | Seeded random (seed=42, buffer=10K) | Avoids sub-population bias in sorted datasets |
| Expert coverage | Natural routing only | 100% (moe_calib_experts_ratio=1.0) | Guarantees rarely-activated experts receive calibration |
| KV cache | Uncalibrated (BF16 at runtime) | FP8 affine (pre-calibrated scales) | Halves KV memory; doubles effective context budget |
| Attention precision | Varies by recipe | BF16 (preserved) | Protects coherence and instruction-following quality |
| Sequence length | ~512 | 4096 | Captures longer-range activation distributions |
| Reproducibility | Partial or undocumented | Full recipe, patches, and seed published | Enables independent verification |

Hardware Requirements

| Configuration | VRAM (weights) | KV cache tokens | Context | Notes |
|---|---|---|---|---|
| 2× RTX PRO 6000 Blackwell (96GB each) | ~70GB/GPU | ~198K (FP8 KV) | 131K | Tested and verified |
| 2× RTX PRO 6000 Blackwell (BF16 KV) | ~70GB/GPU | ~99K (BF16 KV) | 65K | Conservative, no FP8 KV |
| 4× RTX 5090 (32GB each) | ~33GB/GPU | ~2K | ~2K | Tight, short context only |
| 2× B200 (192GB each) | ~65GB/GPU | Very large | Very large | Datacenter config |

NVFP4 requires NVIDIA Blackwell GPUs (SM100/SM120). This model will not run on Hopper (H100/H200), Ada (RTX 4090), or older architectures.

How to Run

SGLang (recommended)

Tested and verified on 2× RTX PRO 6000 Blackwell (96GB GDDR7 each). The configuration below was used for all quality and throughput testing.

```bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=PHB
export SGLANG_DISABLE_CUDNN_CHECK=1

python -m sglang.launch_server \
    --model-path <path-or-repo>/MiniMax-M2.7-NVFP4 \
    --served-model-name minimax-m2.7 \
    --trust-remote-code \
    --tp 2 --ep 2 \
    --quantization modelopt_fp4 \
    --mem-fraction-static 0.90 \
    --context-length 131072 \
    --max-running-requests 16 \
    --chunked-prefill-size 8192 \
    --kv-cache-dtype fp8_e5m2 \
    --attention-backend fa3 \
    --moe-runner-backend flashinfer_cutlass \
    --disable-custom-all-reduce \
    --enable-flashinfer-allreduce-fusion \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --port 8000
```

Notes on the configuration:

  • --kv-cache-dtype fp8_e5m2: This checkpoint includes pre-calibrated FP8 affine KV cache scales. However, SGLang does not currently load them due to a naming mismatch (qkv_proj.k_scale vs qkv_proj.attn.k_scale). SGLang falls back to dynamic FP8 scales at runtime, which still halves KV memory vs BF16 and doubles the token pool. Once SGLang updates their M2.7 model definition, the pre-calibrated scales will be used automatically.
  • --attention-backend fa3: Flash Attention 3, optimised for Blackwell (SM120). Benchmarked at ~90 tok/s vs ~88 tok/s with flashinfer. Use flashinfer as a fallback if fa3 is unavailable on your hardware.
  • --context-length 131072: With FP8 KV cache, the server allocates ~198K tokens in the KV pool. Setting context-length to 131K leaves headroom for concurrent requests.
  • --disable-custom-all-reduce and --enable-flashinfer-allreduce-fusion: Required for TP2 stability on Blackwell without NVLink.
  • --chunked-prefill-size 8192: Prevents long prompts from blocking the scheduler when multiple agents send large contexts.
  • You may see "DeepGemm scale_fmt not ue8m0" warnings — these are cosmetic and do not affect output quality.

Verified Performance (2× RTX PRO 6000 Blackwell)

| Metric | Value |
|---|---|
| Model size on disk | 131GB (vs 215GB FP8 original) |
| VRAM per GPU (weights) | ~70GB |
| KV cache dtype | FP8 E5M2 (dynamic scales) |
| KV cache token pool | ~198K tokens |
| Context length | 131,072 tokens |
| Decode throughput (single user) | ~90 tok/s |
| Model load time | ~27 seconds |
| CUDA graph capture | ~34 seconds |

Concurrency Scaling (2× RTX PRO 6000 Blackwell)

Tested with 512-token generation requests at each concurrency level:

| Concurrent requests | Aggregate tok/s | Per-agent tok/s | VRAM peak |
|---|---|---|---|
| 1 | 90 | 90 | 182.8 GB |
| 4 | 228 | 57 | 182.9 GB |
| 8 | 354 | 44 | 182.8 GB |
| 16 | 561 | 35 | 182.9 GB |
| 32 | 590 | 28 | 182.8 GB |
| 64 | 631 | 21 | 182.7 GB |

Peak aggregate throughput: 631 tok/s at N=64. Practical limit (>20 tok/s per agent): N=64. VRAM usage is essentially flat — only ~117 MiB KV growth from N=1 to N=64, meaning the 198K-token KV pool easily accommodates short-to-medium context concurrent requests.
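The aggregate vs. per-agent relationship in the table is worth making explicit: aggregate tok/s is measured against wall-clock time across all streams, and per-agent is simply aggregate divided by N. A minimal async harness in that shape might look like the sketch below, where `fake_complete` is a stand-in for a real OpenAI-compatible client call against the server:

```python
import asyncio
import time

async def bench(concurrency: int, gen_tokens: int, complete) -> tuple[float, float]:
    """Fire `concurrency` generation requests at once; return
    (aggregate tok/s, per-agent tok/s) from wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(complete(gen_tokens) for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    aggregate = concurrency * gen_tokens / elapsed
    return aggregate, aggregate / concurrency

async def fake_complete(n_tokens: int, tok_per_s: float = 90.0) -> None:
    # Stand-in for a chat-completions call: sleep for the decode time a
    # single ~90 tok/s stream would need.
    await asyncio.sleep(n_tokens / tok_per_s)

aggregate, per_agent = asyncio.run(bench(4, 32, fake_complete))
```

With real network calls the same harness reproduces the table's shape: aggregate rises with N while per-agent falls as streams share the GPUs.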

vLLM (untested — theoretical config)

Note: This configuration has not been tested by me, as I use SGLang primarily. It is provided as a starting point based on vLLM's documented NVFP4/ModelOpt support. If you test it, please report results.

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_NVFP4_GEMM_BACKEND=cutlass

python -m vllm.entrypoints.openai.api_server \
    --model <path-or-repo>/MiniMax-M2.7-NVFP4 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --quantization modelopt \
    --gpu-memory-utilization 0.95 \
    --max-model-len 131072 \
    --kv-cache-dtype fp8_e5m2 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --host 0.0.0.0 \
    --port 8000
```

Recommended inference parameters

Per MiniMax's guidance:

temperature=1.0
top_p=0.95
top_k=40
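As a request against the SGLang server above, those parameters look like this (a sketch; `top_k` is outside the core OpenAI schema, so it is passed as an extra field that SGLang-style servers accept):

```python
import json

payload = {
    "model": "minimax-m2.7",    # --served-model-name from the launch command
    "messages": [
        {"role": "user", "content": "Explain NVFP4 block scaling in two sentences."}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,                # extension field, not part of the core OpenAI schema
    "max_tokens": 512,
}
body = json.dumps(payload)
# e.g. POST http://localhost:8000/v1/chat/completions with this body
```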

Backend Optimization Notes

The following backends were benchmarked on 2× RTX PRO 6000 Blackwell (SM120). The selected configuration represents the optimal combination found so far, but this was surface-level testing only — I tested the most obvious backend combinations and moved on. There are likely further performance unlocks available through deeper kernel tuning, MoE runner experimentation, or serving configuration changes that I haven't explored here. Contributions and findings welcome.

| Backend configuration | Avg tok/s | Notes |
|---|---|---|
| flashinfer attn + flashinfer_cutlass MoE + flashinfer_cudnn FP4 | 88.4 | Stable baseline |
| fa3 attn + flashinfer_cutlass MoE + flashinfer_cudnn FP4 | 90.0 | Selected (+1.8%) |
| flashinfer attn + flashinfer_mxfp4 MoE | OOM | Incompatible with this model |
| Baseline + --num-continuous-decode-steps 2 | 88.8 | Negligible gain |

  • flashinfer_cudnn is auto-selected by SGLang on Blackwell for FP4 GEMM operations — this is the optimal path designed by NVIDIA for their own hardware.
  • fa3 (Flash Attention 3) provides a small but consistent improvement on Blackwell. Fall back to flashinfer on non-Blackwell GPUs.
  • MoE kernel tuning (triton) does not apply here — flashinfer_cutlass bypasses the triton MoE path entirely.

Baseline SGLang Config (conservative)

If you encounter issues with FA3 or FP8 KV on your hardware, use this proven-stable baseline:

```bash
python -m sglang.launch_server \
    --model-path <path-or-repo>/MiniMax-M2.7-NVFP4 \
    --served-model-name minimax-m2.7 \
    --trust-remote-code \
    --tp 2 --ep 2 \
    --quantization modelopt_fp4 \
    --mem-fraction-static 0.90 \
    --context-length 65536 \
    --max-running-requests 16 \
    --chunked-prefill-size 8192 \
    --kv-cache-dtype bf16 \
    --attention-backend flashinfer \
    --moe-runner-backend flashinfer_cutlass \
    --disable-custom-all-reduce \
    --enable-flashinfer-allreduce-fusion \
    --host 0.0.0.0 \
    --port 8000
```

This uses BF16 KV cache (halves the token pool to ~99K but avoids any FP8 KV compatibility issues) and flashinfer attention (universally supported).

Architecture Quick Reference

| Property | Value |
|---|---|
| Total parameters | 230B |
| Active parameters | 10B |
| Experts | 256 (top-8 routing) |
| Layers | 62 |
| Hidden size | 3,072 |
| Expert intermediate | 1,536 |
| Context window | 204,800 (native) |
| Quantized layers | Expert MLPs only (NVFP4) |
| Preserved layers | Attention, router, norms, embeddings (BF16) |
| Quantizers inserted | 96,165 |

Full Quantization Recipe (Reproducible)

I performed this on 2× NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 each) with an AMD Threadripper PRO 9955WX and 256GB DDR5 RAM, but the recipe is not tied to this hardware. You can run it on any setup with enough combined GPU + CPU memory to hold the model during calibration — ModelOpt automatically offloads layers to CPU when they don't fit in VRAM. More VRAM means faster calibration (less offloading), but even a single 24GB GPU with sufficient system RAM should work, just slower. The key requirement is enough total memory (GPU + CPU) to hold the FP8/BF16 weights (~215GB) plus calibration activations.

Environment setup

```bash
conda create -n quantize python=3.12 -y
conda activate quantize
pip install "nvidia-modelopt[all]" torch transformers accelerate sentencepiece protobuf hf_transfer
git clone https://github.com/NVIDIA/Model-Optimizer.git
cd Model-Optimizer
pip install -e ".[all]"
```

Patches applied to NVIDIA Model Optimizer

I applied two patches to the Model Optimizer source. Both address practical issues encountered during quantisation of large MoE models on consumer multi-GPU setups.

1. Randomised calibration sampling (dataset_utils.py line 303):

```python
# After: dataset_splits = [load_dataset(streaming=True, **config, split=s) for s in splits]
# Added:
dataset_splits = [ds.shuffle(seed=42, buffer_size=10000) for ds in dataset_splits]
```

The default implementation takes the first N samples sequentially. This patch randomises within a 10K-entry streaming buffer with a fixed seed, ensuring diverse sampling while maintaining reproducibility.

2. CPU export compatibility (huggingface.py lines 1103, 1116):

```python
# Changed:
#     with torch.cuda.device(self.weight.device):
# To:
with (torch.cuda.device(self.weight.device) if self.weight.device.type == "cuda" else torch.no_grad()):
```

When model weights are split across GPU and CPU (common on 2-GPU setups where the full model exceeds total VRAM), the default HuggingFace export function assumes all weights are on CUDA. This patch handles CPU-offloaded layers gracefully.

Quantization command

```bash
export HF_HUB_TRUST_REMOTE_CODE=1

python hf_ptq.py \
    --pyt_ckpt_path /mnt/models/MiniMax-M2.7 \
    --qformat nvfp4_experts_only \
    --export_fmt hf \
    --export_path /mnt/models/MiniMax-M2.7-NVFP4 \
    --trust_remote_code \
    --dataset "open_code_reasoning,open_math_reasoning,magpie,nemotron-post-training-dataset-v2,llama-nemotron-post-training-dataset" \
    --calib_size "1000,1000,1000,1000,1000" \
    --calib_seq 4096 \
    --kv_cache_qformat fp8_affine \
    --moe_calib_experts_ratio 1.0 \
    --inference_tensor_parallel 2 \
    --batch_size 4 \
    --skip_generate \
    --verbose
```

Key quantization parameters

| Parameter | Value | Rationale |
|---|---|---|
| qformat | nvfp4_experts_only | Only expert MLPs quantised; attention, routers, norms stay BF16 |
| calib_size | 1000 × 5 datasets = 5,000 | 10× standard; past diminishing returns for max calibration (see analysis) |
| calib_seq | 4096 | Captures longer-range activation patterns vs. default 512 |
| kv_cache_qformat | fp8_affine | Pre-calibrated per-channel affine FP8 KV scales included in checkpoint |
| moe_calib_experts_ratio | 1.0 | All 256 experts participate in calibration |
| inference_tensor_parallel | 2 | Model split across 2 GPUs for calibration forward passes |
| batch_size | 4 | Parallel calibration; 4× throughput vs. default batch_size=1 |
| Shuffle patch | seed=42, buffer_size=10000 | Randomised sampling with reproducibility |

Architecture details

Model: MiniMaxM2ForCausalLM
Total parameters: 230B
Active parameters per token: 10B
Experts per layer: 256
Experts per token: 8 (top-k routing)
Shared experts: None
Layers: 62
Hidden size: 3072
Context length: 196,608 tokens
Quantizers inserted: 96,165
Attention: BF16 (not quantised)
Expert MLPs: NVFP4 (E2M1 + FP8 E4M3 block scales + FP32 tensor scales)
KV cache: FP8 affine (pre-calibrated k_scale/v_scale tensors)

Calibration Sample Count Analysis

Choosing an appropriate calibration sample count for a 256-expert MoE model requires balancing statistical coverage against compute cost. I arrived at 5,000 samples through the following analysis.

How max calibration works

NVIDIA's ModelOpt uses max calibration by default: for each quantised tensor, it records the maximum absolute activation value observed during calibration, then sets the FP4 scale factor to map that maximum to the largest representable FP4 value. The accuracy of the resulting scale factor therefore depends on how well the observed maximum approximates the true maximum of the activation distribution.
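A numerical sketch of that scheme (illustrative only, not ModelOpt's kernels): per 16-element block, the scale maps the observed |max| onto 6.0, the largest E2M1 magnitude, and each value then snaps to the nearest point on the E2M1 grid. In the real format the per-block scale is itself stored as FP8 E4M3 alongside a tensor-level FP32 scale; the sketch keeps it in full precision.

```python
import numpy as np

E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1_POS[::-1], E2M1_POS])   # signed FP4 E2M1 values

def quantize_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Max-calibrate one 16-element block: the scale maps |max| onto 6.0,
    then each value rounds to the nearest representable E2M1 point."""
    scale = np.abs(block).max() / 6.0
    idx = np.abs(block[:, None] / scale - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale, scale      # dequantized block and its scale

rng = np.random.default_rng(0)
block = rng.normal(size=16).astype(np.float32)
dequant, scale = quantize_block(block)
max_err = float(np.abs(dequant - block).max())
```

Because the widest gap on the E2M1 grid is between 4 and 6, the worst-case rounding error of a single value is one grid half-step, i.e. at most `scale` in absolute terms.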

The sample maximum of a distribution converges logarithmically — doubling the sample count produces smaller and smaller improvements. The critical question is: how many tokens must each expert see before its observed maximum stabilises?

Per-expert token coverage at different sample counts

For M2.7 (256 experts, top-8 routing, 4096 tokens per sample), the average tokens per expert at different calibration sizes:

| Total samples | Total tokens | Avg tokens/expert | Worst-case expert (~10× below avg) |
|---|---|---|---|
| 512 (standard) | 2.1M | 65K | ~6.5K |
| 1,500 | 6.1M | 192K | ~19.2K |
| 5,000 | 20.5M | 640K | ~64K |
| 30,000 | 123M | 3.8M | ~380K |

MoE routing follows a Zipf-like distribution — popular experts may fire 10× more than cold ones. The "worst-case expert" column estimates the token count for an expert at the 10th percentile of activation frequency.
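That hot/cold spread can be made concrete with a rough simulation. The Zipf(1)-shaped routing prior below is an assumption for illustration, not measured M2.7 routing statistics:

```python
import numpy as np

n_experts, top_k = 256, 8
total_tokens = 5_000 * 4_096

# Assumed Zipf(1)-shaped routing prior over experts (illustrative only).
probs = 1.0 / np.arange(1, n_experts + 1)
probs /= probs.sum()

expected_counts = probs * total_tokens * top_k   # expected tokens per expert
avg = expected_counts.mean()                     # 640K by construction
cold = np.quantile(expected_counts, 0.10)        # expert at the 10th percentile
hot = expected_counts.max()
```

Under this prior the mean stays pinned at ~640K tokens per expert while the 10th-percentile expert sees several times fewer, which is exactly the regime the "worst-case expert" column estimates.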

Stability thresholds for max calibration

Based on the properties of order statistics for heavy-tailed distributions (typical of neural network activations):

  • ~100 activations: Scale estimate is noisy; outlier-sensitive. Cold experts at this level risk poor FP4 scales.
  • ~1,000 activations: Scale estimate stabilises for most weight distributions. Adequate for a rough quantisation.
  • ~10,000 activations: The observed maximum has almost certainly captured the true tail of the distribution. Diminishing returns begin here.
  • ~100,000+ activations: Marginal improvement is negligible. Additional samples refine the scale estimate by fractions of a percent.

Why 5,000 and not 512

At the standard 512 samples, the coldest experts see ~6.5K tokens — just past the stability threshold, with no margin. Any routing imbalance or domain mismatch in the calibration data could push individual experts below the threshold, resulting in FP4 scales that clip or underutilise the representable range.

At 5,000 samples, even the coldest experts see ~64K tokens — firmly in the diminishing-returns regime. The scale estimates are stable regardless of routing variance.

Why 5,000 and not 30,000

The improvement from 5,000 to 30,000 samples is marginal for max calibration. At 5,000 samples, the worst-case expert already has ~64K tokens — 6× past the ~10K stability threshold. Increasing to 30,000 moves this to ~380K, but the max statistic has already converged well before that point.

The additional calibration quality from 30,000 samples is estimated at 2–5% improvement in scale accuracy for the coldest experts. However, on the hardware used for this quantisation (2× RTX PRO 6000 with CPU offloading), each calibration sample takes ~11 seconds at batch_size=4. This means:

| Samples | Compute time | Cold-expert tokens | Marginal quality vs. 5K |
|---|---|---|---|
| 5,000 | ~15 hours | ~64K | baseline |
| 10,000 | ~30 hours | ~128K | ~1–2% |
| 30,000 | ~90 hours | ~380K | ~2–5% |

The 5,000-sample configuration provides the best tradeoff: it reaches well into the diminishing-returns regime while remaining feasible as an overnight run. The quality difference between 5,000 and 30,000 samples is smaller than the quality difference between using 1 generic dataset vs. 5 domain-specific datasets — making dataset diversity the more important lever.

Coverage summary

With 5,000 samples × 4,096 tokens = ~20.5M tokens flowing through top-8/256 routing:

  • Average tokens per expert: ~640K
  • Worst-case (cold) expert: ~64K (6× above stability threshold)
  • Probability of any expert seeing zero tokens: statistically negligible
  • Chat template applied to all 5,000 samples (all 5 datasets use messages format)
  • Domains covered: competitive coding, mathematical reasoning, multi-turn instruction following, STEM/chat/safety, function calling/tool use

Known Limitations

  • Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
  • Quality may differ from original FP8 on tasks extremely sensitive to numerical precision
  • Calibration uses natural top-k routing with 100% expert ratio, not forced all-expert activation
  • Pre-calibrated FP8 KV cache scales are included but SGLang does not currently load them due to a tensor naming mismatch (qkv_proj.k_scale vs qkv_proj.attn.k_scale). SGLang uses dynamic FP8 scales instead, which still halve KV memory. This will resolve automatically when SGLang updates their M2.7 model definition.
  • No benchmark comparisons vs FP8 baseline included yet (contributions welcome)
  • SGLang NVFP4 MoE kernel performance is still maturing; expect improvements in future releases

Quality Validation

Beyond compile-checking and basic inference tests, this quantisation was validated through an autonomous multi-agent coding pipeline (13 sequential tasks) that:

  1. Read a 1,145-line architectural specification
  2. Created 11 Python source files across 4 build phases (schema scaffolding → domain stores → retrieval APIs → unified router)
  3. Produced 2,476 lines of working code — all 11 files compile-clean
  4. Each task involved spec reading, existing code comprehension, file creation, compile verification, and scout validation
  5. The model self-diagnosed and fixed its own import errors in follow-up tasks

Qualitative test results:

| Test | Result |
|---|---|
| Math (25×37) | ✅ 925, correct reasoning chain |
| Code (palindrome) | ✅ Correct .isalnum() + [::-1] |
| Code (binary search) | ✅ Correct with overflow-safe midpoint |
| Code (merge sort, 1024 tokens) | ✅ Coherent, well-commented |
| Instruction following (5 countries) | ✅ Exact format compliance |
| Autonomous multi-file build (13 tasks) | ✅ 12/13 complete, 2,476 lines |

License

Same as base model: Modified MIT

Citation

@misc{minimax-m27-nvfp4-2026,
  title={MiniMax-M2.7-NVFP4: High-Calibration-Quality Expert-Only NVFP4 Quantisation},
  author={NinjaBoffin},
  year={2026},
  url={https://huggingface.co/NinjaBoffin/MiniMax-M2.7-NVFP4}
}
