FP8 Block-Quantized RedHatAI/Kimi-K2.6-FP8-BLOCK

This is a preliminary version (subject to change) of an FP8 block-quantized moonshotai/Kimi-K2.6 model, compatible with DeepGEMM FP8 kernels (which must be installed separately). Both weights and activations are quantized to FP8 format with vllm-project/llm-compressor.
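For intuition, the FP8_BLOCK scheme stores one scale per 128x128 weight tile rather than one scale per tensor. The sketch below simulates only the numerics of that idea (per-tile scale plus clipping to the FP8 E4M3 range); it is not llm-compressor's actual implementation, which stores tiles in a real float8 dtype and applies float8 rounding:

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in float8 e4m3
BLOCK = 128            # FP8_BLOCK uses 128x128 weight tiles

def quantize_block_fp8(w: np.ndarray):
    """Toy per-block FP8 quantization: one scale per 128x128 tile."""
    rows, cols = w.shape
    scales = np.empty((rows // BLOCK, cols // BLOCK), dtype=np.float32)
    q = np.empty_like(w)
    for i in range(0, rows, BLOCK):
        for j in range(0, cols, BLOCK):
            tile = w[i:i + BLOCK, j:j + BLOCK]
            # Scale so the tile's largest magnitude maps to the FP8 max value
            scale = max(np.abs(tile).max() / FP8_E4M3_MAX, 1e-12)
            scales[i // BLOCK, j // BLOCK] = scale
            q[i:i + BLOCK, j:j + BLOCK] = np.clip(
                tile / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX
            )
    return q, scales

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_block_fp8(w)

# Dequantize by broadcasting each tile's scale back over its 128x128 region
deq = q * np.repeat(np.repeat(s, BLOCK, axis=0), BLOCK, axis=1)
```

Because a finer-grained scale is kept per tile, outliers in one region of the weight matrix do not degrade the precision of the rest, which is why block schemes recover accuracy better than per-tensor FP8.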

It is compatible with, and has been tested against, vLLM v0.20.0. Deploy it via vllm serve using the recipes at https://recipes.vllm.ai/moonshotai/Kimi-K2.6.

Creation Script:

Kimi K2.6 support will land in https://github.com/vllm-project/llm-compressor/pull/2662. The script used to create this checkpoint is shown below:

from compressed_tensors.entrypoints.convert import CompressedTensorsDequantizer
from llmcompressor import model_free_ptq

# The moonshotai/Kimi-K2.6 checkpoint is published in compressed-tensors format.
# This script upconverts it to bfloat16 so that the model can be re-compressed
# to FP8_BLOCK.

MODEL_ID = "moonshotai/Kimi-K2.6"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-BLOCK"

ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    "re:.*norm$",
    # ignore anything not in language_model
    "re:.*mm_projector.*",
    "re:.*vision.*",
]

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=ignore,
    converter=CompressedTensorsDequantizer(
        MODEL_ID,
        quant_config_key="text_config.quantization_config",
        ignore=ignore,
    ),
    max_workers=2,
    device="cuda:0",
)
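The ignore entries above use llm-compressor's "re:" prefix to mark regular-expression patterns for layers that should stay in high precision. A minimal sketch of how such patterns select module names, assuming plain re.match semantics after stripping the prefix (the actual matching logic lives in llm-compressor, and the module names below are hypothetical examples in the Kimi/DeepSeek naming style):

```python
import re

# A subset of the ignore list from the creation script above
ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*embed_tokens$",
    "re:.*norm$",
]

def is_ignored(name: str) -> bool:
    # Sketch only: treat each entry as a regex matched against the module name
    return any(re.match(p.removeprefix("re:"), name) for p in ignore)

print(is_ignored("model.layers.3.mlp.gate"))                 # True: MoE router stays high precision
print(is_ignored("model.layers.3.mlp.experts.0.gate_proj"))  # False: expert weights get quantized
```

Keeping the MoE router gates, norms, embeddings, and the MLA down-projections out of FP8 is a common precision/accuracy trade-off, since these layers are small but numerically sensitive.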

Preliminary Evaluations

  1. GSM8K Platinum:
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Kimi-K2.6-FP8-BLOCK,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678"

Recovery:

              moonshotai/Kimi-K2.6    RedHatAI/Kimi-K2.6-FP8-BLOCK
              (original in W4A16)     (this model)
Accuracy      94.29                   93.55
Recovery      -                       99.2%
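The recovery figure is simply the quantized score expressed as a fraction of the baseline score:

```python
baseline = 94.29   # moonshotai/Kimi-K2.6 (original, W4A16)
quantized = 93.55  # RedHatAI/Kimi-K2.6-FP8-BLOCK (this model)

recovery = 100 * quantized / baseline
print(f"{recovery:.1f}%")  # → 99.2%
```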

Note: More rigorous evaluations are currently in progress and will be available soon.

Model size: 1T params (Safetensors). Tensor types: BF16 · F8_E4M3 · F32.