Great custom quant!

#1
by Mushoz - opened

This IQ4_XS quant seems to have an excellent size-to-performance ratio, thanks! Would you mind sharing your quant recipe / regex that you used? I am seeing:

ffn_down_exps.weight -> IQ4_XS
ffn_gate_exps.weight -> IQ3_S
ffn_up_exps.weight -> IQ3_S

And how did you get to these particular choices? I am trying to learn as much as I can. Thanks! :)

@Mushoz Thanks!

You've got the recipe correct there, yes. @ddh0 and I found that quantizing only the routed experts and leaving the rest of the model in Q8_0 gives better KLD (KL divergence against the full-precision model) and better long-context performance than the standard llama.cpp quantization recipes, which quantize the entire model to varying degrees. E.g., a regular Q4_K_M quant would quantize the attention tensors too.
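
If you want to reproduce that kind of comparison, llama.cpp's perplexity tool can measure KL divergence against a reference model. Roughly like this (the file names and test set are placeholders, not the exact setup we used):

# 1) record reference logits from the Q8_0 (or full-precision) model
./build/bin/llama-perplexity -m model-Q8_0.gguf -f wiki.test.raw --kl-divergence-base logits.kld

# 2) run the custom quant against those logits to get the KLD stats
./build/bin/llama-perplexity -m model-custom.gguf -f wiki.test.raw --kl-divergence-base logits.kld --kl-divergence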

Because these large MoEs have most of their size in the routed experts, something like 85%+ of the model is just those conditional weights. Size-wise, going from Q8 to Q4 on everything that isn't a routed expert only shaves a few GB off the entire model, but you're trading a lot of accuracy for it.
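
To put rough, purely illustrative numbers on it (not this model's actual tensor sizes):

# hypothetical 100 GB model stored at Q8_0 (~8 bpw)
# routed experts  (~85%): 85 GB -> ~4 bpw is ~42.5 GB, saving ~42 GB
# everything else (~15%): 15 GB -> ~4 bpw is  ~7.5 GB, saving only ~7 GB
#                         while costing accuracy in attention and the shared weights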

Basically, it's better to quantize the sparse part of the model instead, which is the routed experts. From experimentation, the UP and GATE tensors are a little less sensitive to quantization than the DOWN tensors, so it's recommended to keep DOWN one quantization level higher in quality than the other two, as the recipe below shows. There are also some fused optimizations that can kick in when UP and GATE share the same quantization type, so you typically want to keep those two the same.

The recipe for the IQ4_XS is as you say:

MIX=IQ4_XS
TYPE_FFN_UP_EXPS=IQ3_S
TYPE_FFN_GATE_EXPS=IQ3_S
TYPE_FFN_DOWN_EXPS=IQ4_XS
TYPE_DEFAULT=Q8_0

and I have a quantization script that takes those recipes and produces quants from them (lots of variable replacement in the script):

./build/bin/llama-quantize \
    --tensor-type ffn_up_exps=$TYPE_FFN_UP_EXPS \
    --tensor-type ffn_gate_exps=$TYPE_FFN_GATE_EXPS \
    --tensor-type ffn_down_exps=$TYPE_FFN_DOWN_EXPS \
    --imatrix $imatrix $gguf $output_filename $TYPE_DEFAULT
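
For completeness, a minimal sketch of how that looks end to end (filenames here are placeholders, not my actual script):

#!/usr/bin/env bash
set -euo pipefail

# recipe (the IQ4_XS mix above)
MIX=IQ4_XS
TYPE_FFN_UP_EXPS=IQ3_S
TYPE_FFN_GATE_EXPS=IQ3_S
TYPE_FFN_DOWN_EXPS=IQ4_XS
TYPE_DEFAULT=Q8_0

# inputs/outputs -- placeholder paths
imatrix=imatrix.dat
gguf=model-BF16.gguf
output_filename=model-${MIX}.gguf

./build/bin/llama-quantize \
    --tensor-type ffn_up_exps=$TYPE_FFN_UP_EXPS \
    --tensor-type ffn_gate_exps=$TYPE_FFN_GATE_EXPS \
    --tensor-type ffn_down_exps=$TYPE_FFN_DOWN_EXPS \
    --imatrix $imatrix $gguf $output_filename $TYPE_DEFAULT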

The total model BPW (bits per weight) comes in at just about what a standard IQ4_XS would be, so I named it that, even though the name isn't 100% accurate. The other quants follow a similar pattern, but I've misplaced their precise recipes; you could inspect the GGUFs to recreate them.
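
If you do want to recreate the other mixes, the per-tensor types are stored in the GGUF itself. Something along these lines should list them (assuming you have the gguf Python package, which ships llama.cpp's gguf-dump script; the filename is a placeholder):

pip install gguf
gguf-dump model-IQ3_XS.gguf | grep -E 'ffn_(up|gate|down)_exps'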
