# Quantization Deep Dive

## Overview
We explored multiple quantization strategies for LeWM. This document records what was tried, why it worked or didn't, and the engineering trade-offs.
## Formats Evaluated
| Format | Bits/weight | Compression | Cos vs f32 | Status |
|---|---|---|---|---|
| INT8 (encoder) | 8 | 4x | 0.9999 | Production |
| Q4 (predictor) | 4 | 8x | 0.998 | Production |
| INT8+Q4 (full) | mixed | 5x total | 0.999 | Production |
| Q4 encoder only | 4 | 8x | 0.93 | Rejected |
| Ternary {-1,0,+1} | 2 | 16x | ~0.85 | Rejected |
| WANDA 20% | sparse Q4 | 20% fewer weights | ~0.99 | Experimental |
| WANDA 40% | sparse Q4 | 40% fewer weights | ~0.97 | Experimental |
## INT8 Per-Channel (Encoder)

### Method

```
f32_weight [out, in] → per-output-channel quantization
                     → int8_weight [out, in]
                     → per-output-channel scales [out]
```

At inference:

```
result = input @ dequant(int8_weight)
       = input @ ((int8_weight - zero_point) × scales)
       = (input @ int8_weight) × scales   // zero_point = 0 for symmetric
```
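The scheme above can be sketched in a few lines of NumPy. This is an illustrative model of the math, not the PIE SIMD kernel, and the function names are my own:

```python
import numpy as np

def quantize_int8_per_channel(w):
    """Symmetric per-output-channel INT8: one scale per row of w [out, in]."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)        # [out, 1]
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0)  # guard all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.squeeze(1).astype(np.float32)        # int8 [out, in], f32 [out]

def int8_gemv(x, q, scales):
    """input @ dequant(weight): the scale factors out because zero_point = 0."""
    return (q.astype(np.float32) @ x) * scales
```

On Gaussian test weights this reconstruction lands at cosine ≥ 0.999 versus the f32 result, consistent with the table above.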
### Why It Works
- Per-channel preserves channel-level statistics. Each output channel gets its own scale.
- Symmetric (zero_point = 0) avoids the overhead of asymmetric quantization.
- INT8 GEMV on PIE SIMD: 16-wide multiply-accumulate per cycle.
- Encoder activations have predictable dynamic range after LayerNorm.
### Quality
| Layer | cos vs f32 |
|---|---|
| patch_embed | 0.9999 |
| encoder layer 0 | 0.9999 |
| encoder layer 5 | 0.9998 |
| encoder.proj | 0.9999 |
| Total | 0.9999 |
### Engineering Notes

- Activation quantization is dynamic (per-row at inference time), not static.
- QKV shared quantization: the same normalized input is quantized once and reused for Q, K, and V.
- Scales stored as f32 (4 bytes per channel) → negligible overhead.
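The dynamic per-row activation path can be sketched as follows (illustrative; the function name is mine). Each row computes its own scale at inference time, so no calibration statistics need to be stored:

```python
import numpy as np

def quantize_activations_dynamic(x):
    """Dynamic symmetric INT8: one scale per row, computed at inference time."""
    s = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    s = np.where(s > 0, s, 1.0)                 # guard all-zero rows
    q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
    return q, s                                 # dequantize as q * s
```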
## Q4 Block (Predictor)

### Method

```
f32_weight [out, in] → per-32-element-block quantization
                     → nibble-packed weight data
                     → per-block f16 scales
```

At inference:

```
for each block of 32:
    unpack nibbles → int8 [-8, 7]
    dot = simd_dot(input_block, unpacked)
    result += dot × block_scale
```
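The same loop in NumPy form (block math only; real storage packs two nibbles per byte, which is omitted here for clarity, and the names are mine):

```python
import numpy as np

BLOCK = 32

def quantize_q4(w):
    """Per-32-block Q4: ints in [-8, 7] stored as unsigned nibbles (value + 8)."""
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1) / 7.0
    scales = np.where(scales > 0, scales, 1.0).astype(np.float16)
    q = np.clip(np.round(blocks / scales[:, None].astype(np.float32)), -8, 7)
    return (q + 8).astype(np.uint8), scales     # nibbles in [0, 15], f16 scales

def q4_dot(x, nibbles, scales):
    """dot(x, dequant(w)) accumulated block by block, mirroring the loop above."""
    xb = x.reshape(-1, BLOCK)
    ints = nibbles.astype(np.float32) - 8.0     # unpack nibbles → [-8, 7]
    per_block = (xb * ints).sum(axis=1)         # simd_dot(input_block, unpacked)
    return float((per_block * scales.astype(np.float32)).sum())
```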
### Why It Works

- Per-block (32 elements) matches the predictor's adaLN normalization.
- adaLN modulation provides implicit normalization, so weights don't need per-channel precision.
- Smaller layers mean less error accumulation: the predictor has 4 layers vs the encoder's 6.
- PIE SIMD handles the nibble unpack + dot in tight loops.
### Why Full Q4 (Encoder + Predictor) Fails
When we skip INT8 and quantize the encoder to Q4:
| Layer | INT8 cos | Q4 cos |
|---|---|---|
| encoder | 0.9999 | 0.93 |
| predictor | 0.998 | 0.998 |
| Total | 0.999 | 0.93 |
Root cause: ViT encoder has high dynamic range in intermediate activations. The 32-element block granularity doesn't align with the encoder's channel statistics. INT8's per-channel precision is essential.
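A toy example of that root cause: one outlier in a 32-element block dictates the block scale, and every small weight in the block rounds to zero.

```python
import numpy as np

# 31 small weights plus one outlier sharing a single Q4 block.
block = np.full(32, 0.01, dtype=np.float32)
block[0] = 5.0

scale = np.abs(block).max() / 7.0        # outlier dictates the block scale
q = np.clip(np.round(block / scale), -8, 7)
recon = q * scale                        # only the outlier survives
```

Per-channel INT8 avoids this failure mode when a channel's weights share a common range, which, per the quality table above, holds for the encoder.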
### Engineering Notes

- Nibbles decode as `value - 8` → range [-8, 7]
- Block scales stored as f16 (2 bytes per block) → 64 bytes per 32×32 block
- Zero weights are rare in Q4 (<1%), so a skip-zero optimization was not implemented
## Ternary ({-1, 0, +1})

### Method

```
f32_weight → hard threshold
           → +1 if w > +tau
           → -1 if w < -tau
           →  0 otherwise
           → bit-packed ternary
```

At inference:

```
result = input @ ternary_weight
       = sum(sign(w_i) × x_i)   // pure addition/subtraction
```
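The multiply-free matvec can be modeled directly (illustrative; names are mine):

```python
import numpy as np

def ternarize(w, tau):
    """Hard-threshold to {-1, 0, +1}."""
    return (np.where(w > tau, 1, 0) + np.where(w < -tau, -1, 0)).astype(np.int8)

def ternary_matvec(x, t):
    """input @ ternary_weight: only additions and subtractions of x entries."""
    plus = (t == 1).astype(np.float32) @ x    # sum of x_i where w_i = +1
    minus = (t == -1).astype(np.float32) @ x  # sum of x_i where w_i = -1
    return plus - minus
```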
### Why It Fails
| Metric | Q4 | Ternary |
|---|---|---|
| Cos vs f32 | 0.998 | ~0.85 |
| Compression | 8x vs f32 | 16x vs f32 |
Root cause: adaLN generates 6 modulation vectors (scale1, shift1, gate1, scale2, shift2, gate2) that scale, shift, and gate the normalized activations. The magnitudes of these modulation vectors matter; ternary quantization destroys them.
What was tried:
- Various thresholds (τ = 0.5σ, σ, 2σ)
- STE (straight-through estimator) during fine-tuning
- Mixed ternary (ternary weights + fp32 scales)
None recovered quality sufficiently.
## WANDA Pruning

### Method

1. Forward pass on calibration set → collect activations
2. Compute WANDA score: s(w) = |w| × ||a||
3. Sort scores, prune bottom N%
4. Fine-tune 1 epoch (not yet applied; current results are prune-only)
5. Re-quantize remaining weights to Q4
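Steps 2–3 can be sketched as follows. This is a global-threshold simplification (the WANDA paper prunes within per-output comparison groups), and all names here are mine:

```python
import numpy as np

def wanda_prune(w, act_norms, frac):
    """Zero the lowest-scoring `frac` of weights by s = |w| * ||a||.
    w: [out, in]; act_norms: per-input-channel activation L2 norms [in]."""
    scores = np.abs(w) * act_norms[None, :]   # step 2: WANDA score
    k = int(frac * w.size)
    pruned = w.copy()
    if k > 0:
        thresh = np.partition(scores.ravel(), k - 1)[k - 1]
        pruned[scores <= thresh] = 0.0        # step 3: prune bottom frac
    return pruned
```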
### Results
| Model | Pruned | Size | Cos | Notes |
|---|---|---|---|---|
| Baseline Q4 | 0% | 23.6 MB | 0.998 | Reference |
| WANDA 20% | 20% | 22.0 MB | ~0.99 | Pruned, no fine-tune |
| WANDA 40% | 40% | 25.1 MB | ~0.97 | Bitmap overhead exceeds savings |
### Engineering Notes
- 40% pruned model has larger binary than 20% due to bitmap overhead
- Skip-zero GEMV needs hardware support (not implemented in PIE SIMD)
- Would benefit from fine-tuning after pruning
## Shared Lessons
Per-channel > per-block for encoders: Encoders have high per-channel variance. INT8's per-channel precision beats Q4's per-block.
Predictors are quantization-friendly: adaLN provides implicit normalization. Predictors can use Q4 with minimal quality loss.
Architecture changes beat quantization: hybrid_ALAL (3.0M params) achieves similar quality to slim_96d (10.2M params) at INT8+Q4. Architecture > precision.
Epoch 1 models have headroom: All current slim/epoch-1 models will improve with longer training. The quality comparisons should be re-run at convergence.
Hardwired is the limit: Q4 weights decompose to shift-add operations. Zero multiplications, zero memory fetches. That's the theoretical floor.
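As a toy illustration of that decomposition: multiplying by any 4-bit integer weight needs at most three shifted adds (hypothetical helper, integer inputs for clarity):

```python
def mul_by_q4(x, w):
    """x * w for a Q4 weight w in [-8, 7], using only shifts and adds."""
    assert -8 <= w <= 7
    neg = w < 0
    mag = -w if neg else w
    acc = 0
    for bit in range(4):             # magnitude fits in 4 bits (for w = -8)
        if (mag >> bit) & 1:
            acc += x << bit          # add one shifted copy of x per set bit
    return -acc if neg else acc
```

In a hardwired kernel the loop disappears entirely: each weight's set bits become fixed shift-add instructions at code-generation time.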