
# Quantization Deep Dive

## Overview

We explored multiple quantization strategies for LeWM. This document records what was tried, why it worked or didn't, and the engineering trade-offs.

## Formats Evaluated

| Format | Bits/weight | Compression | Cos vs f32 | Status |
|---|---|---|---|---|
| INT8 (encoder) | 8 | 4x | 0.9999 | Production |
| Q4 (predictor) | 4 | 2x more | 0.998 | Production |
| INT8+Q4 (full) | mixed | 5x total | 0.999 | Production |
| Q4 encoder only | 4 | 4x | 0.93 | Rejected |
| Ternary {-1,0,+1} | 2 | 8x | ~0.85 | Rejected |
| WANDA 20% sparse | Q4 | 20% fewer | ~0.99 | Experimental |
| WANDA 40% sparse | Q4 | 40% fewer | ~0.97 | Experimental |

## INT8 Per-Channel (Encoder)

### Method

```
f32_weight [out, in] → per-output-channel quantization
                     → int8_weight [out, in]
                     → per-output-channel scales [out]
```

At inference:

```
result = input @ dequant(weight)
       = input @ ((int8_weight - zero_point) × scales)
       = (input @ int8_weight) × scales   // zero_point = 0 for symmetric
```
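The per-channel scheme above can be sketched in a few lines of NumPy. This is a reference sketch of the arithmetic, not the production PIE SIMD kernel; the shapes and names are illustrative.

```python
import numpy as np

def quantize_int8_per_channel(w):
    """Symmetric per-output-channel INT8 quantization of a [out, in] weight."""
    absmax = np.abs(w).max(axis=1, keepdims=True)        # [out, 1]
    scales = np.where(absmax == 0, 1.0, absmax / 127.0)  # one scale per channel
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def int8_gemv(x, q, scales):
    # zero_point = 0 (symmetric), so the per-channel scale factors out of the dot.
    return (q.astype(np.int32) @ x) * scales.ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=128).astype(np.float32)

q, scales = quantize_int8_per_channel(w)
y_ref = w @ x
y_q = int8_gemv(x, q, scales)
cos = float(y_q @ y_ref / (np.linalg.norm(y_q) * np.linalg.norm(y_ref)))
```

Because `zero_point = 0`, the integer GEMV runs entirely in int8/int32 and the float scale is applied once per output channel at the end.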

### Why It Works

  1. Per-channel preserves channel-level statistics. Each output channel gets its own scale.
  2. Symmetric (zero_point = 0) avoids the overhead of asymmetric quantization.
  3. INT8 GEMV on PIE SIMD: 16-wide multiply-accumulate per cycle.
  4. Encoder activations have predictable dynamic range after LayerNorm.

### Quality

| Layer | Cos vs f32 |
|---|---|
| patch_embed | 0.9999 |
| encoder layer 0 | 0.9999 |
| encoder layer 5 | 0.9998 |
| encoder.proj | 0.9999 |
| Total | 0.9999 |

### Engineering Notes

- Activation quantization is dynamic (per-row at inference time), not static.
- QKV shared quantization: the same normalized input is quantized once and reused for Q, K, and V.
- Scales are stored as f32 (4 bytes per channel); the overhead is negligible.

## Q4 Block (Predictor)

### Method

```
f32_weight [out, in] → per-32-element-block quantization
                     → nibble-packed weight data
                     → per-block f16 scales
```

At inference:

```
for each block of 32:
    unpack nibbles → int8 [-8, 7]
    dot = simd_dot(input_block, unpacked)
    result += dot × block_scale
```
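The pack/unpack-and-accumulate loop can be prototyped in NumPy. This is a reference sketch of the same arithmetic, not the SIMD kernel:

```python
import numpy as np

BLOCK = 32

def quantize_q4(w):
    """Per-32-element-block symmetric 4-bit quantization with f16 block scales."""
    blocks = w.reshape(-1, BLOCK)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(absmax == 0, 1.0, absmax / 7.0).astype(np.float16)
    q = np.clip(np.round(blocks / scales.astype(np.float32)), -8, 7).astype(np.int8)
    nibbles = (q + 8).astype(np.uint8)                   # shift to unsigned 0..15
    packed = nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)  # two weights per byte
    return packed, scales

def q4_dot(x, packed, scales):
    lo = (packed & 0x0F).astype(np.int8) - 8             # unpack nibbles -> [-8, 7]
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty((packed.shape[0], BLOCK), dtype=np.int8)
    q[:, 0::2], q[:, 1::2] = lo, hi
    xb = x.reshape(-1, BLOCK)
    # per-block dot products, each scaled by its f16 block scale
    return float(((q * xb).sum(axis=1) * scales.astype(np.float32).ravel()).sum())

rng = np.random.default_rng(2)
w = rng.normal(size=256).astype(np.float32)
x = rng.normal(size=256).astype(np.float32)
packed, scales = quantize_q4(w)
approx, exact = q4_dot(x, packed, scales), float(w @ x)
```

Storage per 32-element block: 16 bytes of nibbles plus a 2-byte f16 scale, i.e. 4.5 bits/weight.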

### Why It Works

  1. Per-block (32 elements) matches the predictor's adaLN normalization.
  2. adaLN modulation provides implicit normalization, so weights don't need per-channel precision.
  3. Fewer layers mean less error accumulation: the predictor has 4 layers vs the encoder's 6.
  4. PIE SIMD handles the nibble unpack + dot in tight loops.

### Why Full Q4 (Encoder + Predictor) Fails

When we skip INT8 and quantize the encoder to Q4:

| Layer | INT8 cos | Q4 cos |
|---|---|---|
| encoder | 0.9999 | 0.93 |
| predictor | 0.998 | 0.998 |
| Total | 0.999 | 0.93 |

Root cause: ViT encoder has high dynamic range in intermediate activations. The 32-element block granularity doesn't align with the encoder's channel statistics. INT8's per-channel precision is essential.
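A toy example of the failure mode: one outlier in a 32-element block sets the block scale, and at 4 bits every small weight in that block collapses to zero, while 8 bits still resolve them. (The values here are made up for illustration.)

```python
import numpy as np

# One outlier among 31 small weights: the block scale is set by the outlier.
block = np.array([0.01] * 31 + [1.0], dtype=np.float32)

s4 = np.abs(block).max() / 7.0                      # Q4 block scale
q4 = np.clip(np.round(block / s4), -8, 7) * s4      # 4-bit reconstruction

s8 = np.abs(block).max() / 127.0                    # INT8 scale over the same values
q8 = np.clip(np.round(block / s8), -127, 127) * s8  # 8-bit reconstruction
```

With 16 levels, `0.01 / s4 ≈ 0.07` rounds to zero; with 256 levels the small weights survive. High-dynamic-range encoder activations amplify exactly this effect.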

### Engineering Notes

- Nibbles decode as `value - 8`, giving the range [-8, 7].
- Block scales are stored as f16 (2 bytes per 32-element block), i.e. 64 bytes of scales per 32×32 weight tile.
- Zero weights are rare in Q4 (<1%), so a skip-zero optimization was not implemented.

## Ternary ({-1, 0, +1})

### Method

```
f32_weight → hard threshold
           → +1 if w > +tau
           → -1 if w < -tau
           →  0 otherwise
           → bit-packed ternary
```

At inference:

```
result = input @ ternary_weight
       = sum(sign(w_i) × x_i)   // pure addition/subtraction
```
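The scheme in NumPy (illustrative shapes; the threshold shown is one of those tried below):

```python
import numpy as np

def ternarize(w, tau):
    """Hard-threshold ternarization to {-1, 0, +1}."""
    return np.where(w > tau, 1, np.where(w < -tau, -1, 0)).astype(np.int8)

rng = np.random.default_rng(4)
w = rng.normal(size=(16, 64)).astype(np.float32)
x = rng.normal(size=64).astype(np.float32)

t = ternarize(w, tau=0.5 * w.std())  # tau = 0.5 sigma
y = t.astype(np.float32) @ x         # pure add/subtract of input elements
y_ref = w @ x
cos = float((y @ y_ref) / (np.linalg.norm(y) * np.linalg.norm(y_ref)))
```

The matmul degenerates to signed accumulation, which is why the compression and speed are so attractive; the quality cost is what kills it here.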

### Why It Fails

| Metric | Q4 | Ternary |
|---|---|---|
| Cos vs f32 | 0.998 | ~0.85 |
| Compression | 8x vs f32 | 16x vs f32 |

Root cause: adaLN generates six modulation vectors (scale1, shift1, gate1, scale2, shift2, gate2) that scale, shift, and gate the normalized activations. The magnitudes of these modulation vectors matter, and ternary quantization destroys them.
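For reference, the modulation path has the usual DiT-style adaLN shape. This is an assumed form for illustration only: the `attention` and `mlp` stand-ins below are not the real predictor sublayers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

attention = lambda h: h     # identity stand-in, not the real sublayer
mlp = lambda h: np.tanh(h)  # tanh stand-in, not the real sublayer

def ada_ln_block(x, mod):
    """mod: [6, d] modulation vectors regressed from the conditioning signal."""
    scale1, shift1, gate1, scale2, shift2, gate2 = mod
    h = x + gate1 * attention(layer_norm(x) * (1 + scale1) + shift1)
    return h + gate2 * mlp(layer_norm(h) * (1 + scale2) + shift2)

rng = np.random.default_rng(5)
x = rng.normal(size=(2, 8))          # 2 tokens, width 8 (toy sizes)
mod = 0.5 * rng.normal(size=(6, 8))  # the six modulation vectors
out = ada_ln_block(x, mod)
```

Every one of the six vectors enters multiplicatively or additively with its real-valued magnitude; collapsing the weights that produce `mod` to {-1, 0, +1} throws exactly that information away.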

What was tried:

- Various thresholds (τ = 0.5σ, σ, 2σ)
- STE (straight-through estimator) during fine-tuning
- Mixed ternary (ternary weights + fp32 scales)

None recovered quality sufficiently.

## WANDA Pruning

### Method

1. Forward pass on a calibration set → collect activations
2. Compute the WANDA score: s(w) = |w| × ||a||
3. Sort scores, prune the bottom N%
4. Fine-tune for 1 epoch
5. Re-quantize remaining weights to Q4
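Steps 1-3 can be sketched as follows (shapes are illustrative; WANDA compares scores within each output row, which is what the sketch does):

```python
import numpy as np

def wanda_prune(w, acts, sparsity):
    """Zero the lowest-scoring weights per output row, score s(w) = |w| * ||a||."""
    a_norm = np.linalg.norm(acts, axis=0)  # ||a||_2 per input feature
    score = np.abs(w) * a_norm             # broadcasts over output rows
    k = int(sparsity * w.shape[1])         # weights to prune in each row
    idx = np.argsort(score, axis=1)[:, :k] # k lowest-scoring columns per row
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, idx, False, axis=1)
    return w * mask, mask

rng = np.random.default_rng(6)
w = rng.normal(size=(32, 64)).astype(np.float32)      # one layer's weight
acts = rng.normal(size=(128, 64)).astype(np.float32)  # calibration activations
w_pruned, mask = wanda_prune(w, acts, sparsity=0.20)
```

The surviving weights (`mask == True`) are then re-quantized to Q4, per step 5.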

### Results

| Model | Pruned | Size | Cos | Notes |
|---|---|---|---|---|
| Baseline Q4 | 0% | 23.6 MB | 0.998 | Reference |
| WANDA 20% | 20% | 22.0 MB | ~0.99 | Pruned, no fine-tune |
| WANDA 40% | 40% | 25.1 MB | ~0.97 | Bitmap overhead exceeds savings |

### Engineering Notes

- The 40% pruned model has a larger binary than the 20% model due to bitmap overhead.
- Skip-zero GEMV needs hardware support (not implemented in PIE SIMD).
- Would benefit from fine-tuning after pruning.

## Shared Lessons

  1. Per-channel > per-block for encoders: Encoders have high per-channel variance. INT8's per-channel precision beats Q4's per-block.

  2. Predictors are quantization-friendly: adaLN provides implicit normalization. Predictors can use Q4 with minimal quality loss.

  3. Architecture changes beat quantization: hybrid_ALAL (3.0M params) achieves similar quality to slim_96d (10.2M params) at INT8+Q4. Architecture > precision.

  4. Epoch 1 models have headroom: All current slim/epoch-1 models will improve with longer training. The quality comparisons should be re-run at convergence.

  5. Hardwired is the limit: Q4 weights decompose to shift-add operations. Zero multiplications, zero memory fetches. That's the theoretical floor.