mmBERT-L4H384 / mmBERT-L7H384 / mmBERT-L13H384

Pruned variants of mmBERT-small.

Models

⚠️ Note: Pruning-Only (Not Distilled)

These are pruning-only variants—we simply remove layers without any knowledge distillation or fine-tuning. Fully trained or distilled models with the same architecture may outperform these pruned versions.

Overview

These models are created by layer pruning from mmBERT-small (22 layers, 384 hidden dimensions). We select specific layers to retain while preserving the ModernBERT global/local attention cadence.

Layer Selection and Evaluation

We fine-tuned the pruned models for information retrieval on the MS MARCO dataset and evaluated them on nanoBEIR (NDCG@10).
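For readers unfamiliar with the metric, NDCG@10 rewards placing relevant documents near the top of the ranking. A minimal sketch for binary relevance labels (illustrative only, not the nanoBEIR evaluation code):

```python
import math

def ndcg_at_10(ranked_rels: list[int]) -> float:
    """NDCG@10 from relevance labels of documents in ranked order (1 = relevant)."""
    def dcg(rels):
        # Discounted cumulative gain over the top 10 positions.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal else 0.0

print(round(ndcg_at_10([1, 0, 1]), 4))  # 0.9197 — a relevant doc slipping to rank 3 costs ~8%
```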

The numbers in model names (e.g., 0_1_2_18) indicate which layers are retained from the original 22-layer model:

  • L4H384 (0_1_2_18): Keeps layers 0, 1, 2, and 18 → 4 layers total
  • L7H384 (0_1_2_3_4_5_18): Keeps layers 0–5 and 18 → 7 layers total
  • L13H384 (0_1_2_3_4_5_6_7_8_9_10_11_18): Keeps layers 0–11 and 18 → 13 layers total
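The mapping from model name to retained layers is mechanical; a small helper (hypothetical, not part of any released code) recovers the indices from the suffix:

```python
def retained_layers(model_name: str) -> list[int]:
    """Parse the layer-index suffix of a pruned-model name, e.g. 'L4H384-0_1_2_18'."""
    suffix = model_name.split("-", 1)[1]
    return [int(i) for i in suffix.split("_")]

print(retained_layers("L7H384-0_1_2_3_4_5_18"))  # [0, 1, 2, 3, 4, 5, 18]
```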

Why These Configurations?

We chose these "official" configurations based on two criteria:

  1. Simplicity: Consecutive layer indices (0, 1, 2, 3, ...) are easier to understand and reproduce than scattered indices like 0_1_2_3_6_8_18.

  2. Competitive performance: While not always the absolute best score, these configurations perform competitively within their layer count category.

For example, L7H384-0_1_2_3_6_8_18 (mean: 0.4722) slightly outperforms our official pick L7H384-0_1_2_3_4_5_18 (mean: 0.4693), but the consecutive layer pattern is more interpretable and the performance difference is marginal.

Why Layer 18?

ModernBERT uses an alternating attention pattern:

  • Global attention (g): Full self-attention across all tokens
  • Local attention (l): Attention within a sliding window

The pattern follows a g-l-l-g-l-l-... rhythm. In the original 22-layer mmBERT-small, both layer 18 and layer 21 are global attention layers, with layer 21 being the final layer.
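Under this cadence (assuming the standard ModernBERT setting of a global layer every third block, starting at layer 0), the global layers of a 22-layer model fall at indices divisible by 3:

```python
GLOBAL_EVERY_N = 3  # assumed ModernBERT-style cadence: g-l-l repeating

def is_global(layer_idx: int) -> bool:
    """True if the layer uses full (global) self-attention under the g-l-l rhythm."""
    return layer_idx % GLOBAL_EVERY_N == 0

global_layers = [i for i in range(22) if is_global(i)]
print(global_layers)  # [0, 3, 6, 9, 12, 15, 18, 21]
```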

However, our experiments showed that ending with layer 18 outperforms ending with layer 21 at the 7- and 13-layer sizes, while the 4-layer comparison narrowly favors layer 21:

  • L7H384-0_1_2_3_4_5_18 (mean: 0.4693) vs L7H384-0_1_2_3_4_5_21 (mean: 0.4629)
  • L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18 (mean: 0.4964) vs L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21 (mean: 0.4800)
  • L4H384-0_1_2_18 (mean: 0.4530) vs L4H384-0_1_2_21 (mean: 0.4558), the one case where layer 21 edges ahead

This suggests that the representations at layer 18 are more effective for retrieval tasks when combined with early layers, possibly because layer 18 provides a better balance between abstraction and retention of fine-grained information.

Experimental Variations

We explored different pruning strategies by shifting the start positions and coverage:

  • Front-heavy (e.g., 0_1_2_3_4_5_18): Retains early layers, skips middle layers
  • Back-heavy (e.g., 0_16_17_18_19_20_21): Retains later layers
  • Distributed (e.g., 0_1_2_3_4_5_6_7_8_10_12_15_18): Spreads retained layers across depth

This probes the trade-off between depth (how many layers) and coverage (which parts of the network contribute).
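The three strategies can be written as small index generators (hypothetical helpers, shown only to make the naming scheme concrete):

```python
TOTAL_LAYERS = 22  # depth of mmBERT-small

def front_heavy(n: int, final: int = 18) -> list[int]:
    """First n-1 layers plus one late global-attention layer, e.g. 0_1_2_3_4_5_18."""
    return list(range(n - 1)) + [final]

def back_heavy(n: int) -> list[int]:
    """Layer 0 plus the last n-1 layers, e.g. 0_16_17_18_19_20_21."""
    return [0] + list(range(TOTAL_LAYERS - (n - 1), TOTAL_LAYERS))

print(front_heavy(7))  # [0, 1, 2, 3, 4, 5, 18]
print(back_heavy(7))   # [0, 16, 17, 18, 19, 20, 21]
```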

Scores (NDCG@10) — All L4/L7/L13 Runs

| model | mean | NanoArguAna | NanoClimateFEVER | NanoDBPedia | NanoFEVER | NanoFiQA2018 | NanoHotpotQA | NanoMSMARCO | NanoNFCorpus | NanoNQ | NanoQuoraRetrieval | NanoSCIDOCS | NanoSciFact | NanoTouche2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mmBERT-small (22 layers) | 0.5151 | 0.4345 | 0.2888 | 0.4548 | 0.7534 | 0.4199 | 0.6629 | 0.5853 | 0.2849 | 0.5634 | 0.9367 | 0.2704 | 0.5042 | 0.5378 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_12 | 0.4553 | 0.3908 | 0.2715 | 0.4385 | 0.7290 | 0.3289 | 0.6191 | 0.4702 | 0.2178 | 0.4649 | 0.9198 | 0.2152 | 0.4402 | 0.4129 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_15 | 0.4576 | 0.4395 | 0.2457 | 0.4284 | 0.7472 | 0.3237 | 0.5920 | 0.4918 | 0.2199 | 0.4531 | 0.9195 | 0.1852 | 0.4820 | 0.4208 |
| **L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18** | **0.4964** | **0.4462** | **0.2955** | **0.4907** | **0.7564** | **0.3886** | **0.6469** | **0.5142** | **0.2644** | **0.5268** | **0.9412** | **0.2326** | **0.4840** | **0.4662** |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21 | 0.4800 | 0.4162 | 0.2858 | 0.4695 | 0.7197 | 0.3358 | 0.6338 | 0.5512 | 0.2603 | 0.5127 | 0.9305 | 0.2389 | 0.4457 | 0.4393 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_12_18 | 0.4904 | 0.4594 | 0.2619 | 0.4904 | 0.7481 | 0.3832 | 0.6552 | 0.5476 | 0.2540 | 0.5092 | 0.9183 | 0.2411 | 0.4518 | 0.4551 |
| L13H384-0_1_2_3_4_5_6_7_8_9_12_15_18 | 0.4791 | 0.4401 | 0.2754 | 0.4849 | 0.7384 | 0.3201 | 0.6369 | 0.5059 | 0.2478 | 0.5237 | 0.9190 | 0.2602 | 0.4666 | 0.4099 |
| L13H384-0_1_2_3_4_5_6_7_8_10_12_15_18 | 0.4877 | 0.4384 | 0.2749 | 0.4937 | 0.7299 | 0.3366 | 0.6698 | 0.5314 | 0.2588 | 0.5073 | 0.9264 | 0.2430 | 0.4879 | 0.4414 |
| L13H384-0_1_2_3_4_5_6_7_9_12_15_18_21 | 0.4810 | 0.4007 | 0.2739 | 0.4989 | 0.7180 | 0.3403 | 0.6441 | 0.5257 | 0.2541 | 0.5093 | 0.9187 | 0.2413 | 0.4905 | 0.4371 |
| L13H384-0_10_11_12_13_14_15_16_17_18_19_20_21 | 0.4806 | 0.3938 | 0.2855 | 0.4911 | 0.7974 | 0.3504 | 0.6034 | 0.5211 | 0.2361 | 0.4486 | 0.9144 | 0.2257 | 0.4508 | 0.5294 |
| L13H384-9_10_11_12_13_14_15_16_17_18_19_20_21 | 0.4307 | 0.3901 | 0.2621 | 0.4753 | 0.7185 | 0.2927 | 0.5371 | 0.4487 | 0.2361 | 0.3267 | 0.8605 | 0.1513 | 0.3860 | 0.5143 |
| L7H384-0_1_2_3_4_5_6 | 0.4291 | 0.3635 | 0.2839 | 0.4665 | 0.6299 | 0.2958 | 0.5433 | 0.4692 | 0.1841 | 0.4174 | 0.8800 | 0.2217 | 0.4570 | 0.3660 |
| L7H384-0_1_2_3_4_5_9 | 0.4282 | 0.3929 | 0.2719 | 0.4447 | 0.6674 | 0.2890 | 0.5192 | 0.4847 | 0.2226 | 0.3850 | 0.8870 | 0.2074 | 0.4145 | 0.3804 |
| L7H384-0_1_2_3_4_5_12 | 0.4204 | 0.4035 | 0.2501 | 0.4283 | 0.6245 | 0.3044 | 0.5350 | 0.4518 | 0.1900 | 0.3760 | 0.8763 | 0.2073 | 0.4438 | 0.3748 |
| **L7H384-0_1_2_3_4_5_18** | **0.4693** | **0.3879** | **0.2782** | **0.5046** | **0.7257** | **0.3631** | **0.6139** | **0.4633** | **0.2353** | **0.4623** | **0.8951** | **0.2310** | **0.5111** | **0.4296** |
| L7H384-0_1_2_3_4_5_21 | 0.4629 | 0.4331 | 0.2731 | 0.4958 | 0.7667 | 0.3368 | 0.5943 | 0.4194 | 0.2666 | 0.4428 | 0.8742 | 0.2542 | 0.4220 | 0.4386 |
| L7H384-0_1_2_3_6_7_8 | 0.4236 | 0.3903 | 0.2590 | 0.4613 | 0.6097 | 0.2692 | 0.5962 | 0.4556 | 0.1790 | 0.3755 | 0.8501 | 0.2157 | 0.4596 | 0.3850 |
| L7H384-0_1_2_3_6_7_12 | 0.4149 | 0.3752 | 0.2369 | 0.4489 | 0.5763 | 0.2798 | 0.5630 | 0.4600 | 0.1955 | 0.3881 | 0.8458 | 0.2303 | 0.4260 | 0.3671 |
| L7H384-0_1_2_3_6_8_12 | 0.4171 | 0.3215 | 0.2305 | 0.4491 | 0.5696 | 0.2803 | 0.5615 | 0.4959 | 0.1897 | 0.3790 | 0.8756 | 0.2313 | 0.4600 | 0.3787 |
| L7H384-0_1_2_3_6_8_18 | 0.4722 | 0.3988 | 0.2619 | 0.5002 | 0.7551 | 0.3186 | 0.6438 | 0.5024 | 0.2429 | 0.4259 | 0.8969 | 0.2162 | 0.5054 | 0.4704 |
| L7H384-0_16_17_18_19_20_21 | 0.4589 | 0.3684 | 0.2711 | 0.4949 | 0.7224 | 0.3087 | 0.5750 | 0.4676 | 0.2317 | 0.4541 | 0.8829 | 0.2050 | 0.4668 | 0.5171 |
| L7H384-15_16_17_18_19_20_21 | 0.4299 | 0.3728 | 0.2747 | 0.4572 | 0.6557 | 0.2529 | 0.5594 | 0.4474 | 0.2197 | 0.3528 | 0.8883 | 0.1887 | 0.4160 | 0.5034 |
| L4H384-0_1_2_3 | 0.3329 | 0.2011 | 0.1529 | 0.4820 | 0.3088 | 0.1937 | 0.4178 | 0.3890 | 0.1897 | 0.3238 | 0.8441 | 0.2045 | 0.2912 | 0.3286 |
| **L4H384-0_1_2_18** | **0.4530** | **0.3806** | **0.2544** | **0.4657** | **0.7230** | **0.2793** | **0.5704** | **0.5060** | **0.2270** | **0.4283** | **0.8942** | **0.2246** | **0.4671** | **0.4682** |
| L4H384-0_1_2_21 | 0.4558 | 0.3801 | 0.2553 | 0.4871 | 0.7350 | 0.3097 | 0.5734 | 0.4899 | 0.2510 | 0.4193 | 0.8860 | 0.2249 | 0.4620 | 0.4517 |
| L4H384-0_19_20_21 | 0.4408 | 0.3888 | 0.2651 | 0.4880 | 0.6629 | 0.3018 | 0.6010 | 0.4224 | 0.2342 | 0.4086 | 0.8714 | 0.2027 | 0.4238 | 0.4597 |
| L4H384-18_19_20_21 | 0.4130 | 0.3067 | 0.2546 | 0.4740 | 0.6206 | 0.2363 | 0.5393 | 0.4074 | 0.2233 | 0.2879 | 0.8850 | 0.2015 | 0.4270 | 0.5058 |

Bold rows indicate the official picks for each layer count.

Key Findings

  1. Front-heavy pruning works best: Retaining early layers (0–N) plus a global attention layer consistently outperforms other strategies.

  2. Layer 18 > Layer 21 (mostly): At 7 and 13 layers, ending with layer 18 (an intermediate global attention layer) outperforms ending with layer 21 (the final global attention layer); only the 4-layer comparison narrowly favors layer 21. This suggests that intermediate global attention representations combine better with early layers for retrieval.

  3. Early layers are critical: Models that skip early layers (e.g., 9_10_11_... or 15_16_17_...) show significant performance degradation.

  4. Diminishing returns with depth: L13 (0.4964) improves on L7 (0.4693) by only ~0.027 NDCG (≈6% relative) for nearly double the layers.
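Relative to the full 22-layer baseline, the official picks retain most of its quality; the ratios can be checked directly from the mean scores in the table above:

```python
FULL = 0.5151  # mmBERT-small (22 layers), mean NDCG@10
picks = {"L4H384": 0.4530, "L7H384": 0.4693, "L13H384": 0.4964}

# Fraction of the full model's mean NDCG@10 retained by each pruned pick.
for name, score in picks.items():
    print(f"{name}: {score / FULL:.1%} of full-model quality")
# L4H384: 87.9%, L7H384: 91.1%, L13H384: 96.4%
```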

License

MIT
