mmBERT-L4H384 / mmBERT-L7H384 / mmBERT-L13H384
Pruned variants of mmBERT-small.
⚠️ Note: Pruning-Only (Not Distilled)
These are pruning-only variants—we simply remove layers without any knowledge distillation or fine-tuning. Fully trained or distilled models with the same architecture may outperform these pruned versions.
Overview
These models are created by layer pruning from mmBERT-small (22 layers, 384 hidden dimensions). We select specific layers to retain while preserving the ModernBERT global/local attention cadence.
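Layer pruning of this kind can be sketched in a few lines. The helper below is a minimal illustration, not the actual script used for these models; in a Hugging Face ModernBERT-style model the transformer blocks typically live in an `nn.ModuleList` (the exact attribute path is an assumption to verify against the model class).

```python
def prune_layers(layers, keep):
    """Return the subset of transformer blocks whose original indices
    appear in `keep`, preserving order. `layers` can be any indexable
    sequence, e.g. a torch.nn.ModuleList of encoder blocks."""
    out_of_range = [i for i in keep if not 0 <= i < len(layers)]
    if out_of_range:
        raise IndexError(f"layer indices {out_of_range} out of range")
    return [layers[i] for i in keep]

# e.g. the L4H384 variant keeps layers 0, 1, 2 and 18 of the 22-layer stack
kept = prune_layers(list(range(22)), [0, 1, 2, 18])
```

With torch, the pruned list would be re-wrapped as an `nn.ModuleList` and the config's `num_hidden_layers` updated to match.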
Layer Selection and Evaluation
We fine-tuned the pruned models for information retrieval on the MS MARCO dataset and evaluated them on nanoBEIR (NDCG@10).
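NDCG@10 rewards rankings that place relevant documents near the top of the first 10 results. As a reference point, here is a minimal binary-relevance implementation of the metric (BEIR-style harnesses normally compute it with pytrec_eval and support graded relevance; this sketch assumes binary labels):

```python
import math

def ndcg_at_k(ranking, relevant, k=10):
    """NDCG@k with binary relevance: `ranking` is the list of retrieved
    doc ids (best first), `relevant` the set of gold doc ids."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, doc in enumerate(ranking[:k]) if doc in relevant)
    idcg = sum(1.0 / math.log2(pos + 2)
               for pos in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

ndcg_at_k(["d1", "d2", "d3"], {"d1"})  # relevant doc ranked first -> 1.0
ndcg_at_k(["d2", "d1", "d3"], {"d1"})  # relevant doc ranked second -> ~0.631
```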
The numbers in model names (e.g., 0_1_2_18) indicate which layers are retained from the original 22-layer model:
- L4H384 (0_1_2_18): Keeps layers 0, 1, 2, and 18 → 4 layers total
- L7H384 (0_1_2_3_4_5_18): Keeps layers 0–5 and 18 → 7 layers total
- L13H384 (0_1_2_3_4_5_6_7_8_9_10_11_18): Keeps layers 0–11 and 18 → 13 layers total
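The suffix maps back to layer indices mechanically; a small parser (a hypothetical helper matching the naming scheme above, not shipped with the models):

```python
def parse_kept_layers(name: str) -> list[int]:
    """Extract the retained layer indices from a pruned-model name,
    e.g. 'L7H384-0_1_2_3_4_5_18' -> [0, 1, 2, 3, 4, 5, 18]."""
    suffix = name.split("-", 1)[1]  # part after 'L{n}H{dim}-'
    return [int(i) for i in suffix.split("_")]

layers = parse_kept_layers("L4H384-0_1_2_18")  # -> [0, 1, 2, 18]
assert len(layers) == 4  # matches the L4 in the name
```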
Why These Configurations?
We chose these "official" configurations based on two criteria:
- Simplicity: Consecutive layer indices (0, 1, 2, 3, ...) are easier to understand and reproduce than scattered indices like `0_1_2_3_6_8_18`.
- Competitive performance: While not always the absolute best score, these configurations perform competitively within their layer count category.
For example, L7H384-0_1_2_3_6_8_18 (mean: 0.4722) slightly outperforms our official pick L7H384-0_1_2_3_4_5_18 (mean: 0.4693), but the consecutive layer pattern is more interpretable and the performance difference is marginal.
Why Layer 18?
ModernBERT uses an alternating attention pattern:
- Global attention (g): Full self-attention across all tokens
- Local attention (l): Attention within a sliding window
The pattern follows a g-l-l-g-l-l-... rhythm. In the original 22-layer mmBERT-small, both layer 18 and layer 21 are global attention layers, with layer 21 being the final layer.
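Under this cadence, the global layers fall at every third index starting from 0 (in ModernBERT configs this spacing is controlled by `global_attn_every_n_layers`; the sketch below assumes the default of 3):

```python
def attention_pattern(num_layers: int, every: int = 3) -> str:
    """'g' for a global attention layer, 'l' for local, one char per layer."""
    return "".join("g" if i % every == 0 else "l" for i in range(num_layers))

pattern = attention_pattern(22)
# pattern == "gllgllgllgllgllgllgllg": globals at 0, 3, 6, 9, 12, 15, 18, 21,
# so layers 18 and 21 are indeed both global attention layers
```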
However, our experiments showed that ending with layer 18 outperforms ending with layer 21 at the 7- and 13-layer sizes, while the two are nearly tied at 4 layers:
- L4H384-0_1_2_18 (mean: 0.4530) vs. L4H384-0_1_2_21 (mean: 0.4558)
- L7H384-0_1_2_3_4_5_18 (mean: 0.4693) vs. L7H384-0_1_2_3_4_5_21 (mean: 0.4629)
- L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18 (mean: 0.4964) vs. L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21 (mean: 0.4800)
This suggests that the representations at layer 18 are more effective for retrieval tasks when combined with early layers, possibly because layer 18 provides a better balance between abstraction and retention of fine-grained information.
Experimental Variations
We explored different pruning strategies by shifting the start positions and coverage:
- Front-heavy (e.g., `0_1_2_3_4_5_18`): Retains early layers, skips middle layers
- Back-heavy (e.g., `0_16_17_18_19_20_21`): Retains later layers
- Distributed (e.g., `0_1_2_3_4_5_6_7_8_10_12_15_18`): Spreads retained layers across depth
This probes the trade-off between depth (how many layers) and coverage (which parts of the network contribute).
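Each strategy is just an index list, which makes sanity checks easy to automate. The snippet below (an illustrative check, not the authors' selection code) verifies that every strategy retains at least one global attention layer of the original stack:

```python
# Global attention layers of the original 22-layer mmBERT-small,
# assuming the g-l-l cadence (every third layer, starting at 0).
GLOBAL = {i for i in range(22) if i % 3 == 0}  # {0, 3, 6, 9, 12, 15, 18, 21}

strategies = {
    "front-heavy": [0, 1, 2, 3, 4, 5, 18],
    "back-heavy":  [0, 16, 17, 18, 19, 20, 21],
    "distributed": [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18],
}
for name, keep in strategies.items():
    assert GLOBAL & set(keep), f"{name} keeps no global attention layer"
```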
Scores (NDCG@10) — All L4/L7/L13 Runs
| model | mean | NanoArguAna | NanoClimateFEVER | NanoDBPedia | NanoFEVER | NanoFiQA2018 | NanoHotpotQA | NanoMSMARCO | NanoNFCorpus | NanoNQ | NanoQuoraRetrieval | NanoSCIDOCS | NanoSciFact | NanoTouche2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mmBERT-small (22 layers) | 0.5151 | 0.4345 | 0.2888 | 0.4548 | 0.7534 | 0.4199 | 0.6629 | 0.5853 | 0.2849 | 0.5634 | 0.9367 | 0.2704 | 0.5042 | 0.5378 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_12 | 0.4553 | 0.3908 | 0.2715 | 0.4385 | 0.7290 | 0.3289 | 0.6191 | 0.4702 | 0.2178 | 0.4649 | 0.9198 | 0.2152 | 0.4402 | 0.4129 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_15 | 0.4576 | 0.4395 | 0.2457 | 0.4284 | 0.7472 | 0.3237 | 0.5920 | 0.4918 | 0.2199 | 0.4531 | 0.9195 | 0.1852 | 0.4820 | 0.4208 |
| **L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18** | 0.4964 | 0.4462 | 0.2955 | 0.4907 | 0.7564 | 0.3886 | 0.6469 | 0.5142 | 0.2644 | 0.5268 | 0.9412 | 0.2326 | 0.4840 | 0.4662 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21 | 0.4800 | 0.4162 | 0.2858 | 0.4695 | 0.7197 | 0.3358 | 0.6338 | 0.5512 | 0.2603 | 0.5127 | 0.9305 | 0.2389 | 0.4457 | 0.4393 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_12_18 | 0.4904 | 0.4594 | 0.2619 | 0.4904 | 0.7481 | 0.3832 | 0.6552 | 0.5476 | 0.2540 | 0.5092 | 0.9183 | 0.2411 | 0.4518 | 0.4551 |
| L13H384-0_1_2_3_4_5_6_7_8_9_12_15_18 | 0.4791 | 0.4401 | 0.2754 | 0.4849 | 0.7384 | 0.3201 | 0.6369 | 0.5059 | 0.2478 | 0.5237 | 0.9190 | 0.2602 | 0.4666 | 0.4099 |
| L13H384-0_1_2_3_4_5_6_7_8_10_12_15_18 | 0.4877 | 0.4384 | 0.2749 | 0.4937 | 0.7299 | 0.3366 | 0.6698 | 0.5314 | 0.2588 | 0.5073 | 0.9264 | 0.2430 | 0.4879 | 0.4414 |
| L13H384-0_1_2_3_4_5_6_7_9_12_15_18_21 | 0.4810 | 0.4007 | 0.2739 | 0.4989 | 0.7180 | 0.3403 | 0.6441 | 0.5257 | 0.2541 | 0.5093 | 0.9187 | 0.2413 | 0.4905 | 0.4371 |
| L13H384-0_10_11_12_13_14_15_16_17_18_19_20_21 | 0.4806 | 0.3938 | 0.2855 | 0.4911 | 0.7974 | 0.3504 | 0.6034 | 0.5211 | 0.2361 | 0.4486 | 0.9144 | 0.2257 | 0.4508 | 0.5294 |
| L13H384-9_10_11_12_13_14_15_16_17_18_19_20_21 | 0.4307 | 0.3901 | 0.2621 | 0.4753 | 0.7185 | 0.2927 | 0.5371 | 0.4487 | 0.2361 | 0.3267 | 0.8605 | 0.1513 | 0.3860 | 0.5143 |
| L7H384-0_1_2_3_4_5_6 | 0.4291 | 0.3635 | 0.2839 | 0.4665 | 0.6299 | 0.2958 | 0.5433 | 0.4692 | 0.1841 | 0.4174 | 0.8800 | 0.2217 | 0.4570 | 0.3660 |
| L7H384-0_1_2_3_4_5_9 | 0.4282 | 0.3929 | 0.2719 | 0.4447 | 0.6674 | 0.2890 | 0.5192 | 0.4847 | 0.2226 | 0.3850 | 0.8870 | 0.2074 | 0.4145 | 0.3804 |
| L7H384-0_1_2_3_4_5_12 | 0.4204 | 0.4035 | 0.2501 | 0.4283 | 0.6245 | 0.3044 | 0.5350 | 0.4518 | 0.1900 | 0.3760 | 0.8763 | 0.2073 | 0.4438 | 0.3748 |
| **L7H384-0_1_2_3_4_5_18** | 0.4693 | 0.3879 | 0.2782 | 0.5046 | 0.7257 | 0.3631 | 0.6139 | 0.4633 | 0.2353 | 0.4623 | 0.8951 | 0.2310 | 0.5111 | 0.4296 |
| L7H384-0_1_2_3_4_5_21 | 0.4629 | 0.4331 | 0.2731 | 0.4958 | 0.7667 | 0.3368 | 0.5943 | 0.4194 | 0.2666 | 0.4428 | 0.8742 | 0.2542 | 0.4220 | 0.4386 |
| L7H384-0_1_2_3_6_7_8 | 0.4236 | 0.3903 | 0.2590 | 0.4613 | 0.6097 | 0.2692 | 0.5962 | 0.4556 | 0.1790 | 0.3755 | 0.8501 | 0.2157 | 0.4596 | 0.3850 |
| L7H384-0_1_2_3_6_7_12 | 0.4149 | 0.3752 | 0.2369 | 0.4489 | 0.5763 | 0.2798 | 0.5630 | 0.4600 | 0.1955 | 0.3881 | 0.8458 | 0.2303 | 0.4260 | 0.3671 |
| L7H384-0_1_2_3_6_8_12 | 0.4171 | 0.3215 | 0.2305 | 0.4491 | 0.5696 | 0.2803 | 0.5615 | 0.4959 | 0.1897 | 0.3790 | 0.8756 | 0.2313 | 0.4600 | 0.3787 |
| L7H384-0_1_2_3_6_8_18 | 0.4722 | 0.3988 | 0.2619 | 0.5002 | 0.7551 | 0.3186 | 0.6438 | 0.5024 | 0.2429 | 0.4259 | 0.8969 | 0.2162 | 0.5054 | 0.4704 |
| L7H384-0_16_17_18_19_20_21 | 0.4589 | 0.3684 | 0.2711 | 0.4949 | 0.7224 | 0.3087 | 0.5750 | 0.4676 | 0.2317 | 0.4541 | 0.8829 | 0.2050 | 0.4668 | 0.5171 |
| L7H384-15_16_17_18_19_20_21 | 0.4299 | 0.3728 | 0.2747 | 0.4572 | 0.6557 | 0.2529 | 0.5594 | 0.4474 | 0.2197 | 0.3528 | 0.8883 | 0.1887 | 0.4160 | 0.5034 |
| L4H384-0_1_2_3 | 0.3329 | 0.2011 | 0.1529 | 0.4820 | 0.3088 | 0.1937 | 0.4178 | 0.3890 | 0.1897 | 0.3238 | 0.8441 | 0.2045 | 0.2912 | 0.3286 |
| **L4H384-0_1_2_18** | 0.4530 | 0.3806 | 0.2544 | 0.4657 | 0.7230 | 0.2793 | 0.5704 | 0.5060 | 0.2270 | 0.4283 | 0.8942 | 0.2246 | 0.4671 | 0.4682 |
| L4H384-0_1_2_21 | 0.4558 | 0.3801 | 0.2553 | 0.4871 | 0.7350 | 0.3097 | 0.5734 | 0.4899 | 0.2510 | 0.4193 | 0.8860 | 0.2249 | 0.4620 | 0.4517 |
| L4H384-0_19_20_21 | 0.4408 | 0.3888 | 0.2651 | 0.4880 | 0.6629 | 0.3018 | 0.6010 | 0.4224 | 0.2342 | 0.4086 | 0.8714 | 0.2027 | 0.4238 | 0.4597 |
| L4H384-18_19_20_21 | 0.4130 | 0.3067 | 0.2546 | 0.4740 | 0.6206 | 0.2363 | 0.5393 | 0.4074 | 0.2233 | 0.2879 | 0.8850 | 0.2015 | 0.4270 | 0.5058 |
Bold rows indicate the official picks for each layer count.
Key Findings
- Front-heavy pruning works best: Retaining early layers (0–N) plus a global attention layer consistently outperforms the other strategies.
- Layer 18 > Layer 21: Ending with layer 18 (an intermediate global attention layer) outperforms ending with layer 21 (the final global attention layer) at the 7- and 13-layer sizes, suggesting that intermediate global attention layers provide better representations for retrieval when combined with early layers.
- Early layers are critical: Models that skip early layers (e.g., `9_10_11_...` or `15_16_17_...`) show significant performance degradation.
- Diminishing returns with depth: L13 (0.4964) vs. L7 (0.4693) is a gain of only ~0.027 NDCG@10 (about 6% relative) for nearly double the layers.
License
MIT
Base model: jhu-clsp/mmBERT-small