mmBERT-L4H384 / mmBERT-L7H384 / mmBERT-L13H384

Pruned variants of mmBERT-small.


⚠️ Note: Pruning-Only (Not Distilled)

These are pruning-only variants: we simply remove layers, without applying any knowledge distillation or fine-tuning to the released checkpoints. Fully trained or distilled models with the same architecture may outperform these pruned versions.

Overview

These models are created by layer pruning from mmBERT-small (22 layers, hidden size 384). We select specific layers to retain while preserving the ModernBERT global/local attention cadence.
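Conceptually, this kind of pruning just keeps a subset of the encoder blocks in their original order. A minimal sketch, with plain Python lists standing in for the model's module list (in a real model you would rebuild an nn.ModuleList and update config.num_hidden_layers):

```python
def prune_layers(layers, keep):
    """Return the retained transformer blocks in their original order.

    `layers` stands in for the model's list of encoder blocks; in practice
    this would be an nn.ModuleList, rebuilt after selection.
    """
    keep = sorted(set(keep))
    assert all(0 <= i < len(layers) for i in keep), "layer index out of range"
    return [layers[i] for i in keep]

# 22-layer mmBERT-small pruned to the L4H384 configuration:
blocks = [f"layer_{i}" for i in range(22)]
pruned = prune_layers(blocks, [0, 1, 2, 18])
print(pruned)  # ['layer_0', 'layer_1', 'layer_2', 'layer_18']
```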

Layer Selection and Evaluation

We fine-tuned the pruned models for information retrieval on the MS MARCO dataset and evaluated them on nanoBEIR (NDCG@10).
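NDCG@10 rewards relevant documents ranked near the top and discounts them logarithmically by rank. A minimal implementation of the standard formula (a sketch, not necessarily identical to the nanoBEIR scorer):

```python
import math

def dcg_at_k(relevances, k=10):
    # Rank 0 contributes rel/log2(2), rank 1 contributes rel/log2(3), ...
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0]))  # 1.0: the only relevant document is ranked first
```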

The numbers in model names (e.g., 0_1_2_18) indicate which layers are retained from the original 22-layer model:

  • L4H384 (0_1_2_18): Keeps layers 0, 1, 2, and 18 → 4 layers total
  • L7H384 (0_1_2_3_4_5_18): Keeps layers 0–5 and 18 → 7 layers total
  • L13H384 (0_1_2_3_4_5_6_7_8_9_10_11_18): Keeps layers 0–11 and 18 → 13 layers total
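The retained-layer list can be read straight off a model name with a small helper (hypothetical, written against the suffix format documented above):

```python
def parse_retained_layers(model_name: str) -> list[int]:
    # "mmBERT-small-L7H384-0_1_2_3_4_5_18" -> [0, 1, 2, 3, 4, 5, 18]
    suffix = model_name.rsplit("-", 1)[-1]
    return [int(i) for i in suffix.split("_")]

print(parse_retained_layers("mmBERT-small-L4H384-0_1_2_18"))  # [0, 1, 2, 18]
```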

Why These Configurations?

We chose these "official" configurations based on two criteria:

  1. Simplicity: Consecutive layer indices (0, 1, 2, 3, ...) are easier to understand and reproduce than scattered indices like 0_1_2_3_6_8_18.

  2. Competitive performance: While not always the absolute best score, these configurations perform competitively within their layer count category.

For example, L7H384-0_1_2_3_6_8_18 (mean: 0.4722) slightly outperforms our official pick L7H384-0_1_2_3_4_5_18 (mean: 0.4693), but the consecutive layer pattern is more interpretable and the performance difference is marginal.

Why Layer 18?

ModernBERT uses an alternating attention pattern:

  • Global attention (g): Full self-attention across all tokens
  • Local attention (l): Attention within a sliding window

The pattern follows a g-l-l-g-l-l-... rhythm. In the original 22-layer mmBERT-small, both layer 18 and layer 21 are global attention layers, with layer 21 being the final layer.
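With one global layer every third block, the global layers of the 22-layer model fall at indices divisible by 3, which is why both 18 and 21 are global:

```python
def is_global(layer_idx: int, cycle: int = 3) -> bool:
    # g-l-l-g-l-l-...: one global attention layer every `cycle` blocks
    return layer_idx % cycle == 0

global_layers = [i for i in range(22) if is_global(i)]
print(global_layers)  # [0, 3, 6, 9, 12, 15, 18, 21]
```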

However, our experiments showed that ending with layer 18 outperforms ending with layer 21 once enough early layers are kept, while the 4-layer variants are essentially tied (layer 21 marginally ahead):

  • L4H384-0_1_2_18 (mean: 0.4530) vs L4H384-0_1_2_21 (mean: 0.4558)
  • L7H384-0_1_2_3_4_5_18 (mean: 0.4693) vs L7H384-0_1_2_3_4_5_21 (mean: 0.4629)
  • L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18 (mean: 0.4964) vs L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21 (mean: 0.4800)

This suggests that, given a sufficient stack of early layers, the representations at layer 18 are more effective for retrieval than those at layer 21, possibly because layer 18 strikes a better balance between abstraction and retention of fine-grained information.

Experimental Variations

We explored different pruning strategies by varying which parts of the network depth are retained:

  • Front-heavy (e.g., 0_1_2_3_4_5_18): Retains the early layers plus one late global layer, skipping the middle
  • Back-heavy (e.g., 0_16_17_18_19_20_21): Retains layer 0 plus the final layers
  • Distributed (e.g., 0_1_2_3_4_5_6_7_8_10_12_15_18): Spreads retained layers across the full depth

This probes the trade-off between depth (how many layers) and coverage (which parts of the network contribute).
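These index lists follow simple patterns; hypothetical helpers (the function names are ours, chosen for illustration) that generate the front- and back-heavy configurations:

```python
def front_heavy(n_keep: int, anchor: int = 18) -> list[int]:
    # Keep the first n_keep-1 layers plus one late global layer.
    return list(range(n_keep - 1)) + [anchor]

def back_heavy(n_keep: int, total: int = 22) -> list[int]:
    # Keep layer 0 plus the last n_keep-1 layers.
    return [0] + list(range(total - n_keep + 1, total))

print(front_heavy(7))  # [0, 1, 2, 3, 4, 5, 18]
print(back_heavy(7))   # [0, 16, 17, 18, 19, 20, 21]
```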

Scores (NDCG@10) — All L4/L7/L13 Runs

model mean NanoArguAna NanoClimateFEVER NanoDBPedia NanoFEVER NanoFiQA2018 NanoHotpotQA NanoMSMARCO NanoNFCorpus NanoNQ NanoQuoraRetrieval NanoSCIDOCS NanoSciFact NanoTouche2020
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_8_9_10_11_12 0.4553 0.3908 0.2715 0.4385 0.7290 0.3289 0.6191 0.4702 0.2178 0.4649 0.9198 0.2152 0.4402 0.4129
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_8_9_10_11_15 0.4576 0.4395 0.2457 0.4284 0.7472 0.3237 0.5920 0.4918 0.2199 0.4531 0.9195 0.1852 0.4820 0.4208
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18 0.4964 0.4462 0.2955 0.4907 0.7564 0.3886 0.6469 0.5142 0.2644 0.5268 0.9412 0.2326 0.4840 0.4662
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21 0.4800 0.4162 0.2858 0.4695 0.7197 0.3358 0.6338 0.5512 0.2603 0.5127 0.9305 0.2389 0.4457 0.4393
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_8_9_10_12_18 0.4904 0.4594 0.2619 0.4904 0.7481 0.3832 0.6552 0.5476 0.2540 0.5092 0.9183 0.2411 0.4518 0.4551
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_8_9_12_15_18 0.4791 0.4401 0.2754 0.4849 0.7384 0.3201 0.6369 0.5059 0.2478 0.5237 0.9190 0.2602 0.4666 0.4099
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_8_10_12_15_18 0.4877 0.4384 0.2749 0.4937 0.7299 0.3366 0.6698 0.5314 0.2588 0.5073 0.9264 0.2430 0.4879 0.4414
mmBERT-small-L13H384-0_1_2_3_4_5_6_7_9_12_15_18_21 0.4810 0.4007 0.2739 0.4989 0.7180 0.3403 0.6441 0.5257 0.2541 0.5093 0.9187 0.2413 0.4905 0.4371
mmBERT-small-L13H384-0_10_11_12_13_14_15_16_17_18_19_20_21 0.4806 0.3938 0.2855 0.4911 0.7974 0.3504 0.6034 0.5211 0.2361 0.4486 0.9144 0.2257 0.4508 0.5294
mmBERT-small-L13H384-9_10_11_12_13_14_15_16_17_18_19_20_21 0.4307 0.3901 0.2621 0.4753 0.7185 0.2927 0.5371 0.4487 0.2361 0.3267 0.8605 0.1513 0.3860 0.5143
mmBERT-small-L7H384-0_1_2_3_4_5_6 0.4291 0.3635 0.2839 0.4665 0.6299 0.2958 0.5433 0.4692 0.1841 0.4174 0.8800 0.2217 0.4570 0.3660
mmBERT-small-L7H384-0_1_2_3_4_5_9 0.4282 0.3929 0.2719 0.4447 0.6674 0.2890 0.5192 0.4847 0.2226 0.3850 0.8870 0.2074 0.4145 0.3804
mmBERT-small-L7H384-0_1_2_3_4_5_12 0.4204 0.4035 0.2501 0.4283 0.6245 0.3044 0.5350 0.4518 0.1900 0.3760 0.8763 0.2073 0.4438 0.3748
mmBERT-small-L7H384-0_1_2_3_4_5_18 0.4693 0.3879 0.2782 0.5046 0.7257 0.3631 0.6139 0.4633 0.2353 0.4623 0.8951 0.2310 0.5111 0.4296
mmBERT-small-L7H384-0_1_2_3_4_5_21 0.4629 0.4331 0.2731 0.4958 0.7667 0.3368 0.5943 0.4194 0.2666 0.4428 0.8742 0.2542 0.4220 0.4386
mmBERT-small-L7H384-0_1_2_3_6_7_8 0.4236 0.3903 0.2590 0.4613 0.6097 0.2692 0.5962 0.4556 0.1790 0.3755 0.8501 0.2157 0.4596 0.3850
mmBERT-small-L7H384-0_1_2_3_6_7_12 0.4149 0.3752 0.2369 0.4489 0.5763 0.2798 0.5630 0.4600 0.1955 0.3881 0.8458 0.2303 0.4260 0.3671
mmBERT-small-L7H384-0_1_2_3_6_8_12 0.4171 0.3215 0.2305 0.4491 0.5696 0.2803 0.5615 0.4959 0.1897 0.3790 0.8756 0.2313 0.4600 0.3787
mmBERT-small-L7H384-0_1_2_3_6_8_18 0.4722 0.3988 0.2619 0.5002 0.7551 0.3186 0.6438 0.5024 0.2429 0.4259 0.8969 0.2162 0.5054 0.4704
mmBERT-small-L7H384-0_16_17_18_19_20_21 0.4589 0.3684 0.2711 0.4949 0.7224 0.3087 0.5750 0.4676 0.2317 0.4541 0.8829 0.2050 0.4668 0.5171
mmBERT-small-L7H384-15_16_17_18_19_20_21 0.4299 0.3728 0.2747 0.4572 0.6557 0.2529 0.5594 0.4474 0.2197 0.3528 0.8883 0.1887 0.4160 0.5034
mmBERT-small-L4H384-0_1_2_3 0.3329 0.2011 0.1529 0.4820 0.3088 0.1937 0.4178 0.3890 0.1897 0.3238 0.8441 0.2045 0.2912 0.3286
mmBERT-small-L4H384-0_1_2_18 0.4530 0.3806 0.2544 0.4657 0.7230 0.2793 0.5704 0.5060 0.2270 0.4283 0.8942 0.2246 0.4671 0.4682
mmBERT-small-L4H384-0_1_2_21 0.4558 0.3801 0.2553 0.4871 0.7350 0.3097 0.5734 0.4899 0.2510 0.4193 0.8860 0.2249 0.4620 0.4517
mmBERT-small-L4H384-0_19_20_21 0.4408 0.3888 0.2651 0.4880 0.6629 0.3018 0.6010 0.4224 0.2342 0.4086 0.8714 0.2027 0.4238 0.4597
mmBERT-small-L4H384-18_19_20_21 0.4130 0.3067 0.2546 0.4740 0.6206 0.2363 0.5393 0.4074 0.2233 0.2879 0.8850 0.2015 0.4270 0.5058

The official picks for each layer count are L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18, L7H384-0_1_2_3_4_5_18, and L4H384-0_1_2_18.

Key Findings

  1. Front-heavy pruning works best: Retaining early layers (0–N) plus a global attention layer consistently outperforms other strategies.

  2. Layer 18 > Layer 21 (at 7+ layers): Ending with layer 18 (an intermediate global attention layer) outperforms ending with layer 21 (the final global attention layer) for the 7- and 13-layer variants, with the 4-layer variants essentially tied. Intermediate global attention appears to yield better retrieval representations when combined with early layers.

  3. Early layers are critical: Models that skip early layers (e.g., 9_10_11_... or 15_16_17_...) show significant performance degradation.

  4. Diminishing returns with depth: L13 (0.4964) vs L7 (0.4693) gains only about 0.03 NDCG@10 (roughly 6% relative) for nearly double the layers.

License

MIT
