W2V-BERT 2.0 ASR Adapters (v30 - LightweightConformerAdapter)
This repository contains per-language LightweightConformerAdapter modules for automatic speech recognition (ASR) trained on top of facebook/w2v-bert-2.0.
Model Description
- Base Model: facebook/w2v-bert-2.0 (600M parameters, frozen)
- Adapter Architecture: LightweightConformerAdapter (GLU + depthwise conv + GroupNorm, size=256)
- Decoder: Lightweight transformer decoder (2 layers)
- Training: CTC loss with extended vocabulary for double vowels
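The base encoder stays frozen throughout training; only the adapters, decoder, and LM head receive gradients. Below is a minimal sketch of loading and freezing the encoder with `transformers` (the actual training code is not included in this repository, so treat it as illustrative):

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

# Load the base encoder (hidden size 1024) and its feature extractor.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # frozen: only adapters, decoder, and LM head are trained

# inputs = feature_extractor(audio_array, sampling_rate=16_000, return_tensors="pt")
# hidden_states = encoder(**inputs).last_hidden_state  # (batch, time, 1024)
```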
Trained Adapters
Summary: 7 adapters trained
- ✅ Good (WER < 30%): 0
- ⚠️ Medium (WER 30-60%): 0
- ❌ Collapsed (WER ≥ 90%): 7
| Adapter | Language | WER | Status | Train Samples |
|---|---|---|---|---|
| ach_Latn | Acholi | 95.98% | ❌ Collapsed | 4,825 |
| eng_Latn_salt | English (SALT) | 100.00% | ❌ Collapsed | 4,804 |
| eng_Latn_tts | English (TTS) | 99.87% | ❌ Collapsed | 3,030 |
| ful_Latn | Fulah | 98.36% | ❌ Collapsed | 2,355 |
| kam_Latn | Kamba | 99.33% | ❌ Collapsed | 14,968 |
| kik_Latn | Kikuyu | 99.35% | ❌ Collapsed | 14,966 |
| lug_Latn_salt | Luganda (SALT) | 100.00% | ❌ Collapsed | 5,002 |
Architecture (v30 LightweightConformerAdapter)
The model uses:
- Frozen w2v-bert-2.0 encoder - Extracts audio representations
- LightweightConformerAdapters - GLU gating + depthwise temporal conv (kernel=15) + GroupNorm
- Lightweight decoder - Transformer decoder blocks (trainable)
- LM head - Per-language vocabulary projection (trainable)
Conformer Adapter Details
- Down projection + GLU: Conv1d(1024 → 256*2, k=1) + GLU → 256
- Depthwise conv: DepthwiseConv1d(256, k=15)
- GroupNorm: 32 groups
- Up projection: Conv1d(256 → 1024, k=1)
- Activation: SiLU (Swish)
- ~790K params per adapter, ~19M total
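A minimal PyTorch sketch of an adapter matching the dimensions listed above (roughly 790K parameters). The residual connection and the exact placement of the SiLU activation are assumptions, not details taken from this card:

```python
import torch
import torch.nn as nn

class LightweightConformerAdapter(nn.Module):
    """Sketch of the per-language adapter described above (~790K params)."""

    def __init__(self, hidden_size: int = 1024, adapter_size: int = 256,
                 kernel_size: int = 15, groups: int = 32):
        super().__init__()
        # Down projection + GLU: 1024 -> 2*256 channels, GLU halves them to 256.
        self.down = nn.Conv1d(hidden_size, adapter_size * 2, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        # Depthwise temporal convolution over the 256-dim adapter stream (kernel=15).
        self.depthwise = nn.Conv1d(adapter_size, adapter_size, kernel_size,
                                   padding=kernel_size // 2, groups=adapter_size)
        self.norm = nn.GroupNorm(groups, adapter_size)
        self.act = nn.SiLU()
        # Up projection back to the encoder hidden size.
        self.up = nn.Conv1d(adapter_size, hidden_size, kernel_size=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, time, hidden_size)
        x = hidden_states.transpose(1, 2)          # (batch, channels, time)
        x = self.glu(self.down(x))
        x = self.act(self.norm(self.depthwise(x)))
        x = self.up(x).transpose(1, 2)
        return hidden_states + x                   # residual connection (assumed)
```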
Usage
Each adapter folder contains:
- `adapter_weights.pt` - LightweightConformerAdapter weights
- `decoder_weights.pt` - Decoder block weights
- `lm_head_weights.pt` - Language model head weights
- `final_norm_weights.pt` - Final layer norm weights
- `vocab.json` - Language-specific vocabulary
- `adapter_config.json` - Adapter configuration
- `metrics.json` - Training metrics
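A minimal sketch of fetching one adapter's files from the Hub. The folder layout is assumed to mirror the list above; the module classes these state dicts load into come from the training code, which is not part of this repository:

```python
import json
import torch
from huggingface_hub import hf_hub_download

repo_id = "mutisya/w2v-bert-adapters-14lang-e10-28_07-v9"
adapter_name = "lug_Latn_salt"  # any adapter folder listed in the table above

def fetch(filename: str) -> str:
    # Download one file from the adapter's folder and return its local path.
    return hf_hub_download(repo_id=repo_id, filename=filename, subfolder=adapter_name)

# Per-language vocabulary and trainable-component weights.
with open(fetch("vocab.json")) as f:
    vocab = json.load(f)

adapter_state = torch.load(fetch("adapter_weights.pt"), map_location="cpu")
decoder_state = torch.load(fetch("decoder_weights.pt"), map_location="cpu")
lm_head_state = torch.load(fetch("lm_head_weights.pt"), map_location="cpu")
final_norm_state = torch.load(fetch("final_norm_weights.pt"), map_location="cpu")

# These state dicts are then loaded into the corresponding modules, e.g.:
# adapter.load_state_dict(adapter_state)
```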
Training Configuration
- Epochs: 10
- Base Learning Rate: 0.0003 (scaled adaptively with dataset size)
- Batch Size: 48 x 1
- Extended Vocabulary: True
- Adapter Size: 256
- Conv Kernel Size: 15
- GroupNorm Groups: 32
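For reference, a minimal sketch of a CTC loss computed over LM-head log-probabilities. The blank index, sequence lengths, and vocabulary size here are illustrative assumptions, not values from the training code:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: (time, batch, vocab) log-probabilities from the LM head.
log_probs = torch.randn(120, 48, 64).log_softmax(dim=-1)
targets = torch.randint(1, 64, (48, 20))                   # blank token = index 0 (assumed)
input_lengths = torch.full((48,), 120, dtype=torch.long)   # encoder frames per utterance
target_lengths = torch.full((48,), 20, dtype=torch.long)   # transcript tokens per utterance

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```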
License
Apache 2.0