# W2V-BERT 2.0 ASR Adapters (v25 - ConvAdapter)
This repository contains per-language ConvAdapter modules for automatic speech recognition (ASR) trained on top of facebook/w2v-bert-2.0.
## Model Description
- Base Model: facebook/w2v-bert-2.0 (600M parameters, frozen)
- Adapter Architecture: ConvAdapter (Conv1d + depthwise temporal conv, dim=64)
- Decoder: Lightweight transformer decoder (2 layers)
- Training: CTC loss with extended vocabulary for double vowels
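As a sketch of how the extended vocabulary interacts with CTC, the snippet below shows standard CTC greedy (best-path) decoding in plain Python. The token ids and vocabulary are illustrative, not taken from `vocab.json`; the motivation given in the comment (double-vowel tokens surviving repeat-collapsing) is an assumption about why the vocabulary was extended.

```python
BLANK = 0  # illustrative blank id; the real id comes from vocab.json

def ctc_greedy_decode(frame_ids, id_to_token):
    """Collapse repeated ids, then drop blanks - standard CTC best-path decoding."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != BLANK:
            out.append(id_to_token[i])
        prev = i
    return "".join(out)

# Why an extended vocabulary can help: with per-character tokens, CTC cannot
# emit a double vowel like "aa" without an intervening blank frame, because
# consecutive repeats are collapsed. A dedicated "aa" token sidesteps that.
vocab = {1: "a", 2: "aa", 3: "ch", 4: "o", 5: "l", 6: "i"}
print(ctc_greedy_decode([1, 1, 0, 3, 3, 4, 0, 5, 6], vocab))  # -> "acholi"
print(ctc_greedy_decode([2], vocab))                           # -> "aa"
```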
## Trained Adapters

Summary: 2 adapters trained
- ✅ Good (WER < 30%): 0
- ⚠️ Medium (30-60%): 0
- ❌ Collapsed (≥ 90%): 2
| Adapter | Language | WER | Status | Train Samples |
|---|---|---|---|---|
| ach_Latn | Acholi | 99.63% | ❌ Collapsed | 4,825 |
| eng_Latn_salt | English (SALT) | 94.16% | ❌ Collapsed | 4,804 |
## Architecture (v25 ConvAdapter)
The model uses:
- Frozen w2v-bert-2.0 encoder - Extracts audio representations
- ConvAdapters - Conv1d + depthwise temporal conv (kernel=5) for local acoustic context
- Lightweight decoder - Transformer decoder blocks (trainable)
- LM head - Per-language vocabulary projection (trainable)
### ConvAdapter Details

- Down projection: Conv1d(1024 → 64, k=1)
- Temporal conv: DepthwiseConv1d(64, k=5)
- Up projection: Conv1d(64 → 1024, k=1)
- Activation: SiLU (Swish)
- ~131K params per adapter, ~3.2M total
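The adapter above can be sketched in PyTorch as a convolutional bottleneck. This is a minimal reconstruction from the listed dimensions, not the repository's implementation: the residual connection, activation placement, and padding choice are assumptions. The parameter count it yields (132,544) matches the "~131K per adapter" figure.

```python
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Bottleneck adapter: 1x1 down-projection, depthwise temporal conv, 1x1 up-projection."""

    def __init__(self, d_model: int = 1024, d_adapter: int = 64, kernel: int = 5):
        super().__init__()
        self.down = nn.Conv1d(d_model, d_adapter, kernel_size=1)           # 1024 -> 64
        self.temporal = nn.Conv1d(d_adapter, d_adapter, kernel_size=kernel,
                                  padding=kernel // 2, groups=d_adapter)   # depthwise, k=5
        self.up = nn.Conv1d(d_adapter, d_model, kernel_size=1)             # 64 -> 1024
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model), as produced by the frozen encoder
        h = x.transpose(1, 2)          # Conv1d expects (batch, channels, time)
        h = self.act(self.down(h))
        h = self.act(self.temporal(h))
        h = self.up(h)
        return x + h.transpose(1, 2)   # residual connection (assumed)

adapter = ConvAdapter()
n_params = sum(p.numel() for p in adapter.parameters())
print(n_params)  # 132,544 - the "~131K params per adapter" quoted above
```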
## Usage
Each adapter folder contains:
- `adapter_weights.pt` - ConvAdapter weights
- `decoder_weights.pt` - Decoder block weights
- `lm_head_weights.pt` - Language model head weights
- `final_norm_weights.pt` - Final layer norm weights
- `vocab.json` - Language-specific vocabulary
- `adapter_config.json` - Adapter configuration
- `metrics.json` - Training metrics
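A hypothetical helper for reading one adapter folder's weight files is sketched below. The file names come from the list above; `load_adapter_folder` itself is an assumption, and attaching the returned state dicts to the frozen `facebook/w2v-bert-2.0` encoder requires the repository's model classes, which are not shown here.

```python
import torch

# Weight files present in each per-language adapter folder (per the list above).
WEIGHT_FILES = ("adapter_weights", "decoder_weights",
                "lm_head_weights", "final_norm_weights")

def load_adapter_folder(folder: str) -> dict:
    """Return {part_name: state_dict} for one language's adapter folder (hypothetical helper)."""
    return {name: torch.load(f"{folder}/{name}.pt", map_location="cpu")
            for name in WEIGHT_FILES}
```

Each state dict would then be applied to the corresponding trainable module with `load_state_dict`, while the base encoder's parameters stay frozen.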
## Training Configuration
- Epochs: 10
- Base Learning Rate: 0.0005 (adaptive based on dataset size)
- Batch Size: 48 x 1
- Extended Vocabulary: True
- Adapter Dimension: 64
- Conv Kernel Size: 5
## License
Apache 2.0