W2V-BERT 2.0 ASR Adapters (v30 - LightweightConformerAdapter)

This repository contains per-language LightweightConformerAdapter modules for automatic speech recognition (ASR), trained on top of the frozen facebook/w2v-bert-2.0 encoder.

Model Description

  • Base Model: facebook/w2v-bert-2.0 (600M parameters, frozen)
  • Adapter Architecture: LightweightConformerAdapter (GLU + depthwise conv + GroupNorm, size=256)
  • Decoder: Lightweight transformer decoder (2 layers)
  • Training: CTC loss with extended vocabulary for double vowels
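
Building on the components above, the following is a minimal sketch of how the frozen base encoder can be loaded with the transformers library; it shows only the frozen part of the stack, not the adapter or decoder code from this repository.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

# Load the 600M-parameter base encoder and keep it frozen.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # only adapters, decoder, and LM head are trained
```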

Trained Adapters

Summary: 7 adapters trained

  • ✅ Good (WER < 30%): 0
  • ⚡ Medium (WER 30-60%): 0
  • ❌ Collapsed (WER ≥ 90%): 7

| Adapter | Language | WER | Status | Train Samples |
|---|---|---|---|---|
| ach_Latn | Acholi | 95.98% | ❌ Collapsed | 4,825 |
| eng_Latn_salt | English (SALT) | 100.00% | ❌ Collapsed | 4,804 |
| eng_Latn_tts | English (TTS) | 99.87% | ❌ Collapsed | 3,030 |
| ful_Latn | Fulah | 98.36% | ❌ Collapsed | 2,355 |
| kam_Latn | Kamba | 99.33% | ❌ Collapsed | 14,968 |
| kik_Latn | Kikuyu | 99.35% | ❌ Collapsed | 14,966 |
| lug_Latn_salt | Luganda (SALT) | 100.00% | ❌ Collapsed | 5,002 |

Architecture (v30 LightweightConformerAdapter)

The model uses:

  1. Frozen w2v-bert-2.0 encoder - Extracts audio representations
  2. LightweightConformerAdapters - GLU gating + depthwise temporal conv (kernel=15) + GroupNorm
  3. Lightweight decoder - Transformer decoder blocks (trainable)
  4. LM head - Per-language vocabulary projection (trainable)
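
A hedged sketch of how these pieces compose at inference time, reusing `feature_extractor` and `encoder` from the sketch above. The `adapter`, `decoder`, `final_norm`, and `lm_head` names are placeholders for the trained components (the adapter is sketched in the next section); the actual code may apply the adapter at a different point in the encoder.

```python
import torch

@torch.no_grad()
def transcribe_logits(waveform, sampling_rate=16_000):
    # 1. Frozen encoder: raw audio -> (batch, frames, 1024) hidden states.
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state

    # 2. Per-language conformer adapter (keeps the 1024-dim shape).
    hidden = adapter(hidden)

    # 3. Lightweight trainable decoder blocks + final norm.
    hidden = final_norm(decoder(hidden))

    # 4. Per-language LM head -> CTC logits over the extended vocabulary.
    return lm_head(hidden)
```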

Conformer Adapter Details

  • Down projection + GLU: Conv1d(1024 → 256*2, k=1) + GLU → 256
  • Depthwise conv: DepthwiseConv1d(256, k=15)
  • GroupNorm: 32 groups
  • Up projection: Conv1d(256 → 1024, k=1)
  • Activation: SiLU (Swish)
  • ~790K params per adapter, ~19M total
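
A minimal PyTorch reconstruction of the adapter from the numbers above. Treat it as an approximation rather than the repository's exact class: details such as the residual connection and any dropout are assumptions.

```python
import torch
import torch.nn as nn

class LightweightConformerAdapter(nn.Module):
    """Bottleneck conformer-style adapter: 1024 -> 256 -> 1024."""

    def __init__(self, hidden_dim=1024, adapter_dim=256, kernel_size=15, groups=32):
        super().__init__()
        self.down = nn.Conv1d(hidden_dim, adapter_dim * 2, kernel_size=1)  # 1024 -> 512
        self.glu = nn.GLU(dim=1)                                           # 512 -> 256 (gated)
        self.dw_conv = nn.Conv1d(adapter_dim, adapter_dim, kernel_size,
                                 padding=kernel_size // 2, groups=adapter_dim)  # depthwise, k=15
        self.norm = nn.GroupNorm(groups, adapter_dim)
        self.act = nn.SiLU()
        self.up = nn.Conv1d(adapter_dim, hidden_dim, kernel_size=1)        # 256 -> 1024

    def forward(self, x):
        # x: (batch, frames, hidden_dim); Conv1d expects (batch, channels, frames).
        residual = x
        x = x.transpose(1, 2)
        x = self.glu(self.down(x))
        x = self.act(self.norm(self.dw_conv(x)))
        x = self.up(x).transpose(1, 2)
        return residual + x  # assumed residual connection

With these dimensions the module comes to roughly 790K trainable parameters, matching the figure above.
```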

Usage

Each adapter folder contains:

  • adapter_weights.pt - LightweightConformerAdapter weights
  • decoder_weights.pt - Decoder block weights
  • lm_head_weights.pt - Language model head weights
  • final_norm_weights.pt - Final layer norm weights
  • vocab.json - Language-specific vocabulary
  • adapter_config.json - Adapter configuration
  • metrics.json - Training metrics
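
A hedged sketch of fetching these artifacts with huggingface_hub and torch. It assumes the files live in per-language subfolders named after the Adapter column in the table above; loading the state dicts into working modules still requires class definitions matching the training code.

```python
import json
import torch
from huggingface_hub import hf_hub_download

repo_id = "mutisya/w2v-bert-adapters-14lang-e10-28_07-v9"  # this repository
lang = "ach_Latn"  # one of the adapter folders listed in the table above

def fetch(name):
    return hf_hub_download(repo_id, filename=f"{lang}/{name}")

adapter_state = torch.load(fetch("adapter_weights.pt"), map_location="cpu")
decoder_state = torch.load(fetch("decoder_weights.pt"), map_location="cpu")
lm_head_state = torch.load(fetch("lm_head_weights.pt"), map_location="cpu")
final_norm_state = torch.load(fetch("final_norm_weights.pt"), map_location="cpu")

with open(fetch("vocab.json")) as f:
    vocab = json.load(f)

# The state dicts are then loaded into modules matching the training code, e.g.:
# adapter.load_state_dict(adapter_state)
```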

Training Configuration

  • Epochs: 10
  • Base Learning Rate: 0.0003 (adjusted based on dataset size)
  • Batch Size: 48 x 1
  • Extended Vocabulary: True
  • Adapter Size: 256
  • Conv Kernel Size: 15
  • GroupNorm Groups: 32
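
For reference, a hedged sketch of the CTC objective this configuration implies; the blank index, length handling, and function name are assumptions, not values taken from the training code.

```python
import torch
import torch.nn.functional as F

def ctc_step(logits, input_lengths, targets, target_lengths, blank_id=0):
    # logits: (batch, frames, vocab_size) from the per-language LM head.
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)  # CTC expects (frames, batch, vocab)
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=blank_id, zero_infinity=True)
```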

License

Apache 2.0
