🧠 Learnings from Hausa Wav2Vec2 Speech-to-Text Fine-Tuning
1. Overview
This training run fine-tuned a Wav2Vec2-based Automatic Speech Recognition (ASR) model for Hausa language transcription. The model was trained on labeled Hausa speech data, with evaluation metrics including Word Error Rate (WER), training/evaluation loss, and system metrics such as gradient norm, samples per second, and runtime.
The experiment successfully converged and produced intelligible transcriptions that align closely with the ground truth, showing notable phonetic and orthographic awareness despite minor deviations in word boundaries or inflectional endings.
2. Training Dynamics
Loss Trend
- Training Loss decreased sharply from ~25 to below 2 within the first 400–600 steps, showing strong early convergence.
- After ~1000 steps, the loss plateaued smoothly around 1.0–1.2, indicating that the model had stabilized and continued to refine minor acoustic-textual mismatches.
- The Eval Loss followed a similar trajectory, confirming good generalization and minimal overfitting.
This pattern indicates that the learning rate scheduler and optimizer were well-tuned, and that gradient stability was achieved early.
Learning Rate Schedule
- The learning rate followed a linear warm-up to approximately 5e-5 before decaying.
- This warm-up phase corresponds exactly to the period of steep loss reduction, confirming proper scheduler configuration.
- The smooth decline after the peak demonstrates effective control of gradient magnitudes and training momentum.
Gradient Norm
- Gradient norm started high (~60) during the first few hundred steps and quickly stabilized around 10–15.
- This stabilization indicates that gradients became well-scaled and the model entered a stable optimization regime.
- Only a few minor spikes were observed, possibly due to occasional harder batches or noisy audio samples.
3. Evaluation Behavior
Word Error Rate (WER)
- Initially, WER remained around 1.0–1.2 (100%+ error) until roughly step 880, after which it steadily dropped to around 0.6.
- The delayed improvement is consistent with Wav2Vec2’s pretraining adaptation phase — the model first aligns latent speech representations before learning linguistic decoding patterns.
- After 880 steps, the model began mapping phonemes to words more accurately, likely due to better alignment between the feature extractor and the CTC head.
- The final WER (≈0.6) represents substantial improvement, though further training or decoding optimization (e.g., beam search with a language model) could lower it further.
Eval Loss and Runtime
- Eval loss mirrored training loss, indicating consistent generalization.
- Eval runtime and samples per second remained stable, suggesting efficient GPU utilization and no data bottlenecks.
4. Qualitative Analysis
Example Inferences:
Predictions showed phonetic consistency with ground truth, even when orthographic variations occurred:
- e.g., “karamin kauye” → “kawoy kiloyta” — model preserved phonetic essence but lacked proper token segmentation.
- “jin tsoron wani abu” → “jn sorin wania” — correct phoneme-to-grapheme structure but missing spacing and vowel normalization.
This implies the model has learned the acoustic structure of Hausa speech effectively but still needs more data or a language model for text normalization and linguistic fluency.
Notable Strengths:
- Excellent handling of short and clear sentences (3–7s).
- Good representation of Hausa-specific phonemes like /ƙ/, /ts/, /sh/.
- Low overfitting: training and eval curves remain close.
Remaining Weaknesses:
- Token collapse errors (omitting short connecting words).
- Phoneme–grapheme ambiguity where accent or dialectal variation exists.
- Absence of post-processing or LM decoding to enforce word validity.
5. Interpretation of Training Behavior
The sharp drop in loss and eventual WER reduction confirm that:
- The model successfully adapted to Hausa phonetics from the pretrained multilingual base.
- The initial WER plateau reflects the phase where speech representation learning dominates before linguistic fine-tuning catches up.
- The performance trend matches expectations for low-resource ASR fine-tuning — strong early convergence, followed by gradual phonetic-linguistic alignment.
7. Summary
| Metric | Observation |
|---|---|
| Train Loss | ↓ from 25 → ~1.0 |
| Eval Loss | ↓ from 25 → ~1.5 |
| WER | ↓ from 1.2 → 0.6 (after 880th step) |
| Gradient Norm | Stabilized around 10–15 |
| Learning Rate Peak | ~5e-5 at ~1000 steps |
| Inference Quality | Good phonetic accuracy, partial orthographic alignment |
8. Conclusion
This Wav2Vec2 Hausa ASR fine-tuning run demonstrates strong convergence, good generalization, and a clear turning point around 880 steps where the model transitions from feature alignment to linguistic decoding. The trajectory and inference samples indicate a robust foundation for further improvement through decoding optimization and data expansion.
- Downloads last month
- 4

