YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

🧠 Learnings from Hausa Wav2Vec2 Speech-to-Text Fine-Tuning

1. Overview

This training run fine-tuned a Wav2Vec2-based Automatic Speech Recognition (ASR) model for Hausa language transcription. The model was trained on labeled Hausa speech data, with evaluation metrics including Word Error Rate (WER), training/evaluation loss, and system metrics such as gradient norm, samples per second, and runtime.

The experiment successfully converged and produced intelligible transcriptions that align closely with the ground truth, showing notable phonetic and orthographic awareness despite minor deviations in word boundaries or inflectional endings.


Screenshot 2025-11-02 at 1.51.25 pm

2. Training Dynamics

Loss Trend

  • Training Loss decreased sharply from ~25 to below 2 within the first 400–600 steps, showing strong early convergence.
  • After ~1000 steps, the loss plateaued smoothly around 1.0–1.2, indicating that the model had stabilized and continued to refine minor acoustic-textual mismatches.
  • The Eval Loss followed a similar trajectory, confirming good generalization and minimal overfitting.

This pattern indicates that the learning rate scheduler and optimizer were well-tuned, and that gradient stability was achieved early.

Learning Rate Schedule

  • The learning rate followed a linear warm-up to approximately 5e-5 before decaying.
  • This warm-up phase corresponds exactly to the period of steep loss reduction, confirming proper scheduler configuration.
  • The smooth decline after the peak demonstrates effective control of gradient magnitudes and training momentum.

Gradient Norm

  • Gradient norm started high (~60) during the first few hundred steps and quickly stabilized around 10–15.
  • This stabilization indicates that gradients became well-scaled and the model entered a stable optimization regime.
  • Only a few minor spikes were observed, possibly due to occasional harder batches or noisy audio samples.

3. Evaluation Behavior

Word Error Rate (WER)

  • Initially, WER remained around 1.0–1.2 (100%+ error) until roughly step 880, after which it steadily dropped to around 0.6.
  • The delayed improvement is consistent with Wav2Vec2’s pretraining adaptation phase — the model first aligns latent speech representations before learning linguistic decoding patterns.
  • After 880 steps, the model began mapping phonemes to words more accurately, likely due to better alignment between the feature extractor and the CTC head.
  • The final WER (≈0.6) represents substantial improvement, though further training or decoding optimization (e.g., beam search with a language model) could lower it further.

Eval Loss and Runtime

  • Eval loss mirrored training loss, indicating consistent generalization.
  • Eval runtime and samples per second remained stable, suggesting efficient GPU utilization and no data bottlenecks.

4. Qualitative Analysis

Screenshot 2025-11-02 at 1.50.33 pm

Example Inferences:

  • Predictions showed phonetic consistency with ground truth, even when orthographic variations occurred:

    • e.g., “karamin kauye” → “kawoy kiloyta” — model preserved phonetic essence but lacked proper token segmentation.
    • “jin tsoron wani abu” → “jn sorin wania” — correct phoneme-to-grapheme structure but missing spacing and vowel normalization.
  • This implies the model has learned the acoustic structure of Hausa speech effectively but still needs more data or a language model for text normalization and linguistic fluency.

Notable Strengths:

  • Excellent handling of short and clear sentences (3–7s).
  • Good representation of Hausa-specific phonemes like /ƙ/, /ts/, /sh/.
  • Low overfitting: training and eval curves remain close.

Remaining Weaknesses:

  • Token collapse errors (omitting short connecting words).
  • Phoneme–grapheme ambiguity where accent or dialectal variation exists.
  • Absence of post-processing or LM decoding to enforce word validity.

5. Interpretation of Training Behavior

The sharp drop in loss and eventual WER reduction confirm that:

  • The model successfully adapted to Hausa phonetics from the pretrained multilingual base.
  • The initial WER plateau reflects the phase where speech representation learning dominates before linguistic fine-tuning catches up.
  • The performance trend matches expectations for low-resource ASR fine-tuning — strong early convergence, followed by gradual phonetic-linguistic alignment.

7. Summary

Metric Observation
Train Loss ↓ from 25 → ~1.0
Eval Loss ↓ from 25 → ~1.5
WER ↓ from 1.2 → 0.6 (after 880th step)
Gradient Norm Stabilized around 10–15
Learning Rate Peak ~5e-5 at ~1000 steps
Inference Quality Good phonetic accuracy, partial orthographic alignment

8. Conclusion

This Wav2Vec2 Hausa ASR fine-tuning run demonstrates strong convergence, good generalization, and a clear turning point around 880 steps where the model transitions from feature alignment to linguistic decoding. The trajectory and inference samples indicate a robust foundation for further improvement through decoding optimization and data expansion.

Downloads last month
4
Safetensors
Model size
94.4M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support