# Igbo Tone & Diacritic Restoration (ByT5-small)

Automatically restores Igbo diacritics, both tone marks (à, á, ā) and subdot vowels (ị, ọ, ụ), from plain text. Built on ByT5-small (a byte-level seq2seq model), fine-tuned on a rebalanced dataset of 46,057 Igbo sentences.

Part of the Igbo Speech Project: this model serves as a preprocessor for TTS and a post-processor for ASR.
## Key Results
| Metric | Accuracy |
|---|---|
| Tone mark accuracy | 61.6% |
| Subdot accuracy (ị, ọ, ụ) | 88.2% |
| Overall diacritic accuracy | 78.7% |
| Word exact match | 34.3% |
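As a rough illustration of how the word exact match figure above can be scored, here is a minimal sketch. The helper name is ours, not the project's evaluation script:

```python
def word_exact_match(pred: str, ref: str) -> float:
    """Fraction of reference words reproduced exactly, diacritics included."""
    pred_words, ref_words = pred.split(), ref.split()
    hits = sum(p == r for p, r in zip(pred_words, ref_words))
    return hits / max(len(ref_words), 1)

# One wrong tone mark out of four words -> 0.75
print(word_exact_match("Kèdù kà i mèrè", "Kèdù kà ì mèrè"))  # 0.75
```

Scoring at the word level is strict: a single wrong or missing mark fails the whole word, which is why word exact match (34.3%) sits far below per-diacritic accuracy (78.7%).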
## Why This Matters
Most Igbo text online lacks diacritics. We measured the diacritic gap across three sources:
| Source | Tone marking rate |
|---|---|
| Well-toned corpus | 96% of vowels |
| IgboAPI dictionary | 46% of vowels |
| African Voices (crowd-sourced) | 14% of vowels |
78% of African Voices transcripts have zero tone marks. Without automatic restoration, TTS systems receive ambiguous input and ASR output lacks proper orthography.
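The tone-marking rate used above can be estimated by decomposing text with Unicode NFD and counting vowels that carry a combining tone mark. A minimal sketch; the exact set of marks counted (grave, acute, macron) is our assumption:

```python
import unicodedata

VOWELS = set("aeiou")
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}  # combining grave, acute, macron

def tone_marking_rate(text: str) -> float:
    """Fraction of vowels followed by at least one combining tone mark."""
    chars = unicodedata.normalize("NFD", text.lower())
    vowel_count = marked = 0
    for idx, ch in enumerate(chars):
        if ch in VOWELS:
            vowel_count += 1
            # scan the combining marks attached to this vowel
            j = idx + 1
            while j < len(chars) and unicodedata.combining(chars[j]):
                if chars[j] in TONE_MARKS:
                    marked += 1
                    break
                j += 1
    return marked / vowel_count if vowel_count else 0.0

print(tone_marking_rate("Kedu ka i mere"))  # 0.0 (no tone marks)
print(tone_marking_rate("Kèdù kà ì mèrè"))  # 1.0 (every vowel marked)
```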
## Model Details
| Property | Value |
|---|---|
| Base model | google/byt5-small |
| Architecture | ByT5 (byte-level T5, encoder-decoder) |
| Parameters | 300M |
| Task | Seq2seq: plain text → fully diacriticized text |
| Training data | 46,057 sentences (rebalanced: 56% toned corpus, 28% IgboAPI, 16% Bible) |
| Training time | ~32 hours on Apple M4 MPS (est. ~48 min on H100) |
| Inference | `num_beams=4`, `max_length=512` |
## Training Data Composition (v2, rebalanced)
| Source | Sentences | Tone density | Weight |
|---|---|---|---|
| Well-toned corpus (4× oversample) | 26,800 | 96% | 56% |
| IgboAPI dictionary (3× oversample, normalized) | 12,900 | 46% | 28% |
| Igbo Bible (capped at 8K) | 6,400 | ~0% (subdots only) | 16% |
Key insight: v1 used 76% Bible data (no tones) and achieved only 48.4% tone accuracy. Rebalancing to 84% toned data in v2 improved tone accuracy to 61.6% (+13 pp). Data composition > model size.
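The rebalancing amounts to oversampling the toned sources and capping the Bible portion before shuffling. A toy sketch whose function name and multipliers mirror the table above, but which is not the project's actual data script:

```python
import random

def rebalance(toned, igboapi, bible, *, toned_x=4, api_x=3, bible_cap=8000, seed=0):
    """Oversample toned sources, cap the untoned Bible, then shuffle."""
    mixed = toned * toned_x + igboapi * api_x + bible[:bible_cap]
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy example: 1 toned, 1 dictionary, 3 Bible sentences with a cap of 2
mix = rebalance(["t"], ["d"], ["b1", "b2", "b3"], bible_cap=2)
print(len(mix))  # 9 = 4 toned + 3 dictionary + 2 Bible
```

Simple list repetition is the bluntest form of oversampling; it raises the share of toned examples per epoch without touching the model.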
## Usage
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_dir = "path/to/tone_model/best"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)
model.eval()

text = "Kedu ka i mere"
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
restored = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(restored)  # "Kèdù kà ì mèrè"
```
### Long Text
```python
from igbo_tts.tone_model.predict import ToneRestorer

restorer = ToneRestorer(model_dir="path/to/tone_model/best")
text = restorer.restore_long("Igbo bu asusu ndi Igbo. Anyi na-asu ya kwa ubochi.")
```
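`restore_long` presumably splits long input into sentence-sized chunks, restores each, and rejoins them (byte-level ByT5 inputs grow long quickly). A hedged sketch of that chunking, with `restore_fn` standing in for a single model call; the function and its signature are our assumption, not the project's API:

```python
import re

def restore_long(text, restore_fn, max_chars=256):
    """Split on sentence boundaries into chunks under max_chars, restore each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return " ".join(restore_fn(c) for c in chunks)

# With a dummy restore_fn, each chunk is passed through independently:
print(restore_long("Igbo bu asusu. Anyi na-asu ya.", str.upper, max_chars=16))
```

Splitting at sentence boundaries keeps each chunk self-contained, which matters for a model whose tone decisions depend on sentence context.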
## Pipeline Role

```
User text (untoned)         +--------------+          Toned text
"Kedu ka i mere"    ---->   |  Tone Model  |  ---->   "Kèdù kà ì mèrè"
                            | (this model) |
                            +--------------+
                                    |
                    +---------------+---------------+
                    v               v               v
                TTS input       ASR output      Keyboard
             (strip tones,    (add tones to   autocorrect
              keep subdots)    plain text)
```
For TTS: Restores subdots (ụ, ọ) which are essential for pronunciation. Tone marks are stripped before synthesis (F5-TTS is trained on untoned text).
For ASR: Post-processes untoned ASR output into proper Igbo orthography.
For keyboards: Real-time diacritization as users type plain Igbo text.
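The TTS-side "strip tones, keep subdots" step can be done with Unicode normalization: decompose, drop the combining grave and acute (tone marks), keep the dot-below (subdot), recompose. A minimal sketch, assuming only grave and acute carry tone here:

```python
import unicodedata

def strip_tones_keep_subdots(text: str) -> str:
    """Remove combining grave/acute tone marks; keep dot-below subdots."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in ("\u0300", "\u0301"))
    return unicodedata.normalize("NFC", kept)

print(strip_tones_keep_subdots("Kèdù kà ì mèrè"))  # "Kedu ka i mere"
```

Because subdotted vowels like ụ decompose to a base vowel plus U+0323 (dot below), they survive the filter and recompose intact under NFC.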
## API

Available via `api/server.py`:
```bash
# Single text
curl -X POST http://localhost:8000/diacriticize \
  -H "Content-Type: application/json" \
  -d '{"text": "Kedu ka i mere"}'

# Batch mode
curl -X POST http://localhost:8000/diacriticize \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Kedu ka i mere", "Igbo bu asusu anyi"]}'
```
Response fields: `input`, `output`, `tone_ratio_before`, `tone_ratio_after`.
## Weights
Model weights are not hosted on this repository. See the GitHub repo for access instructions.
Checkpoint files:
- `model.safetensors` (1.1 GB): ByT5-small fine-tuned weights
- `config.json`, `tokenizer_config.json`, `special_tokens_map.json`, `added_tokens.json`: model configs
- `generation_config.json`: beam search settings
## Known Limitations

- Tone accuracy plateaued at 61.6%, limited by training data quality and convention conflicts
- Single-word ambiguity: words like *akwa* (cry/cloth/egg/bed) require sentence context for correct tone
- ṅ (dot-above) accuracy is ~0%: too rare in training data (6 test instances)
- Convention conflict: training mixes full marking (corpus) and contrastive marking (IgboAPI)
## License
This model is released under CC-BY-NC-SA 4.0.
- BY: You must give appropriate credit
- NC: Non-commercial use only
- SA: Derivatives must use the same license
Note: The base model (ByT5-small) is Apache 2.0 and the training data (African Voices) is CC-BY-4.0, so this model could use a more permissive license. We use CC-BY-NC-SA for consistency across the Igbo Speech Project models.
## Citation

```bibtex
@misc{chimezie2026igbotone,
  title={Igbo Tone and Diacritic Restoration with ByT5},
  author={Chimezie, Emmanuel},
  year={2026},
  url={https://github.com/chimezie90/igbotts}
}
```
## Author

Emmanuel Chimezie (Mexkoy Labs)