Yo-ByT5
This model is a fine-tuned version of google/byt5-small on a Yoruba dataset. It is designed to automatically restore diacritics (tone marks and underdots) to Yoruba text, which is crucial for lexical disambiguation and proper pronunciation in downstream tasks.
Model Description
- Model Type: Byte-level T5 (ByT5) for sequence-to-sequence generation.
- Language(s): Yoruba (yo)
- Task: Diacritic Restoration (Automatic Diacritization)
- Developed by: Gali Ahmad Samuel (lazymonster)
- Shared by: Gali Ahmad Samuel (lazymonster)
Yoruba is a tonal language where the meaning of a word relies heavily on tone marks (acute and grave accents) and underdots. This model takes non-diacritized (or partially diacritized) text as input and outputs the fully diacritized text.
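To illustrate the task, undiacritized input can be derived from gold text by Unicode decomposition followed by dropping combining marks; the model learns the inverse mapping. A minimal standard-library sketch (not the author's preprocessing script):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Yoruba tone marks and underdots by dropping combining marks."""
    # NFD splits precomposed letters (e.g. U+1ECD "ọ") into a base letter
    # plus combining marks such as U+0323 (dot below) and U+0301 (acute).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("Mo ń lọ sí ilé ìwé"))  # -> Mo n lo si ile iwe
```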
Intended Uses & Limitations
Intended Uses
- Preprocessing: Cleaning text for Text-to-Speech (TTS) or Machine Translation (MT) systems where accurate diacritics are mandatory.
- Search Engines: Normalizing user queries in Yoruba.
- Linguistic Analysis: Assisting in the annotation of low-resource language datasets.
Limitations
- The model may struggle with proper nouns or ambiguous context where multiple valid diacritization patterns exist for the same character sequence (e.g., owo could be owó [money], ọwọ́ [hand], or ọ̀wọ̀ [honor]).
- Inference speed is slower than word-level models due to the byte-level tokenization of ByT5.
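The speed limitation follows directly from how ByT5 tokenizes: each UTF-8 byte becomes one token, and diacritized Yoruba letters occupy several bytes apiece. A quick standard-library check (no Transformers required) shows the expansion:

```python
# Each UTF-8 byte maps to one ByT5 token, so sequence length grows with
# the byte length of the text, not the word count.
for word in ["owo", "ọwọ́"]:
    n_bytes = len(word.encode("utf-8"))
    print(f"{word!r}: {len(word)} chars -> {n_bytes} bytes/tokens")
# "owo" is 3 bytes, while "ọwọ́" (two underdotted vowels plus a combining
# acute accent) is 9 bytes -- a 3x longer input sequence.
```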
Training and Evaluation Data
More information needed
Training Procedure
The model was trained using the Hugging Face Seq2SeqTrainer on Google Cloud TPUs.
Training Hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 2e-4
- Effective Train Batch Size: 32
- Eval Batch Size: 16
- Seed: 42
- Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
- LR Scheduler: Linear
- Num Epochs: 20
- Hardware: Google Cloud TPU v6e-8
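The hyperparameters listed above map onto `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch, not the original training script: the per-device batch split across the 8 TPU cores and the `output_dir` name are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters reported above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="yobyt5-restoration",
    learning_rate=2e-4,
    per_device_train_batch_size=4,  # assumed: 4 per core x 8 cores = 32 effective
    per_device_eval_batch_size=2,   # assumed: 2 per core x 8 cores = 16
    num_train_epochs=20,
    lr_scheduler_type="linear",     # linear decay, as listed above
    seed=42,
    predict_with_generate=True,     # needed for seq2seq metrics during eval
)
```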
Framework Versions
- Transformers 4.53.3
- Pytorch 2.6.0+cu124
- Datasets 4.4.1
- Tokenizers 0.21.2
- torch_xla (TPU support)
Evaluation Results
The model was evaluated on a held-out test set using beam search (num_beams=5).
| Metric | Value | Description |
|---|---|---|
| Word Accuracy | 83.79% | Percentage of words perfectly reconstructed. |
| Underdot Accuracy | 92.35% | Accuracy of restoring sub-character underdots. |
| WER | 0.1628 | Word Error Rate (lower is better). |
| CER | 0.0558 | Character Error Rate (lower is better). |
| Yoruba DER | 0.0397 | Diacritic Error Rate specific to Yoruba markers. |
| BLEU | 0.6875 | Bilingual Evaluation Understudy score. |
| ChrF | 83.91 | Character n-gram F-score. |
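For reference, "Word Accuracy" as defined in the table can be computed as a position-aligned exact-match rate. This is a minimal sketch of the metric as described, not the author's evaluation script:

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words reproduced exactly (position-aligned)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    correct = sum(r == h for r, h in zip(ref_words, hyp_words))
    return correct / max(len(ref_words), 1)

# One wrong tone mark on "ń" costs one of six words:
print(word_accuracy("Mo ń lọ sí ilé ìwé", "Mo n lọ sí ilé ìwé"))  # -> 5/6 ≈ 0.833
```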
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lazymonster/yobyt5-restoration")
model = AutoModelForSeq2SeqLM.from_pretrained("lazymonster/yobyt5-restoration")

text = "Mo n lo si ile iwe"  # "I am going to school", without diacritics
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

outputs = model.generate(
    **inputs,  # passes input_ids and attention_mask
    max_length=1024,
    num_beams=5,
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Expected output: "Mo ń lọ sí ilé ìwé"
```
Acknowledgments
This model was trained using compute resources generously provided by the Google TPU Research Cloud (TRC).