Yo-ByT5

This model is a fine-tuned version of google/byt5-small on a Yoruba dataset. It is designed to automatically restore diacritics (tone marks and underdots) to Yoruba text, which is crucial for lexical disambiguation and proper pronunciation in downstream tasks.

Model Description

  • Model Type: Byte-level T5 (ByT5) for sequence-to-sequence text generation.
  • Language(s): Yoruba (yo)
  • Task: Diacritic Restoration (Automatic Diacritization)
  • Parameters: ~300M (F32, Safetensors)
  • Developed by: Gali Ahmad Samuel (lazymonster)
  • Shared by: Gali Ahmad Samuel (lazymonster)

Yoruba is a tonal language where the meaning of a word relies heavily on tone marks (acute and grave accents) and underdots. This model takes non-diacritized (or partially diacritized) text as input and outputs the fully diacritized text.
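The relationship between the bare input and the diacritized output can be illustrated by the reverse operation: stripping marks via Unicode decomposition. The sketch below uses Python's standard unicodedata module and is purely illustrative; it is not necessarily the preprocessing used to build this model's training pairs.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove tone marks and underdots via Unicode decomposition (illustrative)."""
    # NFD splits e.g. "ọ́" into "o" + COMBINING DOT BELOW + COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, then recompose to NFC.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("Mo ń lọ sí ilé ìwé"))  # Mo n lo si ile iwe
```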

Intended Uses & Limitations

Intended Uses

  • Preprocessing: Cleaning text for Text-to-Speech (TTS) or Machine Translation (MT) systems where accurate diacritics are mandatory.
  • Search Engines: Normalizing user queries in Yoruba.
  • Linguistic Analysis: Assisting in the annotation of low-resource language datasets.

Limitations

  • The model may struggle with proper nouns or ambiguous context where multiple valid diacritization patterns exist for the same character sequence (e.g., owo could be owó [money], ọwọ́ [hand], or ọ̀wọ̀ [honor]).
  • Inference is slower than with word- or subword-level models, because ByT5's byte-level tokenization produces much longer input and output sequences.

Training and Evaluation Data

More information needed

Training Procedure

The model was trained using the Hugging Face Seq2SeqTrainer on Google Cloud TPUs.

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning Rate: 2e-4
  • Effective Train Batch Size: 32
  • Eval Batch Size: 16
  • Seed: 42
  • Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
  • LR Scheduler: Linear
  • Num Epochs: 20
  • Hardware: Google Cloud TPU v6e-8
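As a rough sketch, the values above map onto Hugging Face Seq2SeqTrainingArguments-style fields as follows. Note the per-device split is an assumption (only the effective train batch size of 32 is reported); the field names are illustrative, not the exact training script.

```python
# Sketch of the reported hyperparameters; the per-device split (4 x 8 TPU
# cores) is an assumption, since only the effective size of 32 is reported.
NUM_TPU_CORES = 8  # TPU v6e-8

training_config = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 4,   # assumed split of the effective size
    "per_device_eval_batch_size": 2,    # 2 x 8 cores = 16, as reported
    "seed": 42,
    "optim": "adamw_torch",             # AdamW, betas=(0.9, 0.999), eps=1e-8
    "lr_scheduler_type": "linear",
    "num_train_epochs": 20,
}

effective_train_batch = training_config["per_device_train_batch_size"] * NUM_TPU_CORES
print(effective_train_batch)  # 32
```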

Framework Versions

  • Transformers 4.53.3
  • PyTorch 2.6.0+cu124
  • Datasets 4.4.1
  • Tokenizers 0.21.2
  • torch_xla (TPU support)

Evaluation Results

The model was evaluated on a held-out test set using beam search (num_beams=5).

Metric             Value    Description
Word Accuracy      83.79%   Percentage of words perfectly reconstructed.
Underdot Accuracy  92.35%   Accuracy of restoring sub-character underdots.
WER                0.1628   Word Error Rate (lower is better).
CER                0.0558   Character Error Rate (lower is better).
Yoruba DER         0.0397   Diacritic Error Rate specific to Yoruba markers.
BLEU               0.6875   Bilingual Evaluation Understudy score.
ChrF               83.91    Character n-gram F-score.
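For reference, word accuracy and CER can be computed as below. This is the standard formulation (Levenshtein distance over characters, exact word matches over aligned words), not necessarily the exact evaluation script behind the numbers above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of aligned words matching exactly (assumes equal word counts,
    which diacritic restoration normally preserves)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matches = sum(r == h for r, h in zip(ref_words, hyp_words))
    return matches / len(ref_words)

ref = "Mo ń lọ sí ilé ìwé"
hyp = "Mo ń lọ si ilé ìwé"  # one word missing its accent
print(word_accuracy(ref, hyp))  # 5/6 ≈ 0.833
```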

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lazymonster/yobyt5-restoration")
model = AutoModelForSeq2SeqLM.from_pretrained("lazymonster/yobyt5-restoration")

text = "Mo n lo si ile iwe" # "I am going to school" without diacritics
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

outputs = model.generate(
    inputs["input_ids"],
    max_length=1024,
    num_beams=5
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Expected Output: "Mo ń lọ sí ilé ìwé"
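Since ByT5 operates on bytes, max_length=1024 covers only about a thousand characters, so longer documents need chunking. The helper below is a simple sentence-based splitter (a sketch, not part of the released model) that keeps each chunk under a byte budget so chunks can be restored independently and rejoined.

```python
import re

def chunk_text(text: str, max_bytes: int = 512) -> list[str]:
    """Split text at sentence boundaries into chunks under max_bytes (UTF-8)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then go through tokenizer/model.generate as above,
# and the restored chunks would be joined with spaces.
```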

Acknowledgments

This model was trained using compute resources generously provided by the Google TPU Research Cloud (TRC).
