---
library_name: transformers
tags:
  - generated_from_trainer
  - nlp
  - diacritic-restoration
  - yoruba
  - byt5
  - seq2seq
language:
  - yo
model-index:
  - name: byt5-yoruba-restoration-v5
    results:
      - task:
          name: Diacritic Restoration
          type: text2text-generation
        metrics:
          - name: Word Accuracy
            type: accuracy
            value: 0.8379
          - name: Underdot Accuracy
            type: accuracy
            value: 0.9235
          - name: WER
            type: wer
            value: 0.1628
          - name: CER
            type: cer
            value: 0.0558
          - name: BLEU
            type: bleu
            value: 0.6875
          - name: ChrF
            type: chrf
            value: 83.9125
---

# Yo-ByT5

This model is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) on a Yoruba dataset. It is designed to automatically restore diacritics (tone marks and underdots) to Yoruba text, which is crucial for lexical disambiguation and proper pronunciation in downstream tasks.

## Model Description

- **Model type:** Byte-level T5 (ByT5) for sequence-to-sequence generation.
- **Language(s):** Yoruba (`yo`)
- **Task:** Diacritic Restoration (automatic diacritization)
- **Developed by:** Gali Ahmad Samuel (lazymonster)
- **Shared by:** Gali Ahmad Samuel (lazymonster)

Yoruba is a tonal language where the meaning of a word relies heavily on tone marks (acute and grave accents) and underdots. This model takes non-diacritized (or partially diacritized) text as input and outputs the fully diacritized text.
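For illustration, the undiacritized input the model expects can be produced from fully diacritized text by Unicode decomposition, which separates base letters from combining tone marks and underdots. This is a standard-library sketch for demonstration only, not part of the model's own pipeline:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Yoruba tone marks and underdots, e.g. 'ọ̀wọ̀' -> 'owo'."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (acute, grave, dot below, ...).
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("Mo ń lọ sí ilé ìwé"))  # -> "Mo n lo si ile iwe"
```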

## Intended Uses & Limitations

### Intended Uses

- **Preprocessing:** Cleaning text for Text-to-Speech (TTS) or Machine Translation (MT) systems where accurate diacritics are mandatory.
- **Search engines:** Normalizing user queries in Yoruba.
- **Linguistic analysis:** Assisting in the annotation of low-resource language datasets.

### Limitations

- The model may struggle with proper nouns or ambiguous contexts where multiple valid diacritization patterns exist for the same character sequence (e.g., *owo* could be *owó* [money], *ọwọ́* [hand], or *ọ̀wọ̀* [honor]).
- Inference is slower than with word-level models, because ByT5's byte-level tokenization produces much longer input sequences.
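The byte-level cost is easy to see with plain Python, independently of the model: each diacritized Yoruba character expands to several UTF-8 bytes, so the sequences ByT5 processes are considerably longer than the character count suggests.

```python
# ByT5 operates on raw UTF-8 bytes rather than words or subwords.
word = "ọ̀wọ̀"  # "honor": 4 visible letters
print(len(word))                  # Unicode codepoints (base letters + combining marks)
print(len(word.encode("utf-8")))  # bytes the model must actually process
```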

## Training and Evaluation Data

More information needed.

## Training Procedure

The model was trained with the Hugging Face `Seq2SeqTrainer` on Google Cloud TPUs.

### Training Hyperparameters

The following hyperparameters were used during training:

- **Learning rate:** 2e-4
- **Effective train batch size:** 32
- **Eval batch size:** 16
- **Seed:** 42
- **Optimizer:** AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- **LR scheduler:** linear
- **Epochs:** 20
- **Hardware:** Google Cloud TPU v6e-8
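The settings above could be expressed as a `Seq2SeqTrainingArguments` configuration roughly like the sketch below. This is a hypothetical reconstruction, not the actual training script: the output directory name is invented, and the per-device batch size of 4 assumes the effective batch of 32 was split evenly across the 8 TPU cores.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the configuration described above.
training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-yoruba-restoration",  # invented name
    learning_rate=2e-4,
    per_device_train_batch_size=4,   # 4 x 8 TPU cores = effective batch of 32 (assumed split)
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=20,
    predict_with_generate=True,
)
```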

### Framework Versions

- Transformers 4.53.3
- PyTorch 2.6.0+cu124
- Datasets 4.4.1
- Tokenizers 0.21.2
- torch_xla (TPU support)

## Evaluation Results

The model was evaluated on a held-out test set using beam search (`num_beams=5`).

| Metric | Value | Description |
|---|---|---|
| Word Accuracy | 83.79% | Percentage of words perfectly reconstructed. |
| Underdot Accuracy | 92.35% | Accuracy of restoring sub-character underdots. |
| WER | 0.1628 | Word Error Rate (lower is better). |
| CER | 0.0558 | Character Error Rate (lower is better). |
| Yoruba DER | 0.0397 | Diacritic Error Rate specific to Yoruba markers. |
| BLEU | 0.6875 | Bilingual Evaluation Understudy score. |
| ChrF | 83.91 | Character n-gram F-score. |
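For reference, WER and CER are both edit-distance rates: the Levenshtein distance between prediction and reference, divided by the reference length, counted over words or characters respectively. A minimal pure-Python version of that computation (my own sketch, not the evaluation script behind this card) looks like:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(wer("Mo ń lọ sí ilé ìwé", "Mo ń lo sí ilé ìwé"))  # 1 wrong word out of 6
```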

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lazymonster/yobyt5-restoration")
model = AutoModelForSeq2SeqLM.from_pretrained("lazymonster/yobyt5-restoration")

text = "Mo n lo si ile iwe"  # "I am going to school", without diacritics
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

outputs = model.generate(
    inputs["input_ids"],
    max_length=1024,
    num_beams=5,
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Expected output: "Mo ń lọ sí ilé ìwé"
```

## Acknowledgments

This model was trained using compute resources generously provided by the Google TPU Research Cloud (TRC).