Automatic Speech Recognition
Transformers
PyTorch
Safetensors
wav2vec2
mms
audio
voice
speech
forced-alignment
How to use MahmoudAshraf/mms-300m-1130-forced-aligner with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="MahmoudAshraf/mms-300m-1130-forced-aligner")

# Load model directly
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("MahmoudAshraf/mms-300m-1130-forced-aligner")
model = AutoModelForCTC.from_pretrained("MahmoudAshraf/mms-300m-1130-forced-aligner")
Forced Alignment with Hugging Face CTC Models
This Python package provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that uses much less memory than the TorchAudio forced alignment API.
The model checkpoint uploaded here is a conversion, from torchaudio to HF Transformers, of the MMS-300M checkpoint trained on a forced alignment dataset.
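To illustrate what forced alignment computes, here is a toy sketch of CTC Viterbi alignment in pure Python. It is not this package's implementation, which is optimized and runs on real model emissions. The function name `ctc_forced_align`, the blank index `0`, and the example probabilities are assumptions made for illustration only.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Toy CTC Viterbi alignment (illustrative, not the package's code).

    log_probs[t][v] is the log-probability of label v at frame t.
    Returns one (first_frame, last_frame) pair per target token.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    # Interleave blanks: [blank, t1, blank, t2, ..., tL, blank]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    n_frames, n_states = len(log_probs), len(ext)
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    # A path may start on the leading blank or on the first token.
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, n_frames):
        for s in range(n_states):
            # Predecessors: stay, advance by one state, or skip the
            # blank between two different tokens.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue  # state s is not reachable at frame t
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A path must end on the last token or on the trailing blank.
    last = n_frames - 1
    s = n_states - 1
    if score[last][n_states - 2] > score[last][n_states - 1]:
        s = n_states - 2
    if score[last][s] == neg_inf:
        raise ValueError("not enough frames to align all tokens")
    # Walk the back-pointers and record the frames each token occupies.
    spans = [None] * len(tokens)
    for t in range(last, -1, -1):
        if s % 2 == 1:  # odd states hold target tokens
            i = s // 2
            end = spans[i][1] if spans[i] else t
            spans[i] = (t, end)
        s = back[t][s]
    return spans


# Five frames over a 3-label vocabulary (0 = blank), made-up values.
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
frames = [[math.log(p) for p in row] for row in probs]
toy_spans = ctc_forced_align(frames, tokens=[1, 2])
print(toy_spans)  # [(1, 2), (4, 4)]
```

Each pair is a range of frame indices. Converting frames to seconds depends on the model's frame stride, which the usage code below obtains from `generate_emissions`.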
Installation
pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
Usage
import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

audio_path = "your/audio/path"
text_path = "your/text/path"
language = "iso"  # ISO-639-3 language code
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16

# Load the alignment model and tokenizer (half precision on GPU).
alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

# Load the audio using the model's dtype and device.
audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

# Read the transcript and collapse it into a single line.
with open(text_path, "r") as f:
    lines = f.readlines()
text = "".join(line for line in lines).replace("\n", " ").strip()

# Compute the model emissions for the audio in batches.
emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

# Prepare the transcript for alignment.
tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)

# Align the tokens to the emissions.
segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

spans = get_spans(tokens_starred, segments, blank_token)

word_timestamps = postprocess_results(text_starred, spans, stride, scores)
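As a possible next step, the timestamps can be written out as subtitles. The sketch below assumes that `word_timestamps` is a list of dicts with `"start"` and `"end"` in seconds and a `"text"` field. That structure is an assumption here, so check the actual return value before relying on it. The helper names `to_srt_time` and `words_to_srt` and the sample data are hypothetical.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(word_timestamps):
    """Build an SRT document with one cue per aligned entry.

    Assumes each entry is a dict with "start", "end" (seconds) and "text".
    """
    cues = []
    for index, word in enumerate(word_timestamps, start=1):
        start = to_srt_time(word["start"])
        end = to_srt_time(word["end"])
        cues.append(f"{index}\n{start} --> {end}\n{word['text']}\n")
    return "\n".join(cues)


# Made-up sample in the assumed shape, standing in for real output.
example_timestamps = [
    {"start": 0.0, "end": 0.42, "text": "hello"},
    {"start": 0.5, "end": 1.25, "text": "world"},
]
print(words_to_srt(example_timestamps))
```

With real output, passing `word_timestamps` instead of `example_timestamps` should work as long as the assumed keys are present.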