Matcha-TTS - LiteRT (on-device, FFT-free, GPU)
On-device English text-to-speech for Android via LiteRT CompiledModel. This is the
FFT-free TTS lane: Matcha-TTS pairs a
conditional flow-matching (CFM) acoustic model with a HiFi-GAN time-domain vocoder, so
there is no FFT/iSTFT anywhere in the synthesis path. 22.05 kHz, LJSpeech voice.
Converted from the official matcha_ljspeech + hifigan_T2_v1 checkpoints with
litert-torch, re-authored to be ML-Drift-GPU-clean
(per-graph tflite-vs-torch corr 1.000000; end-to-end waveform corr ≥0.99). fp16 weights.
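The card does not say how `corr` is computed. A minimal sketch, assuming it is the Pearson correlation between the flattened reference (torch) output and the converted (tflite) output; the function name and the toy data below are illustrative only:

```python
import math

def pearson_corr(a, b):
    """Pearson correlation between two equal-length sequences of floats."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return num / den

# Toy stand-ins (not real model outputs): a reference signal and a copy
# with a small gain and offset drift, as a converted graph might produce.
reference = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
converted = [r * 0.999 + 0.001 for r in reference]
corr = pearson_corr(reference, converted)
inverted = pearson_corr(reference, [-r for r in reference])
```

Pearson correlation ignores gain and offset, so a value near 1 says the shapes agree, not that the amplitudes match; an absolute-error check is a reasonable companion.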
Files
| File | Size | In → Out | Delegate (Pixel 8a) |
|---|---|---|---|
| matcha_textenc_fp16.tflite | 15 MB | emb[1,256,192] + mask[1,1,256] → mu[1,80,256], logw[1,1,256] | GPU |
| matcha_decoder_fp16.tflite | 23 MB | x,mu[1,80,512] + t_sin[1,160] + mask[1,1,512] → v[1,80,512] | CPU¹ |
| matcha_vocoder_fp16.tflite | 29 MB | mel[1,80,512] → wav[1,1,131072] | GPU |
| dp_g2p_matcha_fp16.tflite | 26 MB | text[1,96] (char ids) → logits[1,96,64] (IPA) | CPU |
| emb.bin | 0.1 MB | phoneme embedding table (178×192 f32, host lookup) | host |
| g2p_dict.txt.gz | 1.8 MB | 275k-entry espeak-IPA dictionary (primary G2P) | host |
| config.json, g2p_meta.json | - | symbols, shapes, mel stats, G2P tokenizer tables | host |

¹ The CFM decoder runs on the CompiledModel CPU delegate. It converts GPU-clean and is correct on CPU, but the Mali ML Drift GPU delegate mis-fuses the decoder's transformer blocks at large activation magnitude: the same block is correct as a standalone GPU graph (corr 0.984) but collapses to corr 0.006 when fused, which points to a graph-fusion bug rather than a bad op. The text encoder and vocoder run on the GPU; the GPU vocoder dominates wall time, so the pipeline stays realtime (RTF ~0.8).
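For reference, the real-time factor quoted above can be reproduced from a wall-clock measurement. A minimal sketch, assuming the usual definition (synthesis time divided by the duration of the audio produced); the function name and the timing figures are hypothetical, not measurements from this model:

```python
SAMPLE_RATE = 22050  # Hz, as stated on this card

def real_time_factor(synthesis_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# Hypothetical example: 4.0 s of compute for 110250 samples (5.0 s of audio).
rtf = real_time_factor(4.0, 110250)
is_realtime = rtf < 1.0  # below 1.0 means faster than playback
```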
Pipeline (host orchestration)
text --G2P(CPU dict+neural)--> phoneme ids
--host: embed + intersperse + pad--> text_encoder(GPU) -> mu, logw
--host: durations + length-regulator--> mu_y[1,80,T]
--host: Euler ODE loop (N steps)--> decoder(CPU) x N -> v
--host: denormalize--> vocoder(GPU) -> waveform
Fixed shapes (256 phonemes, 512 mel frames ≈ 5.9 s); a runtime float mask makes padded positions a no-op so one compiled graph handles any length.
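The fixed-shape arithmetic and the duration-to-frame mapping can be checked with a small standard-library sketch. It mirrors the host-side length regulator from the Python example on this card, but the function name and toy inputs are illustrative, and the hop of 256 samples is inferred from the vocoder shapes (131072 / 512) rather than read from config.json:

```python
import bisect
import math
from itertools import accumulate

MAX_FRAMES = 512     # fixed mel length from the card
HOP = 256            # inferred: 131072 output samples / 512 mel frames
SAMPLE_RATE = 22050  # Hz

def length_regulate(logw, mask, length_scale=0.95, max_frames=MAX_FRAMES):
    """Map each mel frame to a phoneme index and build the frame mask."""
    # Padded positions have mask 0, so their duration is ceil(0) = 0.
    durations = [math.ceil(math.exp(lw) * m) * length_scale
                 for lw, m in zip(logw, mask)]
    cum = list(accumulate(durations))
    ylen = int(min(max(cum[-1], 1), max_frames))
    # Frame j belongs to the first phoneme whose cumulative duration exceeds j.
    frame_to_phoneme = [min(bisect.bisect_right(cum, j), len(cum) - 1)
                        for j in range(ylen)]
    frame_mask = [1.0 if j < ylen else 0.0 for j in range(max_frames)]
    return frame_to_phoneme, frame_mask

# Toy input: three real phonemes plus one padded slot (mask 0).
logw = [0.5, 1.0, 0.0, 0.0]
mask = [1.0, 1.0, 1.0, 0.0]
frames, frame_mask = length_regulate(logw, mask, length_scale=1.0)
# frames == [0, 0, 1, 1, 1, 2]: durations 2, 3, 1 and nothing for the pad

clip_seconds = MAX_FRAMES * HOP / SAMPLE_RATE  # about 5.94 s
```

The frame mask always has the full 512 entries; only the first `ylen` are 1.0, which is what lets a single compiled graph serve shorter utterances.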
Minimal usage
Android (Kotlin, LiteRT CompiledModel)
fun load(name: String, acc: Accelerator) = // models staged in filesDir
    CompiledModel.create(File(filesDir, name).absolutePath, CompiledModel.Options(acc), null)
val textenc = load("matcha_textenc_fp16.tflite", Accelerator.GPU)
val decoder = load("matcha_decoder_fp16.tflite", Accelerator.CPU) // Mali mis-fuses this graph on GPU
val vocoder = load("matcha_vocoder_fp16.tflite", Accelerator.GPU)
val teIn = textenc.createInputBuffers(); val teOut = textenc.createOutputBuffers()
teIn[0].writeFloat(emb) // [1,256,192] host phoneme-embedding lookup (emb.bin), blanks interspersed
teIn[1].writeFloat(tmask) // [1,1,256] 1 = real phoneme position
textenc.run(teIn, teOut) // -> mu[1,80,256], logw[1,1,256]
// host: durations ceil(exp(logw))·0.95 -> length-regulate mu -> mu_y[1,80,512]; 10 Euler steps of
// decoder(x, mu_y, t_sin[1,160], ymask[1,1,512]); mel = x·2.116101 − 5.536622 -> vocoder -> wav.
// Full pipeline: the text_to_speech (Matcha-TTS) sample in google-ai-edge/litert-samples.
Python (desktop verification)
import gzip, json, math, numpy as np, soundfile as sf
from ai_edge_litert.interpreter import Interpreter
MAXT, MAXM, LS = 256, 512, 0.95
cfg = json.load(open("config.json")) # symbols, mel stats, hop, sample rate
SYM = {s: i for i, s in enumerate(cfg["symbols"])}
DICT = dict(l.rstrip("\n").split("\t", 1) for l in
            gzip.open("g2p_dict.txt.gz", "rt", encoding="utf-8") if "\t" in l)
emb = np.fromfile("emb.bin", "<f4").reshape(178, 192) # phoneme embedding table
def run(path, *ins):
    it = Interpreter(model_path=path)
    it.allocate_tensors()
    for d, x in zip(it.get_input_details(), ins):
        it.set_tensor(d["index"], x.astype(np.float32))
    it.invoke()
    return [it.get_tensor(o["index"]) for o in it.get_output_details()]
# text -> espeak-IPA -> symbol ids (dictionary G2P; the neural OOV fallback is skipped here)
ipa = " ".join(DICT[w] for w in "the quick brown fox jumps over the lazy dog".split()) + "."
pids = [SYM[c] for c in ipa if c in SYM]
ids = np.zeros(MAXT, np.int64); ids[1:2 * len(pids):2] = pids # intersperse blanks (id 0)
tmask = (np.arange(MAXT) < 2 * len(pids) + 1).astype(np.float32)[None, None]
mu, logw = sorted(run("matcha_textenc_fp16.tflite", emb[ids][None], tmask),
                  key=lambda a: -a.shape[1]) # mu[1,80,256], logw[1,1,256]
w = np.ceil(np.exp(logw[0, 0]) * tmask[0, 0]) * LS # durations -> length regulator
cum = np.cumsum(w); ylen = int(min(max(cum[-1], 1), MAXM))
mu_y = np.zeros((1, 80, MAXM), np.float32)
mu_y[0, :, :ylen] = mu[0][:, np.searchsorted(cum, np.arange(ylen), "right").clip(max=MAXT - 1)]
ymask = (np.arange(MAXM) < ylen).astype(np.float32)[None, None]
def t_sin(t, half=80): # sinusoidal ODE-time embedding
    e = 1000.0 * t * np.exp(np.arange(half) * -math.log(10000) / (half - 1))
    return np.concatenate([np.sin(e), np.cos(e)]).astype(np.float32)[None]
x = np.zeros((1, 80, MAXM), np.float32) # Euler ODE, 10 steps
x[0, :, :ylen] = np.random.randn(80, ylen); N = 10
for k in range(N):
    x += run("matcha_decoder_fp16.tflite", x, mu_y, t_sin(k / N), ymask)[0] / N
mel = np.zeros_like(x); mel[0, :, :ylen] = x[0, :, :ylen] * cfg["mel_std"] + cfg["mel_mean"]
wav = run("matcha_vocoder_fp16.tflite", mel)[0].reshape(-1)[:ylen * cfg["hop"]]
sf.write("out.wav", np.clip(wav, -1, 1), cfg["sample_rate"])
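The decoder loop above is a fixed-step Euler integration from t = 0 to t = 1. A minimal sketch of the same update rule with a stand-in velocity field instead of the decoder; `toy_velocity` and `target` are hypothetical and only illustrate the integration, not the model:

```python
import random

def euler_integrate(velocity, x0, n_steps=10):
    """Fixed-step Euler from t=0 to t=1: x <- x + v(x, t) / n_steps."""
    x = list(x0)
    for k in range(n_steps):
        t = k / n_steps
        v = velocity(x, t)
        x = [xi + vi / n_steps for xi, vi in zip(x, v)]
    return x

target = [0.3, -1.2, 2.0]  # stands in for one column of a mel-spectrogram

def toy_velocity(x, t):
    # Straight-line flow: head toward the target, scaled by the time remaining.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
result = euler_integrate(toy_velocity, noise, n_steps=10)
# With this straight-line field, Euler lands on the target at the last step.
```

In the real pipeline the velocity comes from `matcha_decoder_fp16.tflite`, and the step count trades quality against N decoder invocations.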
G2P (espeak-free)
Matcha-LJSpeech is trained on espeak en-us IPA, but espeak is GPL-licensed. The replacement used here is a 275k-entry espeak-IPA dictionary (from OpenPhonemizer, Clear BSD) as the primary lookup, with DeepPhonemizer (MIT) running on LiteRT CPU as the fallback for out-of-dictionary words. The output IPA maps 1:1 onto the keithito 178-symbol set.
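The dictionary-first lookup with a fallback could be organized as below. This is a sketch under assumptions: `make_g2p`, `toy_dict`, and `toy_fallback` are hypothetical names, the dictionary entries are illustrative rather than copied from g2p_dict.txt.gz, and the fallback is a placeholder where the neural model (dp_g2p_matcha_fp16.tflite) would be called. Punctuation handling is left out.

```python
def make_g2p(dictionary, fallback):
    """Build a word-level G2P: dictionary lookup first, fallback for OOV words."""
    def g2p(text):
        out = []
        for word in text.lower().split():
            ipa = dictionary.get(word)
            if ipa is None:
                ipa = fallback(word)  # out-of-dictionary path
            out.append(ipa)
        return " ".join(out)
    return g2p

# Illustrative entries only; the real table is loaded from g2p_dict.txt.gz.
toy_dict = {"the": "ðə", "fox": "fˈɑːks"}

def toy_fallback(word):
    # Placeholder that marks OOV words instead of running the neural model.
    return "<" + word + ">"

g2p = make_g2p(toy_dict, toy_fallback)
phonemes = g2p("The quick fox")
```

Keeping the fallback as an injected callable lets the dictionary path be tested on the desktop without loading the neural model.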
Sample
See the LiteRT compiled_model_api/text_to_speech sample (Matcha-TTS) in
google-ai-edge/litert-samples for the full
Android app and the conversion scripts.
License
Model: MIT (Matcha-TTS / HiFi-GAN). G2P dict: Clear BSD (OpenPhonemizer) + MIT (DeepPhonemizer).
