Røst-v3-chatterbox-500m

This is an open-source, state-of-the-art Danish Text-to-Speech (TTS) model, trained as part of the CoRal project by the Alexandra Institute.

Examples

The following texts were synthesised for the two predefined speakers, Mic and Nic, with the settings listed below (the audio players from the model page are omitted here):

Text: Dette er en tekst-til-tale model, trænet til at oplæse dansk. Lyt med og bedøm selv, hvordan den håndterer sprogets mange forskellige lyde og rytmer.

  • temp=0.8, top_p=0.95, c=0.5, min_p=0.05, repetition_pen=2.0
  • temp=0.6, top_p=0.95, c=0.5, min_p=0.05, repetition_pen=2.0

Text: København er Danmarks hovedstad og ligger på øerne Sjælland og Amager, hvor mange turister besøger de smukke kanaler og historiske bygninger.

  • temp=0.8, top_p=0.95, c=0.5, min_p=0.05, repetition_pen=2.0
  • temp=0.6, top_p=0.95, c=0.3, min_p=0.05, repetition_pen=2.0
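The settings above can be collected into reusable presets. A minimal sketch, assuming the Chatterbox `generate` method accepts these sampling keyword arguments and that `c` in the settings denotes the classifier-free-guidance weight (`cfg_weight`) — the preset names are invented here for illustration:

```python
# Sampling presets matching the example settings above. The keyword names
# (temperature, top_p, min_p, repetition_penalty, cfg_weight) are an
# assumption about the Chatterbox generate() API; "c" is taken to be the
# classifier-free-guidance weight.
PRESETS = {
    "expressive": dict(temperature=0.8, top_p=0.95, min_p=0.05,
                       repetition_penalty=2.0, cfg_weight=0.5),
    "stable": dict(temperature=0.6, top_p=0.95, min_p=0.05,
                   repetition_penalty=2.0, cfg_weight=0.5),
}

# Intended usage with the model from the quickstart below (commented out
# here, since it requires downloading the model):
# wav = model.generate(text, language_id="da", **PRESETS["stable"])
```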

Inference Quickstart

Start by installing the required libraries:

$ pip install chatterbox-tts huggingface_hub torchaudio

To run a simple inference:

import os

import torchaudio as ta
from huggingface_hub import snapshot_download
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

REPO_ID = "CoRal-project/roest-v3-chatterbox-500m"
device = "cuda"  # Change to "cpu" if no GPU is available

# Download the model weights and load the model
model_dir = snapshot_download(
    repo_id=REPO_ID,
    token=os.getenv("HF_TOKEN") or True,
    # Optional: filter to download only what you need
    allow_patterns=["*.safetensors", "*.json", "*.txt", "*.pt", "*.model"],
)
model = ChatterboxMultilingualTTS.from_local(model_dir, device=device)

text = "Hej, hvordan går det? Jeg er dansk Chatterbox, og taler naturligt dansk."
wav = model.generate(text, language_id="da")
ta.save("test.wav", wav, model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, language_id="da", audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)

Model Details

The model is a finetuned variant of Chatterbox Multilingual, one of the leading open-source text-to-speech models. The base Chatterbox model is built on a 0.5B Llama backbone and trained on more than 500,000 hours of high-quality multilingual speech data across 23 languages, including Danish.

We finetuned the model on over 2,000 hours of Danish speech to further improve performance and reliability. The model supports zero-shot voice cloning from as little as a 10-second audio prompt. It is compatible with the original Chatterbox library, making it easy to set up and retaining the built-in watermarking of outputs.

The model works very well with the two predefined speakers from the CoRal-tts dataset, Mic and Nic.

Known limitations

  • Though the original model supports exaggeration/intensity control, this model does not, as the training dataset lacks the required annotations.
  • The model does not handle longer text inputs well. Others have implemented streaming for this use case, and we recommend splitting longer texts into sentences.
  • The model only supports Danish, and English with a heavy Danish accent.
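The sentence-splitting workaround from the second point can be sketched as follows. This is a naive regex heuristic, not a full Danish sentence segmenter, and the concatenation step assumes the model from the quickstart above:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

long_text = (
    "København er Danmarks hovedstad. Byen ligger på Sjælland og Amager! "
    "Har du besøgt de historiske bygninger?"
)
sentences = split_sentences(long_text)

# Intended usage: generate one sentence at a time and concatenate the
# waveforms along the time axis (requires the loaded model):
# import torch
# wavs = [model.generate(s, language_id="da") for s in sentences]
# wav = torch.cat(wavs, dim=-1)
```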

Evaluation

The model was evaluated using Mean Opinion Score (MOS), achieving a score of 4.23 as rated by a panel of 20 native Danish speakers.

The evaluation used a set of 10 samples for two different speakers, Mic and Nic. Samples were generated with temp=0.7, top_p=0.95, top_k=600. The MOS scale was defined as follows:

| Score | Rating | Description |
|-------|--------|-------------|
| 1.0 | Bad | Speech sounds completely unnatural and artificial. The synthetic voice is very obvious and so distracting that it can be hard to listen to. |
| 2.0 | Poor | Speech sounds mostly unnatural. The synthetic quality is obvious and feels distracting, but still bearable. |
| 3.0 | Fair | Speech appears both natural and unnatural in roughly equal measure. It's clear the voice is synthetic, and it can be somewhat distracting. |
| 4.0 | Good | Speech sounds mostly natural. You can tell it's synthetic speech, but it's only slightly noticeable and not distracting. |
| 5.0 | Excellent | Speech sounds completely natural and cannot be distinguished from a real human voice. There are no audible signs of synthetic speech. |
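A Mean Opinion Score is simply the arithmetic mean of all individual panel ratings. A minimal sketch of the computation, using illustrative ratings rather than the actual study data:

```python
from statistics import mean

# Hypothetical ratings: each inner list is one rater's scores on the 1-5
# scale above (the real study used 20 native Danish speakers).
ratings = [
    [4, 5, 4, 4],
    [5, 4, 4, 5],
    [4, 4, 3, 5],
]

# MOS = mean over every individual rating, across all raters and samples.
mos = mean(score for rater in ratings for score in rater)
print(round(mos, 2))
```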

Training notes

The following hyperparameters were used during training:

| Parameter | Value |
|-----------|-------|
| Epochs | 2 |
| Batch size (per device) | 4 |
| Gradient accumulation steps | 6 |
| Learning rate | 8.0e-5 |
| Warmup steps | 200 |
| Gradient checkpointing | Disabled |
| Max gradient norm | 1.35 |
| Seed | 42 |
| Eval split size | 0.1% |
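The effective batch size implied by these settings is the per-device batch size times the gradient accumulation steps (times the number of devices, which the card does not state and is assumed to be 1 here):

```python
batch_size_per_device = 4
grad_accum_steps = 6
num_devices = 1  # assumption: the card does not state the device count

# Optimizer steps see this many examples per update.
effective_batch_size = batch_size_per_device * grad_accum_steps * num_devices
print(effective_batch_size)  # 24
```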

Training Data

The model was trained on the following datasets:

Creators and Funders

This model was trained, and this model card written, by Daniel Christopher Biørrith at the Alexandra Institute.

The CoRal project is funded by the Danish Innovation Fund and consists of the following partners:

Citation

@misc{roest-v3-chatterbox-500m,
  author    = {Daniel Christopher Biørrith and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
  title     = {Røst-v3-chatterbox-500m: A Danish state-of-the-art text-to-speech model},
  year      = {2026},
  url       = {https://huggingface.co/CoRal-project/roest-v3-chatterbox-500m},
}