---
language:
- en
- zh
license: mit
tags:
- audio tokenizer
library_name: transformers
pipeline_tag: feature-extraction
---

# 🚨 _Note: This is a draft model card. Actual model links can be found in [this collection](https://huggingface.co/collections/bezzam/vibevoice)._

# VibeVoice-SemanticTokenizer

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.

The model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models.

➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)

➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)

# Models

🚨 _Note: This is a draft model card. Actual model links can be found in [this collection](https://huggingface.co/collections/bezzam/vibevoice)._

| Model | Context Length | Generation Length | Weights |
|-------|----------------|-------------------|---------|
| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
| VibeVoice-7B | 32K | ~45 min | [HF link](https://huggingface.co/microsoft/VibeVoice-7B) |
| VibeVoice-AcousticTokenizer | - | - | [HF link](https://huggingface.co/microsoft/VibeVoice-AcousticTokenizer) |
| VibeVoice-SemanticTokenizer | - | - | This model |

# Usage

Below is an example of encoding audio with this tokenizer to extract semantic features:

```python
import torch
from transformers import AutoFeatureExtractor, VibeVoiceSemanticTokenizerModel
from transformers.audio_utils import load_audio_librosa

model_id = "bezzam/VibeVoice-SemanticTokenizer"
sampling_rate = 24000

# load audio from the Hub and resample to 24 kHz
audio = load_audio_librosa(
    "https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=sampling_rate,
)

# load feature extractor and model
device = "cuda" if torch.cuda.is_available() else "cpu"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceSemanticTokenizerModel.from_pretrained(
    model_id,
    device_map=device,
).eval()

# preprocess audio (pad to a multiple of 3200 samples, i.e. one latent frame at 7.5 Hz)
inputs = feature_extractor(
    audio,
    sampling_rate=sampling_rate,
    padding=True,
    pad_to_multiple_of=3200,
    return_attention_mask=False,
    return_tensors="pt",
).to(device)
print("Input audio shape:", inputs.input_features.shape)
# Input audio shape: torch.Size([1, 1, 224000])

# encode to continuous semantic latents
with torch.no_grad():
    encoded_outputs = model.encode(inputs.input_features)
print("Latent shape:", encoded_outputs.latents.shape)
# Latent shape: torch.Size([1, 70, 128])
```
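The latent length follows directly from the tokenizer's 7.5 Hz frame rate: at a 24 kHz sampling rate, each latent frame covers 24000 / 7.5 = 3200 samples, which is why the example pads the waveform to a multiple of 3200. Below is a minimal sanity check of this arithmetic, using only the numbers printed above (the variable names are purely illustrative and not part of the model API):

```python
sampling_rate = 24000
frame_rate = 7.5  # Hz, the tokenizer frame rate reported for VibeVoice
samples_per_frame = int(sampling_rate / frame_rate)  # 3200, matches pad_to_multiple_of above

num_samples = 224000  # padded input length from the example
num_frames = num_samples // samples_per_frame
print(num_frames)  # 70, matching the latent sequence length

# At this frame rate, a 90-minute recording corresponds to roughly
# 90 * 60 * 7.5 = 40,500 latent frames, which is how long-form
# generation fits within the 64K-token context of VibeVoice-1.5B.
print(int(90 * 60 * frame_rate))  # 40500
```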