Link model to paper and add metadata
Hi! I'm Niels from the community science team at Hugging Face.
This PR improves the model card by:
- Adding the arXiv ID (`2512.23808`) to the metadata to link this repository to the [Hugging Face paper page](https://huggingface.co/papers/2512.23808).
- Adding `library_name: transformers` to the metadata, since the `config.json` indicates compatibility with the Transformers library.
- Updating the Markdown content to include direct links to the research paper for better accessibility.
These changes help users discover the research behind the model and utilize Hub features more effectively.
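Since the metadata edits in this PR are plain changes to the README's YAML front matter, they can also be applied by script. A minimal sketch (the `add_metadata` helper and the sample README string are illustrative, not part of this PR) of inserting a key just before the closing `---` of the front matter:

```python
def add_metadata(text: str, key: str, value: str) -> str:
    """Insert `key: value` just before the closing '---' of YAML front matter.

    Assumes `text` starts with an opening '---' line and the front matter
    is closed by the first subsequent '---' line.
    """
    # head still starts with the opening '---'; body is everything after the closer
    head, body = text.split("\n---\n", 1)
    return f"{head}\n{key}: {value}\n---\n{body}"


readme = """---
license: mit
pipeline_tag: any-to-any
---

# MiMo-Audio
"""

# The two keys this PR adds:
readme = add_metadata(readme, "library_name", "transformers")
readme = add_metadata(readme, "arxiv", "2512.23808")
print(readme)
```

In practice the model-card utilities in `huggingface_hub` parse and rewrite this block robustly; the string split above is only meant to show where the new keys land.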
README.md (CHANGED)
````diff
@@ -1,13 +1,16 @@
 ---
 license: mit
 pipeline_tag: any-to-any
+library_name: transformers
 tags:
 - Audio-to-Text
 - Text-to-Audio
 - Audio-to-Audio
 - Text-to-Text
 - Audio-Text-to-Text
+arxiv: 2512.23808
 ---
+
 <div align="center">
 <picture>
 <source srcset="https://github.com/XiaomiMiMo/MiMo-VL/raw/main/figures/Xiaomi_MiMo_darkmode.png?raw=true" media="(prefers-color-scheme: dark)">
@@ -32,7 +35,7 @@ tags:
 
 <a href="https://github.com/XiaomiMiMo/MiMo-Audio" target="_blank">🤖 GitHub</a>
 
-<a href="https://
+<a href="https://huggingface.co/papers/2512.23808" target="_blank">📄 Paper</a>
 
 <a href="https://xiaomimimo.github.io/MiMo-Audio-Demo" target="_blank">📰 Blog</a>
 
@@ -50,6 +53,8 @@ tags:
 
 Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks, spoken dialogue benchmarks, and instruct-TTS evaluations, approaching or surpassing closed-source models.
 
+This repository contains the model weights for **MiMo-Audio**, presented in the paper [MiMo-Audio: Audio Language Models are Few-Shot Learners](https://huggingface.co/papers/2512.23808).
+
 <p align="center">
 <img width="95%" src="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/assets/Results.png?raw=true">
 </p>
@@ -73,7 +78,7 @@ MiMo-Audio couples a patch encoder, an LLM, and a patch decoder to improve model
 ## Explore MiMo-Audio Now! 🚀🚀🚀
 - 🎧 **Try the Hugging Face demo:** [MiMo-Audio Demo](https://huggingface.co/spaces/XiaomiMiMo/mimo_audio_chat)
 - 📰 **Read the Official Blog:** [MiMo-Audio Blog](https://xiaomimimo.github.io/MiMo-Audio-Demo)
-- 📄 **Dive into the Technical Report:** [MiMo-Audio Technical Report](https://
+- 📄 **Dive into the Technical Report:** [MiMo-Audio Technical Report](https://huggingface.co/papers/2512.23808)
 
 
 ## Model Download
@@ -158,7 +163,7 @@ This toolkit is designed to evaluate MiMo-Audio and other recent audio LLMs as m
 title={MiMo-Audio: Audio Language Models are Few-Shot Learners},
 author={LLM-Core-Team Xiaomi},
 year={2025},
-url={
+url={https://huggingface.co/papers/2512.23808},
 }
 ```
 
````