Instructions to use microsoft/Phi-4-multimodal-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Phi-4-multimodal-instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="microsoft/Phi-4-multimodal-instruct", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
update readme
README.md
CHANGED
@@ -152,7 +152,7 @@ To understand the capabilities, Phi-4-multimodal-instruct was compared with a s
 
 The Phi-4-multimodal-instruct was observed as
 - Having strong automatic speech recognition (ASR) and speech translation (ST) performance, surpassing expert ASR model WhisperV3 and ST models SeamlessM4T-v2-Large.
-- Ranking number 1 on the Huggingface OpenASR leaderboard with word error rate 6.14% in comparison with the current best model 6.5% as of
+- Ranking number 1 on the [Huggingface OpenASR](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) leaderboard with word error rate 6.14% in comparison with the current best model 6.5% as of March 04, 2025.
 - Being the first open-sourced model that can perform speech summarization, and the performance is close to GPT4o.
 - Having a gap with close models, e.g. Gemini-1.5-Flash and GPT-4o-realtime-preview, on speech QA task. Work is being undertaken to improve this capability in the next iterations.
 
@@ -468,8 +468,6 @@ response = processor.batch_decode(
 print(f'>>> Response\n{response}')
 ```
 
-**Notes**:
-
 ## Responsible AI Considerations
 
 Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
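For readers unfamiliar with the metric behind the 6.14% and 6.5% figures in the first hunk: word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is illustrative only and is not the leaderboard's scoring code. The function name `word_error_rate` is hypothetical, and the sketch splits on whitespace without the text normalization (casing, punctuation) that published benchmarks usually apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are whitespace-separated and compared exactly;
    no casing or punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion: reference word missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.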