microsoft / Phi-4-multimodal-instruct



nguyenbh committed on Mar 4, 2025

Commit 607bf62 · verified · 1 parent: a34dad0

update readme

Files changed (1)

README.md +1 -3

README.md CHANGED

````diff
@@ -152,7 +152,7 @@ To understand the capabilities, Phi-4-multimodal-instruct  was compared with a s
 The Phi-4-multimodal-instruct was observed as
 - Having strong automatic speech recognition (ASR) and speech translation (ST) performance, surpassing expert ASR model WhisperV3 and ST models SeamlessM4T-v2-Large.
-- Ranking number 1 on the Huggingface OpenASR leaderboard with word error rate 6.14% in comparison with the current best model 6.5% as of Jan 17, 2025.
+- Ranking number 1 on the [Huggingface OpenASR](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) leaderboard with word error rate 6.14% in comparison with the current best model 6.5% as of March 04, 2025.
 - Being the first open-sourced model that can perform speech summarization, and the performance is close to GPT4o.
 - Having a gap with close models, e.g. Gemini-1.5-Flash and GPT-4o-realtime-preview, on speech QA task. Work is being undertaken to improve this capability in the next iterations.
@@ -468,8 +468,6 @@ response = processor.batch_decode(
 print(f'>>> Response\n{response}')
 ```
-**Notes**:
 ## Responsible AI Considerations
 Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
````
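The leaderboard entry edited in this commit compares a word error rate (WER) of 6.14% with 6.5% for the current best model, a gap of 0.36 percentage points. The following is a minimal Python sketch of what that metric counts. It assumes the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a hypothesis transcript and a reference transcript, divided by the number of reference words. The function name `word_error_rate` is hypothetical. The sketch also splits on whitespace only and skips whatever text normalization and dataset aggregation the Open ASR leaderboard applies, so it will not reproduce leaderboard numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: word-level Levenshtein distance / number of reference words.

    Hypothetical helper for illustration only. It applies no text normalization
    (casing, punctuation) and tokenizes by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (initially i-1 = 0, so j insertions).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        # curr[0] = i: turning i reference words into zero hypothesis words takes i deletions.
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + sub_cost,  # substitution, or match when cost is 0
            )
        prev = curr

    return prev[len(hyp)] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` counts one substitution over six reference words, giving about 0.167. Keeping only two rows of the dynamic-programming table holds memory linear in the hypothesis length.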