fine tune for LLM tasks

#19
by itaipee - opened

Hi,
The model is interesting. There are plenty of tasks I'd like to try where a mix of LLM and audio features could work together to produce more accurate output:

  • Split the transcription into speakers.
  • Contextual biasing: increase accuracy for domain-specific words in the transcription (e.g. medical terms, airport names) that are more likely to be mis-transcribed in a regular transcription.
  • Summarize the audio.
  • Private information redaction (remove credit card numbers and other such info).

So my question is: how do I do it?
Just entering a prompt does not work. I tried that; the transcription works fine and is very accurate, but no matter what I ask in the prompt, it just gives me the same transcription with small variations.

So I probably need to fine-tune, at least the adapter, the Q-Former projector, and the LLM.
There is a script here which seems to do just that according to the code, but the description gives the impression it is more for making the speech recognition better for a specific domain/accent:
https://colab.research.google.com/github/ibm-granite/granite-speech-models/blob/main/notebooks/fine_tuning_granite_speech.ipynb

Also, in the paper you describe testing it on transcription and translation. Did you test it on more LLM-oriented tasks?

IBM Granite org

Hi. Thanks for finding the model interesting. As you have noticed, the model is only able to either transcribe speech in the original language (out of 5 supported languages) or translate it into a different language (out of 7). It is not capable of doing things like summarization, spoken question answering, chat completion from speech, etc. This is by design: we wanted the model to function in two separate passes, first transcription, then summarization, QA, chat completion, etc. applied to the transcribed text by the underlying text LLM, so that we can audit the quality and harmfulness of the spoken input. This two-pass approach with Granite LLM reuse is shown here for spoken QA: https://github.com/ibm-granite/granite-speech-models/blob/main/notebooks/two_pass_spoken_qa.ipynb
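For context, here is a minimal sketch of what that two-pass flow looks like. The transcribe() helper is a placeholder for the granite-speech transcription code from the model card / linked notebook, and the prompt and model ID for the second pass are illustrative assumptions, not the notebook's exact code:

```python
# Two-pass sketch: (1) transcribe the audio with granite-speech,
# (2) run the LLM-style task (summarization, QA, PII redaction, ...)
# on the transcript with the underlying Granite text LLM.
from transformers import pipeline

def transcribe(audio_path: str) -> str:
    """Placeholder: return the granite-speech transcription of audio_path.

    Plug in the transcription code from the granite-speech model card or
    the two_pass_spoken_qa notebook here.
    """
    raise NotImplementedError

# Second pass: a Granite instruct model handles the text-side task.
llm = pipeline(
    "text-generation",
    model="ibm-granite/granite-3.3-8b-instruct",  # assumed text LLM; swap as needed
    device_map="auto",
)

transcript = transcribe("meeting.wav")
messages = [
    {
        "role": "user",
        "content": (
            "Summarize the following transcript and redact any credit card "
            f"numbers or other private information:\n\n{transcript}"
        ),
    }
]
out = llm(messages, max_new_tokens=512)
# With chat-style input, generated_text contains the conversation; the last
# message is the assistant's reply.
print(out[0]["generated_text"][-1]["content"])
```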
Handling these tasks directly from audio would have required implementing safety/toxicity detection and guardrails on the audio side, which was outside the scope of this work. So, in short, to do the other spoken tasks you mention in a single pass, you would have to fine-tune granite-speech using the notebook you pointed to, and add different prompts for the new tasks by modifying the prep_example function in the FT notebook.
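As a rough illustration of that change, the idea is to pair each audio example with a task-specific instruction and the desired output, instead of the fixed transcription prompt. The dataset field names ("audio", "transcript", "summary") and the prompt wording below are assumptions for the sketch, not the FT notebook's actual prep_example; adapt them to the structure the notebook uses:

```python
# Hypothetical task-aware prep_example; field names and prompts are assumptions.
TASK_PROMPTS = {
    "transcribe": "Transcribe the speech into written text.",
    "summarize": "Listen to the audio and write a short summary.",
    "redact_pii": (
        "Transcribe the speech, replacing credit card numbers and other "
        "private information with [REDACTED]."
    ),
}

def prep_example(example: dict, task: str = "summarize") -> dict:
    # Use a task-specific instruction instead of the fixed transcription prompt.
    prompt = TASK_PROMPTS[task]
    # The target is whatever output the task requires (summary, redacted
    # transcript, ...); your fine-tuning data must supply it.
    target = example["summary"] if task == "summarize" else example["transcript"]
    return {
        "audio": example["audio"],
        "prompt": prompt,
        "target": target,
    }
```

The rest of the fine-tuning loop in the notebook (collation, LoRA/adapter setup, training) can stay as-is; only the prompt/target pairing changes per task.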
