# Sori-4B-FC (Function Calling)
A speech-to-function-calling model that listens to Korean speech and directly generates tool calls.
Fine-tuned from Seungyoun/Sori-4B-Base on Seungyoun/xlam-function-calling-60k-audio-kor, with LoRA applied to the LLM (Qwen3-4B-Instruct).
## Architecture
Korean Speech (Mel Spectrogram) → Qwen3-Omni Audio Encoder → Audio Projection → Qwen3-4B LLM → Tool Calls
- Audio Encoder: Qwen3-Omni-30B-A3B-Instruct (frozen)
- Audio Projection: Linear projection layer (frozen, from Stage 1)
- Language Model: Qwen3-4B-Instruct-2507 (LoRA fine-tuned, r=16, alpha=32)
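The LoRA configuration above (r=16, alpha=32) means each adapted weight matrix is used as W + (alpha / r) · B·A, where B and A are small low-rank matrices trained in Stage 2 while W stays frozen. A toy pure-Python sketch of that update rule (tiny stand-in dimensions, not the actual model code):

```python
# Toy illustration of a LoRA update: W' = W + (alpha / r) * (B @ A).
# The real adapters sit inside Qwen3-4B's projection layers with r=16, alpha=32.

def matmul(X, Y):
    """Naive matrix multiply for nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_update(W, A, B, r, alpha):
    """Return W + (alpha / r) * (B @ A); the frozen W is never modified in place."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
A = [[1.0, 0.0]]               # r x in  (r=1 for the toy)
B = [[0.0], [2.0]]             # out x r
print(lora_update(W, A, B, r=1, alpha=2))  # [[1.0, 0.0], [4.0, 1.0]]
```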
## Training
- Stage 1: Audio encoder + projection alignment on ASR data (Sori-4B-Base)
- Stage 2 (this model): LoRA fine-tuning on 18K Korean function-calling audio samples
- Only the assistant's `tool_call` tokens contribute to the loss; all other tokens are masked
- Tools are provided via Qwen3's native chat template (`tools` parameter)
- 5 epochs, effective batch size 32, learning rate 2e-5
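The loss masking described above can be sketched as follows. The `-100` ignore index is the usual Hugging Face convention for tokens excluded from the cross-entropy loss; the helper below is illustrative, not the actual training code:

```python
# Illustrative loss masking: only tokens inside <tool_call>...</tool_call>
# keep their label; every other position is set to -100 (ignored by the loss).
IGNORE_INDEX = -100

def mask_labels(tokens, input_ids):
    """tokens: decoded token strings; input_ids: the matching integer IDs."""
    labels = [IGNORE_INDEX] * len(input_ids)
    inside = False
    for i, tok in enumerate(tokens):
        if tok == "<tool_call>":
            inside = True
        if inside:
            labels[i] = input_ids[i]      # train on tool-call tokens only
        if tok == "</tool_call>":
            inside = False
    return labels

tokens = ["user", "<tool_call>", '{"name":"get_weather"}', "</tool_call>", "bye"]
ids = [10, 11, 12, 13, 14]
print(mask_labels(tokens, ids))  # [-100, 11, 12, 13, -100]
```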
## Usage
```python
from modeling_sori_speech import SoriSpeechForConditionalGeneration
from processing_sori_speech import SoriSpeechProcessor
from sori_speech_utils import process_mm_info
import torch

model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B-FC",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
processor = SoriSpeechProcessor.from_pretrained("Seungyoun/Sori-4B-FC")

# Define tools (Qwen3 chat template format)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    },
]

# Build conversation with audio input
conversation = [
    {
        "role": "system",
        "content": "You are a helpful voice assistant that can understand Korean speech and call tools when needed.",
    },
    {
        "role": "user",
        "content": [{"type": "audio", "audio": "weather.mp3"}],
    },
]

# Process
text = processor.apply_chat_template(
    conversation, tools=tools, add_generation_prompt=True, tokenize=False,
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True)
# Move tensors to the model's device; cast only floating-point tensors to the model dtype
inputs = {
    k: v.to(model.device).to(model.dtype)
    if isinstance(v, torch.Tensor) and v.is_floating_point()
    else v.to(model.device) if isinstance(v, torch.Tensor) else v
    for k, v in inputs.items()
}

# Generate
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
result = processor.decode(output_ids[0], skip_special_tokens=False)
result = result.replace("<|im_end|>", "").replace("<|endoftext|>", "").strip()
print(result)
# <tool_call>
# {"name": "get_weather", "arguments": {"city": "Seoul"}}
# </tool_call>
```
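The decoded output wraps each call in `<tool_call>` tags, as shown above. One way to turn that string into Python dicts, using only the standard library (a sketch; the tag format is Qwen3's, the helper name is ours):

```python
import json
import re

def parse_tool_calls(text):
    """Extract every {"name": ..., "arguments": ...} JSON block from <tool_call> tags."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(m) for m in pattern.findall(text)]

result = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Seoul"}}\n</tool_call>'
calls = parse_tool_calls(result)
print(calls[0]["name"], calls[0]["arguments"])  # get_weather {'city': 'Seoul'}
```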
## Example
| Audio (Korean Speech) | Generated Tool Call |
|---|---|
| "혹시 지금 서울 날씨가 어떻게 돼?" ("By the way, what's the weather like in Seoul right now?") | `get_weather({"city": "Seoul"})` |
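Once a call like the one in the table is parsed, it can be routed to a local function registry. The `get_weather` stub and the registry below are hypothetical scaffolding, not part of the model:

```python
# Hypothetical dispatch: route a parsed tool call to a registered Python function.
def get_weather(city):
    return f"Weather for {city}: (stub)"   # stand-in implementation

REGISTRY = {"get_weather": get_weather}

def dispatch(call):
    """call: a dict like {"name": ..., "arguments": {...}} from the model output."""
    fn = REGISTRY[call["name"]]
    return fn(**call["arguments"])

print(dispatch({"name": "get_weather", "arguments": {"city": "Seoul"}}))
# Weather for Seoul: (stub)
```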
## License
Apache 2.0