Sori-4B-FC (Function Calling)

A speech-to-function-calling model that listens to Korean speech and generates tool calls.

Fine-tuned from Seungyoun/Sori-4B-Base on the Seungyoun/xlam-function-calling-60k-audio-kor dataset, using LoRA on the LLM (Qwen3-4B-Instruct).

Architecture

Korean Speech (Mel Spectrogram) → Qwen3-Omni Audio Encoder → Audio Projection → Qwen3-4B LLM → Tool Calls
  • Audio Encoder: Qwen3-Omni-30B-A3B-Instruct (frozen)
  • Audio Projection: Linear projection layer (frozen, from Stage 1)
  • Language Model: Qwen3-4B-Instruct-2507 (LoRA fine-tuned, r=16, alpha=32)
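
The LoRA setup on the language model can be sketched with peft. Only r=16 and alpha=32 come from this card; the target modules and dropout below are assumptions, not the released training config.

```python
# Sketch of the Stage-2 LoRA configuration (r and alpha from the card;
# target_modules and dropout are assumed, not confirmed by the card).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # LoRA rank (from the model card)
    lora_alpha=32,        # scaling alpha (from the model card)
    lora_dropout=0.05,    # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```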

Training

  • Stage 1: Audio encoder + projection alignment on ASR data (Sori-4B-Base)
  • Stage 2 (this model): LoRA fine-tuning on 18K Korean function-calling audio samples
    • Only the assistant's tool_call tokens are trained; all other tokens are masked
    • Tools are provided via Qwen3's native chat template (tools parameter)
    • 5 epochs, batch size 32 (effective), lr 2e-5
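
The token masking described above can be sketched as follows: every label outside the assistant's tool_call span is set to -100 so the cross-entropy loss ignores it. Token ids and span indices here are illustrative, not from the actual training code.

```python
# Sketch: train only on the tool_call span; mask everything else with -100
# (the ignore index used by PyTorch's cross-entropy loss).
IGNORE_INDEX = -100

def mask_labels(input_ids, tool_call_start, tool_call_end):
    """Return labels where only [tool_call_start, tool_call_end) contributes to the loss."""
    labels = [IGNORE_INDEX] * len(input_ids)
    labels[tool_call_start:tool_call_end] = input_ids[tool_call_start:tool_call_end]
    return labels

# Example: 10 tokens, tool_call span at positions 6..9 (illustrative ids)
ids = list(range(100, 110))
print(mask_labels(ids, 6, 9))
# [-100, -100, -100, -100, -100, -100, 106, 107, 108, -100]
```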

Usage

from modeling_sori_speech import SoriSpeechForConditionalGeneration
from processing_sori_speech import SoriSpeechProcessor
from sori_speech_utils import process_mm_info
import torch

model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B-FC",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
processor = SoriSpeechProcessor.from_pretrained("Seungyoun/Sori-4B-FC")

# Define tools (Qwen3 chat template format)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    },
]

# Build conversation with audio input
conversation = [
    {
        "role": "system",
        "content": "You are a helpful voice assistant that can understand Korean speech and call tools when needed.",
    },
    {
        "role": "user",
        "content": [{"type": "audio", "audio": "weather.mp3"}],
    },
]

# Process
text = processor.apply_chat_template(
    conversation, tools=tools, add_generation_prompt=True, tokenize=False,
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True)
# Move tensors to the model's device; cast floating-point tensors to the model dtype
inputs = {
    k: (v.to(model.device, dtype=model.dtype) if v.is_floating_point() else v.to(model.device))
    if isinstance(v, torch.Tensor)
    else v
    for k, v in inputs.items()
}

# Generate
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

result = processor.decode(output_ids[0], skip_special_tokens=False)
result = result.replace("<|im_end|>", "").replace("<|endoftext|>", "").strip()
print(result)
# <tool_call>
# {"name": "get_weather", "arguments": {"city": "Seoul"}}
# </tool_call>
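
The generated <tool_call> block can be turned into a structured call with a small regex helper. This parser is a sketch, not part of the released code.

```python
import json
import re

def parse_tool_calls(text):
    """Extract the JSON payloads from <tool_call>...</tool_call> blocks."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(m) for m in pattern.findall(text)]

result = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Seoul"}}\n</tool_call>'
calls = parse_tool_calls(result)
print(calls[0]["name"], calls[0]["arguments"])
# get_weather {'city': 'Seoul'}
```

Each parsed entry can then be dispatched to the matching Python function by name.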

Example

  • Audio (Korean speech): "By the way, what's the weather like in Seoul right now?"
  • Generated tool call: get_weather({"city": "Seoul"})

License

Apache 2.0
