vosk-model-ja-medical

A Japanese medical speech recognition model based on vosk-model-ja-0.22, adapted with the DMiME (Dictionary of Medical terms in MEdical informatics) Japanese medical dictionary (~42,000 medical terms). Training data was created by synthesizing speech with Azure Speech Service and Google Cloud TTS from medical terms extracted from DMiME.

Overview

This model extends the standard Vosk Japanese model with medical vocabulary, including disease names, drug names, clinical procedures, and anatomical terms. The language model was adapted by SRILM interpolation with a morphologically segmented medical text corpus.

Performance

Evaluated on 500 TTS-synthesized medical utterances (Azure Speech ja-JP-NanamiNeural):

Model	CER	Change (absolute)
vosk-model-ja-0.22 (baseline)	14.17%	—
vosk-model-ja-medical	12.60%	-1.57%

Per utterance: 137 improved, 226 unchanged, 137 degraded (out of 500)
Largest improvement on short medical terms (≤10 characters): -2.0% CER

Example improvements

Reference	Baseline	Medical model
スルトプリド塩酸塩の経過を観察する	するとプリ人塩酸塩の経過を観察する	スルトプリド塩酸塩の経過を観察する
下腿挫滅創	硬い挫滅そう	下腿挫滅創
会陰部裂傷縫合不全	遠因武烈性縫合不全	会陰部裂傷縫合不全
肺エキノコックス症	廃液のコックス生	肺エキノコックス症
アクリノール消毒液です	悪意の居る消毒液です	アクリノール消毒液です
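The evaluation script is not part of this card, so the following is only a minimal sketch of how a character error rate can be computed for a single row of the table above. It assumes CER is the character-level Levenshtein distance divided by the reference length, with spaces removed and no other normalization; the function names `levenshtein` and `cer` are illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two strings, counted in characters."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: spaces are dropped before scoring, because recognizer
    # output may contain spaces between tokens.
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Second row of the table above
print(cer("下腿挫滅創", "硬い挫滅そう"))  # baseline output
print(cer("下腿挫滅創", "下腿挫滅創"))    # medical model output
```

For this row the baseline needs 4 edits against a 5-character reference (CER 0.8), while the medical model output matches the reference exactly (CER 0.0). A corpus-level figure would normally sum edits and reference lengths over all utterances rather than average per-utterance rates.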

Usage

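import json
import wave

from vosk import Model, KaldiRecognizer

# Path to the unpacked model directory
model = Model("vosk-model-ja-medical")

segments = []

# Input should be a mono 16-bit PCM WAV file
with wave.open("audio.wav", "rb") as wf:
    rec = KaldiRecognizer(model, wf.getframerate())

    # Feed the audio to the recognizer in chunks of 4000 frames
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            # An utterance boundary was detected; keep its transcript
            segments.append(json.loads(rec.Result())["text"])

# FinalResult() returns the transcript of the remaining audio as JSON
segments.append(json.loads(rec.FinalResult())["text"])
print(" ".join(s for s in segments if s))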
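Vosk recognizers expect mono 16-bit PCM audio, and a file in another format tends to produce empty or garbled output without an explicit error. The helper below is a small sketch of a pre-flight check that uses only the standard `wave` module; the name `check_wav` is illustrative and not part of the Vosk API.

```python
import wave


def check_wav(path: str) -> int:
    """Return the sample rate if the file is mono 16-bit PCM, else raise."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            raise ValueError(f"expected mono audio, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit samples, got {8 * wf.getsampwidth()}-bit")
        if wf.getcomptype() != "NONE":
            raise ValueError("expected uncompressed PCM")
        return wf.getframerate()
```

The returned sample rate can be passed to `KaldiRecognizer` in place of `wf.getframerate()`.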

Model details

Base model: vosk-model-ja-0.22 (Alphacephei)
Acoustic model: Unchanged (Kaldi nnet3/chain TDNN)
Language model: SRILM 4-gram, Witten-Bell discounting, interpolated with base LM (lambda=0.95)
Vocabulary: ~350,000 words (base 309K + medical 42K, morphologically segmented)
Medical dictionary: DMiME v1.1 (42,467 entries)
Morphological analysis: fugashi (MeCab + UniDic-lite) for compound term segmentation
Medical collocation boost: 5,758 medical suffix patterns (性, 症, 炎, 腫, 癌, 病) repeated for LM emphasis
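To make the interpolation weight concrete, here is a toy sketch of linear interpolation between two word distributions. It is not the SRILM implementation, which interpolates full back-off n-gram models; the probabilities are invented, and the card does not state which of the two models receives the weight 0.95, so treating the base LM as the main model here is an assumption.

```python
def interpolate(p_main: dict, p_mix: dict, lam: float) -> dict:
    """Linear interpolation: lam * p_main + (1 - lam) * p_mix, per word."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p_main) | set(p_mix)
    return {
        w: lam * p_main.get(w, 0.0) + (1.0 - lam) * p_mix.get(w, 0.0)
        for w in vocab
    }


# Toy unigram probabilities, invented for illustration only
base = {"経過": 0.6, "観察": 0.4}
medical = {"経過": 0.3, "観察": 0.2, "挫滅": 0.5}

# Assumption: the base LM is the main model and gets weight 0.95
mixed = interpolate(base, medical, lam=0.95)
for word, prob in sorted(mixed.items()):
    print(word, round(prob, 4))
```

In this toy case a word that is absent from the base distribution (挫滅) ends up with a probability of roughly 0.025 instead of zero, while the common words keep most of their original mass.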

Adaptation method

Parse DMiME terms and segment them morphologically with fugashi
Generate new lexicon entries using a hiragana-to-Kaldi phone mapping
Generate a medical text corpus (197K sentences) from templates, with morphemes separated by spaces
Train a Witten-Bell 4-gram on the medical corpus with SRILM and interpolate it with the base LM (lambda=0.95)
Rebuild HCLG.fst with Kaldi's prepare_lang.sh and mkgraph.sh
Rescore with the full interpolated LM (G.carpa, 1.1GB)
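The template-based corpus step can be sketched roughly as follows. The actual templates and segmentation output are not published in this card, so the templates below are hypothetical (modeled on the example sentences in the Performance section) and the terms are segmented by hand; in the real pipeline the morphemes would come from fugashi.

```python
from itertools import product

# Hypothetical templates; "{term}" is replaced by the segmented term
TEMPLATES = [
    "{term} の 経過 を 観察 する",
    "{term} です",
]


def build_corpus(segmented_terms, templates=TEMPLATES):
    """Yield one space-separated sentence per (term, template) pair."""
    for morphemes, template in product(segmented_terms, templates):
        yield template.format(term=" ".join(morphemes))


# Hand-segmented terms for illustration; real segmentation may differ
terms = [
    ["肺", "エキノコックス", "症"],
    ["アクリノール", "消毒", "液"],
]

for sentence in build_corpus(terms):
    print(sentence)
```

The first line printed is `肺 エキノコックス 症 の 経過 を 観察 する`. Writing one sentence per line with space-separated tokens matches the plain-text input format that n-gram training tools such as SRILM's `ngram-count` read.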

Limitations

Homophones: Words like 性/製/勢 (all "sei") may be confused in non-medical contexts
Long compound terms: Very long medical terms (>10 chars) may still be split into common words
Evaluation: Tested on TTS-synthesized speech only; real clinical speech may differ
GPL2 license: Modifications and redistributions must remain under GPL2

License

This model is licensed under GPL-2.0 (GNU General Public License, version 2), inherited from the DMiME dictionary license.

Base Vosk model: Apache 2.0 (compatible with GPL2 redistribution)
Combined model: GPL-2.0

DMiME is built on top of ORCA's kana-kanji conversion medical dictionary (also GPL2), developed by the Japan Medical Association Research Institute.

Citation

If you use this model, please cite:

@misc{vosk-model-ja-medical,
  title={vosk-model-ja-medical: Japanese Medical Speech Recognition Model},
  author={kenrouse},
  year={2026},
  url={https://huggingface.co/kenrouse/vosk-model-ja-medical}
}

Acknowledgments

Alphacephei / Vosk — Base model and compile package
DMiME (Dictionary of Medical terms in MEdical informatics) — Japanese medical dictionary
Nickolay Shmyrev — Technical guidance and SRILM pointer
