KoboldCpp Transcribe API not working

#2
by henryclw - opened

Hi @concedo , thank you for providing this GGUF file, as you are the only one providing the GLM ASR Nano GGUF file.

In the model card description, you said we could use this GGUF file with KoboldCpp 1.104 and above. I tried, but ran into some difficulties accessing the transcribe API.

Here are more details:

I run KoboldCpp with Docker Compose:

services:
  koboldcpp:
    container_name: koboldcpp
    image: koboldai/koboldcpp:latest
    volumes:
      - /path_to_model_files/:/workspace/:rw
    deploy: # You can remove this section if you do not wish to use an Nvidia GPU
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]
    environment:
      - KCPP_DONT_UPDATE=true
      - KCPP_DONT_TUNNEL=true
      - KCPP_ARGS=--model GLM-ASR-Nano-1.6B-2512-Q8_0.gguf --mmproj mmproj-GLM-ASR-Nano-2512-Q8_0.gguf --gpulayers 99
    ports:
      - "5001:5001"
    restart: unless-stopped

The KoboldCpp starts with no obvious issue, here are the logs:

Update check skipped
***
Welcome to KoboldCpp - Version 1.104
Loading Chat Completions Adapter: /tmp/_MEIz6Ojhs/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
No GPU or CPU backend was selected. Trying to assign one for you automatically...
Auto Selected CUDA Backend (flag=0)

System: Linux #1 SMP PREEMPT_DYNAMIC Fri, 19 Dec 2025 01:23:45 +0000 x86_64 x86_64
Detected Available GPU Memory: 24576 MB
Detected Available RAM: 104038 MB
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(admin=False, admindir='', adminpassword=None, analyze='', autofit=False, batchsize=512, benchmark=None, blasthreads=0, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=8192, debugmode=0, defaultgenamt=896, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel='', embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, genlimit=0, gpulayers=99, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=False, jinja_tools=False, launch=False, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mmproj='mmproj-GLM-ASR-Nano-2512-Q8_0.gguf', mmprojcpu=False, model=['GLM-ASR-Nano-1.6B-2512-Q8_0.gguf'], model_param='GLM-ASR-Nano-1.6B-2512-Q8_0.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv='', overridenativecontext=0, overridetensors='', password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory='', prompt='', quantkv=0, quiet=True, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile='', sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults='', sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=0, sdtiledvae=768, sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=False, smartcontext=False, ssl=None, tensor_split=None, testmemory=False, threads=15, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, 
usecpu=False, usecuda=['normal', 'mmq'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: /workspace/GLM-ASR-Nano-1.6B-2512-Q8_0.gguf
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:2b:00.0) - 23862 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 255 tensors from /workspace/GLM-ASR-Nano-1.6B-2512-Q8_0.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 1.58 GiB (8.50 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 59246 ('<|endoftext|>')
load:   - 59253 ('<|user|>')
load: special tokens cache size = 17
load: token to piece cache size = 0.3378 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 1.59 B
print_info: general.name     = GLM ASR Nano 2512
print_info: vocab type       = BPE
print_info: n_vocab          = 59264
print_info: n_merges         = 106026
print_info: BOS token        = 59246 '<|endoftext|>'
print_info: EOS token        = 59246 '<|endoftext|>'
print_info: EOT token        = 59253 '<|user|>'
print_info: UNK token        = 59246 '<|endoftext|>'
print_info: PAD token        = 59246 '<|endoftext|>'
print_info: LF token         = 10 'Ċ'
print_info: EOG token        = 59246 '<|endoftext|>'
print_info: EOG token        = 59253 '<|user|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 0 of 255
load_tensors: offloading output layer to GPU
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        CUDA0 model buffer size =  1491.93 MiB
load_tensors:    CUDA_Host model buffer size =   122.98 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8448
llama_context: n_ctx_seq     = 8448
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8448) > n_ctx_train (8192) -- possible training context overflow
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.23 MiB
llama_kv_cache:      CUDA0 KV buffer size =   462.00 MiB
llama_kv_cache: size =  462.00 MiB (  8448 cells,  28 layers,  1/1 seqs), K (f16):  231.00 MiB, V (f16):  231.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2040
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context:      CUDA0 compute buffer size =   298.51 MiB
llama_context:  CUDA_Host compute buffer size =    22.51 MiB
llama_context: graph nodes  = 1014
llama_context: graph splits = 2
attach_threadpool: call
clip_model_loader: model name:   GLM ASR Nano 2512
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    493
clip_model_loader: n_kv:         22

clip_model_loader: has audio encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          glma
load_hparams: n_embd:             1280
load_hparams: n_head:             20
load_hparams: n_ff:               5120
load_hparams: n_layer:            32
load_hparams: ffn_op:             gelu_erf
load_hparams: projection_dim:     2048

--- audio hparams ---
load_hparams: n_mel_bins:         128
load_hparams: proj_stack_factor:  4
load_hparams: audio_chunk_len:    30
load_hparams: audio_sample_rate:  16000
load_hparams: audio_n_fft:        400
load_hparams: audio_window_len:   400
load_hparams: audio_hop_len:      160

load_hparams: model size:         686.82 MiB
load_hparams: metadata size:      0.20 MiB
load_tensors: loaded 493 tensors from /workspace/mmproj-GLM-ASR-Nano-2512-Q8_0.gguf
Load Text Model OK: True
Chat completion heuristic: Phi 3.5
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
======
Active Modules: TextGeneration MultimodalAudio
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
======
Please connect to custom endpoint at http://localhost:5001

However, when I access the web UI, the voice input button is disabled. And when I try the /api/extra/transcribe API, it always returns HTTP 200 with this response body:

{
  "text": ""
}
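For reference, here is roughly how I call the endpoint. This is a minimal sketch; the `audio_data` base64-WAV field name is my understanding of KoboldCpp's Whisper-style transcribe API, so it may need checking against the embedded API docs for your version.

```python
import base64
import json
import urllib.request

def build_transcribe_payload(wav_path: str) -> dict:
    # KoboldCpp's /api/extra/transcribe is assumed here to take the audio
    # as a base64-encoded WAV string in an "audio_data" field (verify the
    # field name against the embedded API docs of your KoboldCpp version).
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"audio_data": audio_b64}

def transcribe(wav_path: str,
               url: str = "http://localhost:5001/api/extra/transcribe") -> str:
    # POST the JSON payload and return the transcribed text.
    payload = json.dumps(build_transcribe_payload(wav_path)).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

Whatever WAV file I pass in, the `"text"` field in the response comes back empty.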

May I ask if there is any plan to let this GLM-ASR-Nano-2512 model work like a Whisper model? Or is there some fundamental difference between the two, such that GLM-ASR-Nano-2512 does not work straightforwardly as an STT model?

Owner

Hi, can you try again with v1.105.2?
