This model consistently produces occasional nonsense sequences.
Thank you for the information. Would you mind sharing the serving command and the evaluation prompts, which we can use to evaluate model quality when producing a new quantized version?
The issue is being tracked here: https://github.com/intel/auto-round/issues/1480
/root/miniconda3/envs/vllm-glm-int4/bin/python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_ID \
    --served-model-name claude-opus-4-6 \
    --port 80 \
    --trust-remote-code \
    --max-model-len 202752 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --max-num-seqs 16
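As a minimal sketch of how one could smoke-test the server started by the command above: the vLLM OpenAI-compatible API accepts standard chat-completion requests, so a short Python script can send a fixed prompt at temperature 0 and inspect the output for nonsense. The base URL, model name, and prompt below are illustrative assumptions, not values from this thread.

```python
import json
import urllib.request


def build_payload(model, prompt, max_tokens=256):
    """Construct an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding so reruns are comparable
    }


def chat_completion(base_url, model, prompt):
    """POST one chat-completion request to an OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (requires the server above to be running; URL is hypothetical):
# print(chat_completion("http://localhost:80", "claude-opus-4-6",
#                       "Summarize what INT4 quantization does in one sentence."))
```

Running the same prompts against the original and the quantized checkpoint makes it easier to spot when a new quantized version starts emitting garbage.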
Could you share some text inputs to reproduce this issue?
It is difficult to reproduce, since I use the model through Claude Code, but cases like that show up often.