NB this is correct as of 02/05/2026 – things are changing literally from day to day so mileage varies
Problem: vLLM nightly – tool calling broken and loops in self hosted RedHatAI/Qwen3.6-35B-A3B-NVFP4
Broken tool calls
- Disable speculative decoding (MTP). There are a few vllm regression bugs that seem to be triggered by streaming, which in turn is triggered by MTP even if the LLM client has streaming off
Looping
- Set a higher temperature (
1.0) - Manually set the
--kv-cache-dtypetofp8- I usually leave this as
autobut this was probably defaulting tobf16, setting it tofp8seems to have resolved looping, whenpresence_penaltyandrepitition_penaltydid not
- I usually leave this as
Update
My reasoning about this is the quantization of the kv cache was leading to a “failed successfully” scenario – the errors introduced by kv cache quant forced the model to be less sure of the next token and so avoid “lock in” to a loop. That led me to raise the repetition_penalty higher than I usually do (I’ve never had the reason to go above 1.0) while turning off the presence_penalty, but setting the --kv-cache-dtype back to auto and increasing the repetition_penalty to 1.1 did the trick
⬆️ repetition_penalty: avoid loop repeating the same token or set of tokens
⬇️ presence_penalty: avoid interminable thinking where LLM model spits out random words because it can neither repeat itself nor use the same token it previously used
Current working recepie
VLLM_MARLIN_USE_ATOMIC_ADD=1 uv run vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \ --host 0.0.0.0 \ --port 8000 \ --max-model-len 262144 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.7 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --kv-cache-dtype auto \ --enable-prefix-caching \ --enable-chunked-prefill \ --load-format fastsafetensors \ --reasoning-parser qwen3 \ --language-model-only \ --chat-template chat-templates/qwen36.jinja \ --default-chat-template-kwargs '{"preserve_thinking": true}' \ --generation-config auto \ --override-generation-config '{"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty":0.0, "repetition_penalty":1.1}' \ --moe_backend flashinfer_cutlass