PSA: Hard lessons learned with Qwen3.6-35B-A3B

NB this is correct as of 02/05/2026 – things are changing literally from day to day so mileage varies

Problem: vLLM nightly – tool calling broken and loops in self hosted RedHatAI/Qwen3.6-35B-A3B-NVFP4

Broken tool calls

Disable speculative decoding (MTP). There are a few vllm regression bugs that seem to be triggered by streaming, which in turn is triggered by MTP even if the LLM client has streaming off

Looping

Set a higher temperature (1.0)
Manually set the --kv-cache-dtype to fp8
- I usually leave this as auto but this was probably defaulting to bf16, setting it to fp8 seems to have resolved looping, when presence_penalty and repitition_penalty did not

Update

My reasoning about this is the quantization of the kv cache was leading to a “failed successfully” scenario – the errors introduced by kv cache quant forced the model to be less sure of the next token and so avoid “lock in” to a loop. That led me to raise the repetition_penalty higher than I usually do (I’ve never had the reason to go above 1.0) while turning off the presence_penalty, but setting the --kv-cache-dtype back to auto and increasing the repetition_penalty to 1.1 did the trick

⬆️ repetition_penalty: avoid loop repeating the same token or set of tokens
⬇️ presence_penalty: avoid interminable thinking where LLM model spits out random words because it can neither repeat itself nor use the same token it previously used

Current working recepie

			
VLLM_MARLIN_USE_ATOMIC_ADD=1 uv run vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
   --host 0.0.0.0 \
   --port 8000 \
   --max-model-len 262144 \
   --max-num-batched-tokens 16384 \
   --gpu-memory-utilization 0.7 \
   --enable-auto-tool-choice \
   --tool-call-parser qwen3_coder \
   --kv-cache-dtype auto \
   --enable-prefix-caching \
   --enable-chunked-prefill \
   --load-format fastsafetensors \
   --reasoning-parser qwen3 \
   --language-model-only \
   --chat-template chat-templates/qwen36.jinja \
   --default-chat-template-kwargs '{"preserve_thinking": true}' \
   --generation-config auto \
   --override-generation-config '{"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty":0.0, "repetition_penalty":1.1}' \
   --moe_backend flashinfer_cutlass 

		

Update

Current working recepie

Share this:

Related