PSA: Hard lessons learned with Qwen3.6-35B-A3B

NB this is correct as of 02/05/2026 – things are changing literally from day to day so mileage varies

Problem: vLLM nightly – tool calling broken and loops in self hosted RedHatAI/Qwen3.6-35B-A3B-NVFP4

Broken tool calls

  • Disable speculative decoding (MTP). There are a few vllm regression bugs that seem to be triggered by streaming, which in turn is triggered by MTP even if the LLM client has streaming off

Looping

  • Set a higher temperature (1.0)
  • Manually set the --kv-cache-dtype to fp8
    • I usually leave this as auto but this was probably defaulting to bf16, setting it to fp8 seems to have resolved looping, when presence_penalty and repitition_penalty did not

Update

My reasoning about this is the quantization of the kv cache was leading to a “failed successfully” scenario – the errors introduced by kv cache quant forced the model to be less sure of the next token and so avoid “lock in” to a loop. That led me to raise the repetition_penalty higher than I usually do (I’ve never had the reason to go above 1.0) while turning off the presence_penalty, but setting the --kv-cache-dtype back to auto and increasing the repetition_penalty to 1.1 did the trick

⬆️ repetition_penalty: avoid loop repeating the same token or set of tokens
⬇️ presence_penalty: avoid interminable thinking where LLM model spits out random words because it can neither repeat itself nor use the same token it previously used

Current working recepie

VLLM_MARLIN_USE_ATOMIC_ADD=1 uv run vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 262144 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.7 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--kv-cache-dtype auto \
--enable-prefix-caching \
--enable-chunked-prefill \
--load-format fastsafetensors \
--reasoning-parser qwen3 \
--language-model-only \
--chat-template chat-templates/qwen36.jinja \
--default-chat-template-kwargs '{"preserve_thinking": true}' \
--generation-config auto \
--override-generation-config '{"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty":0.0, "repetition_penalty":1.1}' \
--moe_backend flashinfer_cutlass