At the time of writing, the two main approaches to speculative decoding with Qwen are:

1. Using layers from the model itself. This is only supported in specifically trained models such as Qwen 3.6, and is the default approach recommended by Qwen. In vLLM:
   --speculative-config '{"method":"mtp","num_speculative_tokens": 2}'
2. Using a "drafter" model. In this case a much smaller … Continue reading Speculative decoding with Qwen 3.6-35B-A3B
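The MTP option above can be sketched as a full serve command. This is a minimal sketch: the --speculative-config flag and its JSON value are taken from the post, while the model name and the assumption that it is passed directly to "vllm serve" are illustrative.

```shell
# Hedged sketch of launching vLLM with MTP-based speculative decoding.
# The speculative-config JSON is as quoted in the post; the model name
# (RedHatAI/Qwen3.6-35B-A3B-NVFP4, mentioned later on this page) is an
# assumption about which checkpoint is being served.
vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --speculative-config '{"method":"mtp","num_speculative_tokens": 2}'
```

To fall back to the second approach, the post implies swapping this config for one that names a separate, much smaller drafter model instead of "mtp".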
PSA: Hard lessons learned with Qwen3.6-35B-A3B
NB: this is correct as of 02/05/2026. Things are changing literally from day to day, so mileage varies.
Problem: vLLM nightly, tool calling broken and looping in self-hosted RedHatAI/Qwen3.6-35B-A3B-NVFP4.
Broken tool calls: disable speculative decoding (MTP). There are a few vLLM regression bugs that seem to be triggered by streaming, which in … Continue reading PSA: Hard lessons learned with Qwen3.6-35B-A3B