At the time of writing, the two main approaches to speculative decoding with Qwen are
- Using layers from the model itself. This is only supported in specifically trained models such as Qwen 3.6, and is the default approach recommended by Qwen. In vLLM:
--speculative-config '{"method":"mtp","num_speculative_tokens": 2}' - Using a “drafter” model. In this case a much smaller model is trained to draft tokens which the main model can accept in parallel. This is the only approach for models not trained from the ground up to use MTP – such as Gemma 4. But Qwen 3.6 can still use this approach. Currently DFlash (very fast drafter models using diffusion) is the preferred approach. In vLLM:
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'
In my testing on a single Nvidia Spark DGX the following emerged:
- When using quantized models such as FP8, DFlash performance takes a nose dive. No speculative tokens past the first 4 (out of 15) get chosen, so it slows down a lot. Using the normal BF16 solves this but at the cost of a lot higher memory usage
- MTP method leads to better performance and behaves better BUT often leads to loops. Presence penalty didnt do much good but repetition penalty set to 1.1 did have a meaningful impact
Tool calling
Tool calling in general works well but MTP somehow results in a slightly higher error rate. Two approaches which worked:
- Using an “enhanced chat template”. The ones from Froggeric (https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/tree/main) are awesome. Just use the latest version as previous versions would cause the model to simply stop generating and sit there
- Use a harness. Tool calling issues were “solved” by having a harness that detects a bad tool call and prompt the model to try again while “reminding” the model the correct format for tool calling