chatgpt

At the time of writing, the two main approaches to speculative decoding with Qwen are:

1. Using layers from the model itself. This is only supported in specifically trained models such as Qwen 3.6, and is the default approach recommended by Qwen. In vLLM: --speculative-config '{"method":"mtp","num_speculative_tokens": 2}'
2. Using a "drafter" model. In this case a much smaller …

Continue reading Speculative decoding with Qwen 3.6-35B-A3B
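
The vLLM flag quoted above takes a JSON object, which is easy to get wrong when quoting it by hand in a shell. As a rough sketch only, the Python helper below builds the same `--speculative-config` value with `json.dumps` and prints a shell-safe command line. The function name `build_vllm_command` and the model id `Qwen/Qwen3.6-35B-A3B` are assumptions made for illustration and do not come from the post; only the `method` and `num_speculative_tokens` keys are taken from it.

```python
from __future__ import annotations

import json
import shlex


def build_vllm_command(model: str, num_speculative_tokens: int = 2) -> list[str]:
    """Build an argv list for `vllm serve` with the MTP speculative config.

    `build_vllm_command` is a hypothetical helper, not part of vLLM.
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")

    # Same keys as the flag quoted in the post.
    speculative_config = {
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm",
        "serve",
        model,
        "--speculative-config",
        json.dumps(speculative_config),
    ]


if __name__ == "__main__":
    # Assumed model id, used only as a placeholder; substitute your own.
    argv = build_vllm_command("Qwen/Qwen3.6-35B-A3B")
    # shlex.join quotes the JSON so it can be pasted into a POSIX shell.
    print(shlex.join(argv))
```

Generating the JSON programmatically avoids mismatched quotes, and keeping the result as an argv list means it can also be passed directly to `subprocess.run` without going through a shell.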
