S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
Authors: Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava
Date: 2026-03-26
Paper ID: arxiv:2603.25702
Summary
Block-diffusion language models offer fast generation, but their accuracy is brittle in the few-step decoding regime. This paper introduces S2D2, a training-free self-speculative decoding framework that improves the accuracy-speed tradeoff for these models. S2D2 exploits the observation that a block-diffusion model becomes fully autoregressive when the block size is one, so the same pretrained model can serve as both a parallel proposer and an autoregressive verifier. The framework inserts a speculative verification step into standard block-diffusion decoding, and lightweight routing policies decide when that step is applied. S2D2 consistently outperforms standard confidence-thresholding baselines, reaching speedups of up to 4.7x over autoregressive decoding while maintaining or improving generation quality on block-diffusion models such as SDAR and LLaDA2.1-Mini.
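The block-size-one observation can be illustrated with attention masks. The sketch below assumes the common block-causal pattern (bidirectional attention inside a block, causal attention across blocks); the function names are hypothetical and this is not taken from the paper's code. Under that assumption, a block size of one reduces to the ordinary causal mask of an autoregressive model.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Assumed pattern: a position sees every position in its own block
    and in all earlier blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len: int) -> List[List[bool]]:
    """Standard autoregressive mask: position i sees positions 0..i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # With block_size=1 every block holds a single token, so the
    # block-causal mask coincides with the causal mask.
    print(block_causal_mask(6, 1) == causal_mask(6))
    # With block_size=2, position 0 can see position 1 (same block).
    print(block_causal_mask(4, 2)[0][1])
```

Larger block sizes let the model fill several positions in parallel, which is the proposer role; block size one gives the left-to-right view used for verification.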
Key Contributions
- Introduction of S2D2, a training-free self-speculative decoding framework for block-diffusion language models that leverages the model’s autoregressive behavior at block size one for verification.
- Development of lightweight routing policies to dynamically decide when the speculative verification step is computationally worthwhile, leading to a hybrid decoding trajectory.
- Demonstration of significant accuracy-speed tradeoff improvements across three mainstream block-diffusion families compared to confidence-thresholding baselines.
- Speedups of up to 4.7x over autoregressive decoding and 1.57x over a dynamic decoding baseline on SDAR, with accuracy gains of up to 4.5 points.
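To make the routing idea concrete, here is a minimal sketch of one possible cost-benefit rule. It is a hypothetical heuristic, not one of the paper's policies: the function names, the `threshold` and `verify_cost` parameters, and the assumption that each drafted token is accepted independently with probability equal to its confidence are all illustrative.

```python
from typing import Sequence


def expected_accepted_prefix(confidences: Sequence[float]) -> float:
    """Expected accepted prefix length when token i is accepted
    independently with probability confidences[i] and verification
    stops at the first rejection (an illustrative assumption)."""
    expected, survive = 0.0, 1.0
    for c in confidences:
        survive *= c          # probability that tokens 0..i all pass
        expected += survive   # each surviving position adds one token
    return expected


def should_verify(
    confidences: Sequence[float],
    threshold: float = 0.9,    # hypothetical confidence threshold
    verify_cost: float = 1.0,  # hypothetical cost, in token units
) -> bool:
    """Route to verification only if its expected yield, net of cost,
    beats what plain confidence thresholding would commit."""
    committed = sum(1 for c in confidences if c >= threshold)
    return expected_accepted_prefix(confidences) - verify_cost > committed


if __name__ == "__main__":
    # Moderately confident everywhere: thresholding commits nothing,
    # so verification looks worthwhile under this rule.
    print(should_verify([0.8, 0.8, 0.8, 0.8]))
    # Mostly very confident: thresholding already commits three tokens.
    print(should_verify([0.95, 0.95, 0.2, 0.95]))
```

Alternating between the two branches from step to step is what produces a hybrid trajectory: some steps commit tokens by thresholding, others by verification.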
Limitations
The effectiveness of the lightweight routing policies and the overall speedup depend heavily on the quality of the proposals generated by the block-diffusion process and the tuning of the verification cost-benefit trade-off.
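The dependence on proposal quality can be seen in the standard speculative-sampling acceptance rule, sketched below. This is the generic rule from the speculative decoding literature (accept a drafted token with probability min(1, p/q)), used here as an assumption; the paper's exact acceptance criterion may differ. The function name and arguments are hypothetical, and the resampling step that normally follows a rejection is omitted.

```python
import random
from typing import Sequence


def accepted_prefix_length(
    draft_probs: Sequence[float],     # q: proposer probability of each drafted token
    verifier_probs: Sequence[float],  # p: verifier probability of the same token
    rng: random.Random,
) -> int:
    """Count how many drafted tokens are accepted, left to right,
    stopping at the first rejection."""
    accepted = 0
    for q, p in zip(draft_probs, verifier_probs):
        if q <= 0.0:
            raise ValueError("a drafted token must have positive proposer probability")
        # Accept with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    # Verifier agrees at least as strongly as the proposer: all accepted.
    print(accepted_prefix_length([0.5, 0.5, 0.5], [0.9, 0.6, 0.5], rng))
    # Verifier assigns zero probability to the second token: stop there.
    print(accepted_prefix_length([0.9, 0.9, 0.9], [0.9, 0.0, 0.9], rng))
```

When the proposer is overconfident in tokens the verifier finds unlikely, the accepted prefix shrinks and the verification pass buys little, which is the trade-off that routing has to manage.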
Open Questions & Future Work
- s2d2-token-acceptance-estimators
- s2d2-advanced-routing-policies
- s2d2-ratio-tempering-optimization
- s2d2-causal-drafting-mask-analysis
Key Concepts
- S2D2: A training-free self-speculative decoding framework that integrates speculative verification into block-diffusion language model decoding using lightweight routing policies.
Metadata & Links
- url: https://arxiv.org/abs/2603.25702
- paper_id: 2603.25702
- paper_source: arxiv
- domain: nlp
- tags: llm, autoregressive, decoding, speculative-decoding, diffusion-model, language-model, efficiency
- architectures: decoder-only
- datasets: SDAR, LLaDA2.1-Mini
- skill: GeneralMLSkill
- created_at: 2026-03-27T09:10:03Z