Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Published 12 Mar 2025 in cs.LG and cs.AI | (2503.09573v3)

Abstract: Diffusion LLMs offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion LLMs that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms


Authors (8)

Summary

The paper introduces BD3-LMs that integrate autoregressive block-level generation with discrete denoising diffusion, enabling flexible-length text synthesis.
It trains with either a two-pass scheme or a vectorized single-pass variant built on custom attention masks; the vectorized variant trains 20–25% faster than the two-pass scheme, and KV caching speeds up inference.
Experimental results demonstrate improved perplexity on benchmarks like LM1B and OpenWebText, with enhanced sample quality and accelerated inference.

This paper introduces Block Discrete Denoising Diffusion Language Models (BD3-LMs), a class of models that interpolates between autoregressive (AR) and discrete denoising diffusion models for language generation. The motivation is to overcome key limitations of both paradigms: diffusion models often lag in likelihood modeling, are restricted to fixed-length generation, and cannot use efficient inference mechanisms like KV caching, while AR models generate tokens one at a time, which limits speed.

BD3-LMs operate by being autoregressive over blocks of tokens while performing discrete denoising diffusion within each block. This hybrid approach allows for flexible-length generation and improves inference efficiency through KV caching and parallel token sampling within blocks.

Core Concepts and Implementation

1. Model Architecture and Likelihood:

A sequence of $L$ tokens $\mathbf{x}$ is divided into $B$ blocks, each of length $L'$, so $L = B \cdot L'$.
The log-likelihood is factorized autoregressively over these blocks:

$\log p_\theta(\mathbf{x}) = \sum_{b = 1}^{B} \log p_\theta(\mathbf{x}^{b} \mid \mathbf{x}^{<b})$
Each conditional probability $p_\theta(\mathbf{x}^{b} \mid \mathbf{x}^{<b})$ is modeled by a discrete diffusion process specific to block $b$, conditioned on previously generated blocks $\mathbf{x}^{<b}$.
A single transformer neural network $f_\theta$ parameterizes the base denoiser for all blocks. It uses a block-causal attention mask, where tokens in block $b$ attend to other tokens within the (potentially noised) block $\mathbf{x}_t^{b}$ and all clean tokens in preceding blocks $\mathbf{x}^{<b}$.
The model supports KV caching:

$\mathbf{x}_{\text{logits}}^{b}, \mathbf{K}^{b}, \mathbf{V}^{b} \leftarrow f_\theta(\mathbf{x}_t^{b}, \mathbf{K}^{1:b-1}, \mathbf{V}^{1:b-1})$

where $\mathbf{x}_{\text{logits}}^{b}$ are the predicted logits for block $b$, $\mathbf{x}_t^{b}$ is the noised version of block $b$ at timestep $t$, and $\mathbf{K}^{1:b-1}, \mathbf{V}^{1:b-1}$ are cached keys and values from previous blocks.
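
The block-causal attention pattern described above can be illustrated with a minimal sketch. The function name `block_causal_mask` and the nested-list representation are illustrative assumptions, not the paper's implementation; a real model would build an equivalent tensor for its attention kernel.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when query position i may attend to key position j.

    A position sees every token in its own block and in all earlier blocks.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    block_of = [pos // block_size for pos in range(seq_len)]
    return [
        [block_of[key] <= block_of[query] for key in range(seq_len)]
        for query in range(seq_len)
    ]


# Six tokens split into three blocks of two tokens each.
mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
# 11....
# 11....
# 1111..
# 1111..
# 111111
# 111111
```

With `block_size=1` this reduces to the usual causal mask of an AR model, and with `block_size=seq_len` every token attends to every other token, as in a standard diffusion model.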

2. Training Objective:

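The training objective is derived by applying the Negative ELBO (NELBO) to each block-conditional term:

$-\log p_\theta(\mathbf{x}) \leq \mathcal{L}_{\text{BD}}(\mathbf{x}; \theta) := \sum_{b = 1}^{B} \mathcal{L}(\mathbf{x}^{b}, \mathbf{x}^{<b}; \theta)$

where $\mathcal{L}(\mathbf{x}^{b}, \mathbf{x}^{<b}; \theta)$ is the standard diffusion NELBO for block $b$ conditioned on $\mathbf{x}^{<b}$.
For masked BD3-LMs (using a masking noise process), a simplified objective is adopted:

$-\log p_\theta(\mathbf{x}) \leq \mathcal{L}_{\text{BD}}(\mathbf{x}; \theta) := \sum_{b = 1}^{B} \mathbb{E}_{t \sim \mathcal{U}[0, 1]} \mathbb{E}_{q} \frac{\alpha_t'}{1 - \alpha_t} \log p_\theta(\mathbf{x}^{b} \mid \mathbf{x}_t^{b}, \mathbf{x}^{<b})$

where $\alpha_t$ defines the noise schedule (probability of a token not being masked at time $t$), and $\alpha_t'$ is its derivative.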
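
To make the weighting in the masked objective concrete, here is a one-sample estimate for a single block. It is a sketch under stated assumptions: a linear schedule $\alpha_t = 1 - t$ (so the weight $\alpha_t' / (1 - \alpha_t)$ equals $-1/t$), and hypothetical per-token log-probabilities standing in for the model output. The names `alpha`, `alpha_prime`, and `masked_block_loss` are illustrative only.

```python
import random


def alpha(t):
    """Assumed linear schedule: probability that a token is NOT masked at time t."""
    return 1.0 - t


def alpha_prime(t):
    """Derivative of the assumed linear schedule with respect to t."""
    return -1.0


def masked_block_loss(token_logprobs, t, rng):
    """One-sample estimate of the masked objective for a single block.

    token_logprobs holds log p_theta of the true token at each position
    (hypothetical values; a real model would recompute them for each masking).
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    mask_rate = 1.0 - alpha(t)
    weight = alpha_prime(t) / mask_rate  # negative, so the loss is non-negative
    is_masked = [rng.random() < mask_rate for _ in token_logprobs]
    # Only masked positions contribute to the loss.
    return weight * sum(lp for lp, m in zip(token_logprobs, is_masked) if m)


rng = random.Random(0)
logprobs = [-0.5, -1.0, -2.0, -0.25]
print(masked_block_loss(logprobs, t=1.0, rng=rng))  # 3.75 (every token masked)
```

At $t = 1$ every token is masked and the estimate equals the plain negative log-likelihood of the block. At small $t$ few tokens are masked but each is weighted by $1/t$, which is one way to see why the choice of noise schedule affects the variance of the estimator.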

3. Efficient Training Algorithm (Algorithm 1):

Naively computing the loss would require $B$ separate forward passes, one per block, because denoising block $b$ uses a noised $\mathbf{x}_t^{b}$ while conditioning on clean previous blocks $\mathbf{x}^{<b}$.
Two-Pass Approach:

1. First Pass (KV Cache Precomputation): Compute keys and values $\mathbf{K}^{1:B}, \mathbf{V}^{1:B}$ for the entire clean sequence $\mathbf{x}$ in one forward pass: $\emptyset, \mathbf{K}^{1:B}, \mathbf{V}^{1:B} \leftarrow f_\theta(\mathbf{x})$ (the logits from this pass are discarded).
2. Second Pass (Denoising): For each block $b$, sample noise levels $t_b$ and create noised blocks $\mathbf{x}_{t_b}^{b}$. Compute denoised predictions for all blocks simultaneously using the precomputed KV cache: $\mathbf{x}_{\text{logits}}^{b} \leftarrow f_\theta(\mathbf{x}_{t_b}^{b}, \mathbf{K}^{1:b-1}, \mathbf{V}^{1:b-1})$ for all $b$.

Vectorized Single-Pass Training: An even more efficient method concatenates the noisy data $\mathbf{x}_{\text{noisy}} = \mathbf{x}_{t_1}^{1} \oplus \dots \oplus \mathbf{x}_{t_B}^{B}$ and clean data $\mathbf{x}$ into a single input sequence of length $2L$. A custom attention mask (detailed in the paper's appendix) is designed so that noisy tokens attend to other noisy tokens in their block and to clean tokens in preceding blocks. This leverages efficient attention kernels such as FlashAttention or PyTorch's FlexAttention (also covered in the appendix), yielding a 20-25% training speed-up over the two-pass approach.
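
The attention pattern for vectorized training can be sketched as a mask over the concatenated input, with the noisy tokens in the first half and the clean tokens in the second half. The rule for noisy queries follows the description above. The rule for clean queries (block-causal attention over clean tokens only) is an assumption of this sketch, chosen so that the clean keys and values match what a KV cache would hold at inference time. The function name is hypothetical.

```python
def vectorized_training_mask(seq_len, block_size):
    """Mask over [noisy tokens | clean tokens]; mask[q][k] == True means q attends to k."""
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")

    def block_of(pos):
        # Positions seq_len..2*seq_len-1 are the clean copies of positions 0..seq_len-1.
        return (pos % seq_len) // block_size

    total = 2 * seq_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            q_noisy, k_noisy = q < seq_len, k < seq_len
            if q_noisy and k_noisy:
                # Noisy tokens see the other noisy tokens of their own block.
                mask[q][k] = block_of(q) == block_of(k)
            elif q_noisy and not k_noisy:
                # Noisy tokens see clean tokens of strictly earlier blocks.
                mask[q][k] = block_of(k) < block_of(q)
            elif not q_noisy and not k_noisy:
                # Assumption: clean tokens attend block-causally to clean tokens.
                mask[q][k] = block_of(k) <= block_of(q)
            # Clean queries never attend to noisy keys (entry stays False).
    return mask


for row in vectorized_training_mask(seq_len=4, block_size=2):
    print("".join("1" if allowed else "." for allowed in row))
```

Because the mask is block-sparse, kernels that can skip fully masked tiles avoid much of the cost of the doubled sequence length.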

4. Efficient Sampling Algorithm (Algorithm 2):

Blocks are generated sequentially.
For each block $b = 1, \dots, B$:
1. Sample the clean block $\mathbf{x}^{b}$ using a diffusion sampling procedure (e.g., D3PM sampler) conditioned on previously generated clean blocks $\mathbf{x}^{<b}$ (via their cached keys and values $\mathbf{K}^{1:b-1}, \mathbf{V}^{1:b-1}$). This step involves multiple denoising steps within the block.
2. Compute and cache keys and values for the newly sampled block $\mathbf{x}^{b}$: $\emptyset, \mathbf{K}^{b}, \mathbf{V}^{b} \leftarrow f_\theta(\mathbf{x}^{b}, \mathbf{K}^{1:b-1}, \mathbf{V}^{1:b-1})$.
3. Append $\mathbf{x}^{b}$ to the generated sequence and update the overall KV cache.
This allows for arbitrary-length sequence generation and benefits from parallel generation within each block.
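
The control flow of block-by-block sampling can be sketched as follows. Everything model-related here is a stand-in: `toy_denoiser` returns random tokens instead of running a transformer, the "cache" is a plain list of finished blocks rather than keys and values, and the unmasking rule (reveal an equal share of the remaining masked positions at each step) is one simple choice, not necessarily the sampler used in the paper.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token


def toy_denoiser(noisy_block, cache, vocab_size, rng):
    """Stand-in for the denoiser: proposes a token for every position.

    A real model would attend to the cached keys and values; this toy ignores them.
    """
    return [rng.randrange(vocab_size) for _ in noisy_block]


def sample_block(block_size, num_steps, cache, vocab_size, rng):
    """Start from a fully masked block and unmask it over num_steps denoising steps."""
    block = [MASK] * block_size
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        proposal = toy_denoiser(block, cache, vocab_size, rng)
        # Reveal an equal share of what is left, so the last step reveals everything.
        n_unmask = math.ceil(len(masked) / (num_steps - step))
        for i in rng.sample(masked, n_unmask):
            block[i] = proposal[i]
    return block


def generate(num_blocks, block_size, num_steps, vocab_size=100, seed=0):
    """Generate blocks one after another, growing the cache as each block finishes."""
    rng = random.Random(seed)
    sequence, cache = [], []
    for _ in range(num_blocks):
        block = sample_block(block_size, num_steps, cache, vocab_size, rng)
        cache.append(list(block))  # stand-in for storing the block's keys and values
        sequence.extend(block)
    return sequence


tokens = generate(num_blocks=3, block_size=4, num_steps=2)
print(len(tokens))  # 12
```

Since `num_blocks` is only a loop bound, the same loop could instead run until a stop condition such as an end-of-sequence token, which is what makes the output length flexible.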


5. Addressing Gradient Variance and Improving Performance:

A key finding is that the perplexity gap between diffusion models and AR models can be attributed to high variance in the gradients of the diffusion objective during training.
Case Study ($L' = 1$): When block size is 1, BD3-LM is theoretically equivalent to AR. However, standard masked diffusion (masking ~50% of tokens on average) results in higher perplexity than AR. This is because the diffusion objective effectively trains on fewer tokens per step. By using a "full masking" schedule (every token is masked), the BD3-LM ($L' = 1$) matches AR performance, and gradient variance is significantly reduced.
Clipped Noise Schedules: To minimize gradient variance for general block sizes $L'$, the paper proposes "clipped" noise schedules where mask rates ($1 - \alpha_t$) are sampled uniformly from a sub-interval $[\beta, \omega]$ instead of $[0, 1]$. This avoids extreme masking rates (very few or very many masks), which provide poor learning signals and lead to high-variance gradients.
Data-Driven Schedule Optimization: The optimal $\beta$ and $\omega$ are found to be block-size dependent. They are learned adaptively during training by performing a grid search at regular intervals to find values that minimize the variance of the NELBO estimator (used as a proxy for gradient variance):

$\min_{\beta, \omega} \operatorname{Var}_{\mathbf{x}, t}\left[\mathcal{L}(\mathbf{x}; \theta, \beta, \omega)\right]$
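
The grid search over clipping bounds can be sketched as below. The lower and upper bounds of the mask-rate interval are called `beta` and `omega` here, and `loss_fn` is a hypothetical stand-in for a per-batch NELBO estimate; the toy loss used in the example (inversely proportional to the mask rate) is chosen only so the script has something to minimize and says nothing about real models.

```python
import random
import statistics


def sample_mask_rate(beta, omega, rng):
    """Clipped schedule: draw the mask rate uniformly from [beta, omega]."""
    return rng.uniform(beta, omega)


def loss_variance(loss_fn, beta, omega, num_samples, rng):
    """Empirical variance of the loss when mask rates come from [beta, omega]."""
    losses = [loss_fn(sample_mask_rate(beta, omega, rng)) for _ in range(num_samples)]
    return statistics.pvariance(losses)


def grid_search_clipping(loss_fn, candidates, num_samples=256, seed=0):
    """Return the (beta, omega) pair with the lowest estimated loss variance."""
    rng = random.Random(seed)
    best_pair, best_var = None, float("inf")
    for beta, omega in candidates:
        var = loss_variance(loss_fn, beta, omega, num_samples, rng)
        if var < best_var:
            best_pair, best_var = (beta, omega), var
    return best_pair


def toy_loss(mask_rate):
    """Hypothetical loss that blows up at small mask rates (illustration only)."""
    return 1.0 / max(mask_rate, 1e-3)


candidates = [(0.0, 1.0), (0.3, 0.8), (0.45, 0.95)]
print(grid_search_clipping(toy_loss, candidates))
```

In an actual training run the variance would be estimated from real batches at regular intervals, and the selected interval would replace the current one until the next search.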

Experimental Results and Practical Implications

State-of-the-Art Perplexity: BD3-LMs achieve new state-of-the-art perplexities among discrete diffusion models on LM1B and OpenWebText benchmarks, significantly closing the gap to AR models. For example, on LM1B, the best BD3-LM configuration reaches a lower perplexity than MDLM.
Variable-Length Generation: BD3-LMs can generate sequences much longer than their training context (e.g., up to ~10x longer than fixed-length diffusion models like SEDD on OWT).
Improved Sample Quality: BD3-LMs achieve better generative perplexity (Gen. PPL, evaluated by GPT2-Large) than prior diffusion methods such as SEDD, MDLM, and SSD-LM, often with an order of magnitude fewer network function evaluations (NFEs) than SSD-LM.
- In one reported setting, BD3-LM achieves a Gen. PPL of 23.6 with 2K NFEs, while SSD-LM (at comparable NFEs) gets 281.9 and MDLM gets 41.3.
Efficiency of Clipped Schedules: Data-driven clipped noise schedules are shown to reduce training variance and improve test perplexity compared to standard linear or other common schedules. The optimal clipping range varies with block size (e.g., heavier masking for smaller $L'$).
Computational Cost: Training BD3-LMs is inherently more expensive than standard diffusion because of the block-autoregressive structure, which requires either multiple passes or a longer effective sequence. The proposed vectorized training algorithm keeps this overhead manageable (less than 2x the cost of standard diffusion). Pre-training with a standard diffusion loss before fine-tuning with the block diffusion objective can further reduce costs.

Implementation Considerations

Computational Requirements: Training requires careful management of memory and computation, especially with the vectorized approach (concatenating sequences). Efficient attention kernels (FlashAttention, FlexAttention) are crucial.
Choosing Block Size ($L'$): The optimal block size is task-dependent. Smaller $L'$ approaches AR behavior (more sequential steps, potentially better perplexity). Larger $L'$ increases parallelism but might make learning harder or loosen the NELBO bound. In the experiments, small block sizes often give the best perplexity.
Noise Schedule Tuning: Implementing the data-driven clipped schedule optimization requires periodic evaluation of NELBO variance for different $[\beta, \omega]$ ranges. This adds some overhead but is shown to be beneficial.
KV Cache Implementation: Standard transformer KV caching mechanisms can be adapted. The key is to correctly pass and update the cache across block generation steps during sampling, and to use it appropriately during the second pass or vectorized pass of training.
Deployment: For inference, the block-sequential generation means latency will be higher than fully parallel diffusion models but potentially lower than token-by-token AR models if $L'$ is large enough and intra-block parallelism is exploited.
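
The latency trade-off in the deployment note can be made concrete by counting sequential forward passes. This is a rough proxy under stated assumptions: each block takes a fixed number of denoising passes plus one extra pass to compute its keys and values, and every pass is treated as equally expensive, which is not true in practice. The function names are illustrative.

```python
def sequential_passes_ar(seq_len):
    """Autoregressive decoding: one forward pass per generated token."""
    return seq_len


def sequential_passes_block_diffusion(seq_len, block_size, denoise_steps):
    """Block diffusion: per block, denoise_steps denoising passes plus one pass
    to cache the finished block (an assumption of this sketch; an
    implementation could fuse or skip that pass)."""
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    num_blocks = seq_len // block_size
    return num_blocks * (denoise_steps + 1)


print(sequential_passes_ar(1024))                       # 1024
print(sequential_passes_block_diffusion(1024, 16, 4))   # 320
print(sequential_passes_block_diffusion(1024, 16, 16))  # 1088
```

Under these assumptions the block model only needs fewer sequential passes than AR when it unmasks several tokens per denoising step; with one token per step it needs slightly more because of the caching pass.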

In summary, BD3-LMs offer a practical framework for building high-quality, flexible-length LLMs that combine strengths from AR and diffusion paradigms. The paper provides concrete algorithms for training and sampling, addresses the critical issue of gradient variance through novel noise schedules, and demonstrates strong empirical results. The code and model weights are made available, facilitating adoption and further research.
