IMDM

Infinite Mask Diffusion for Few-Step Distillation

KAIST

TL;DR: We propose the Infinite Mask Diffusion Model (IMDM), which leverages the simple design and effective conditional generation of Masked Diffusion Models while overcoming the theoretical lower bound on their factorization error.

IMDM generates high-quality samples in a few steps, significantly outperforming existing few-step distillation methods at small step counts on both the LM1B and OpenWebText datasets.

Abstract

Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, the explicit distinction between masked tokens and data is what underpins this simple framework and enables effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that the factorization error has a theoretical lower bound, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including compatibility with pre-trained weights. We empirically demonstrate that MDMs fail at few-step generation even on a simple synthetic task because of this error bound, whereas IMDM finds an efficient solution to the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText.
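
To make the factorization error concrete, the following toy sketch (illustrative only; it is not the paper's synthetic task or code) uses two perfectly correlated tokens. Even an oracle MDM that knows the exact per-position posteriors produces invalid pairs half the time in one step, because the deterministic mask gives the two parallel draws no shared randomness to correlate them.

import itertools
import random

random.seed(0)
VOCAB = ["A", "B"]

# Ground-truth data: the two tokens are perfectly correlated, so the only
# valid sequences are ("A", "A") and ("B", "B"), each with probability 0.5.

def mdm_one_step_sample():
    # Oracle MDM from the all-[MASK] state: the exact per-position marginal
    # is uniform over {A, B}, and parallel (factorized) decoding samples the
    # two positions independently of each other.
    return (random.choice(VOCAB), random.choice(VOCAB))

n = 100_000
counts = {pair: 0 for pair in itertools.product(VOCAB, repeat=2)}
for _ in range(n):
    counts[mdm_one_step_sample()] += 1

invalid = (counts[("A", "B")] + counts[("B", "A")]) / n
print(f"fraction of invalid pairs: {invalid:.3f}")  # ~0.500: the error floor

Any single-state mask hits this floor: every fully masked input is identical, so the model's per-position outputs are fixed distributions, and sampling them in parallel is necessarily factorized.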

Key Innovations

  1. A theoretical lower bound on factorization error in MDMs.

    We identify a fundamental lower bound on the factorization error that standard Masked Diffusion Models cannot escape, originating from their use of a single deterministic mask state. This bound explains why MDMs degrade sharply in the few-step regime.

  2. Infinite Mask Diffusion Model (IMDM) with a stochastic infinite-state mask.

    IMDM replaces the deterministic single-state mask with a stochastic, infinite-state mask, mitigating the theoretical error bound while preserving the simplicity and conditional-generation benefits of MDMs, including direct compatibility with pre-trained MDM weights (see the conceptual sketch after this list).
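
The sketch below contrasts the stochastic mask with the deterministic one from the toy example above. It is a conceptual illustration under loudly labeled assumptions: the stochastic mask is realized here as a per-position Gaussian state, and the decoder is hand-crafted rather than learned; IMDM's actual parameterization and training objective are described in the paper.

import random

random.seed(0)

def stochastic_mask_states(num_positions):
    # Infinite-state mask: every masked position draws a continuous state
    # instead of reusing one fixed [MASK] embedding (a standard Gaussian is
    # a hypothetical choice made for this illustration).
    return [random.gauss(0.0, 1.0) for _ in range(num_positions)]

def one_step_decode(mask_states):
    # Hand-crafted stand-in for a trained denoiser: both tokens are decoded
    # from the same piece of mask randomness, so they come out perfectly
    # correlated. In IMDM this mapping is learned, not hard-coded.
    token = "A" if mask_states[0] > 0.0 else "B"
    return (token, token)

n = 100_000
counts = {}
for _ in range(n):
    pair = one_step_decode(stochastic_mask_states(2))
    counts[pair] = counts.get(pair, 0) + 1

# ~{('A', 'A'): 0.5, ('B', 'B'): 0.5}: one step now matches the joint.
print({pair: count / n for pair, count in sorted(counts.items())})

Because the decoder can read the sampled mask states jointly, the randomness of the mask itself carries the correlation between positions, which is exactly the degree of freedom a deterministic single-state mask lacks.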


Results on LM1B

IMDM achieves strong few-step generation on the LM1B benchmark, outperforming existing distillation methods at small step counts.

[Figure: unconditional generation perplexity on LM1B across few-step regimes; panels compare IMDM vs. SDTT, vs. ReDi, and vs. SDTT + ReDi.]


Results on OpenWebText

IMDM continues to outperform baselines on OpenWebText (OWT).

[Figure: generation quality on OpenWebText across unconditional and conditional settings; panels show unconditional PPL, conditional PPL, and conditional MAUVE.]

860M Model Results on OpenWebText

IMDM scales effectively to the 860M model, consistently outperforming baselines on OpenWebText (OWT).

[Figure: generation quality of the 860M model on OpenWebText across unconditional and conditional settings; panels show unconditional PPL, conditional PPL, and conditional MAUVE.]

Citation

@inproceedings{yoo2026imdm,
  title={Infinite Mask Diffusion for Few-Step Distillation},
  author={Yoo, Jaehoon and Kim, Wonjung and Lee, Chanhyuk and Hong, Seunghoon},
  year={2026},
  booktitle={ICML}
}