Abstract

Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by the finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval. Empirically, RADD is up to 3.5 times faster while consistently achieving a better performance than the strongest baseline. Built upon the new factorization of the concrete score, we further prove a surprising result that the exact likelihood of absorbing diffusion can be rewritten to a simple form (named denoising cross-entropy) and then estimated efficiently by the Monte Carlo method. The resulting approach also applies to the original parameterization of the concrete score. It significantly advances the state-of-the-art discrete diffusion on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale.

Figure: Expected number of function evaluations (E-NFEs) over sampling steps using Tweedie τ-leaping with a log-linear noise schedule.

Overview

  • The paper introduces the Reparameterized Absorbing Discrete Diffusion (RADD) model, improving efficiency and theoretical foundations of discrete diffusion models by focusing on time-independent conditional probabilities.

  • A new denoising cross-entropy (DCE) loss is formulated, allowing the exact likelihood of absorbing discrete diffusion to be computed efficiently via Monte Carlo estimation, yielding a more precise objective for both training and evaluation.

  • Empirical results show that RADD achieves faster sampling speeds and better performance in zero-shot language modeling benchmarks, advancing the state-of-the-art in discrete diffusion models.

Analytical Review of "Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data"

In the paper titled "Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data," the authors introduce key advancements in discrete diffusion models for language modeling. By exploring the intrinsic properties of absorbing discrete diffusion, they derive a novel factorization of the concrete score, expressing it as conditional probabilities of the clean data multiplied by an analytically derived time-dependent scalar. Leveraging this insight, the paper proposes the reparameterized absorbing discrete diffusion (RADD) model, which brings significant improvements in both theoretical grounding and practical application.

Key Contributions

  1. Analytic Form of Concrete Score: The paper reveals that the concrete score in absorbing diffusion can be reformulated as a conditional distribution of the original data, scaled by a time-dependent term (see the factorization sketched after this list). This finding not only elucidates the theoretical basis of the "scaling trick" used in prior works (such as SEDD) but also simplifies score matching by removing the time dependence from the quantity being estimated.
  2. Reparameterized Absorbing Discrete Diffusion (RADD): The RADD model parameterizes only the time-independent conditional probabilities. This simplification allows a caching strategy during sampling that significantly reduces the number of function evaluations (NFEs). Empirical results show that RADD samples up to 3.5 times faster than existing methods while maintaining or improving performance.
  3. Denoising Cross-Entropy (DCE) Loss: A surprising result of the paper is a new loss formulation, denoising cross-entropy, which enables exact likelihood computation. This loss equals the exact negative log-likelihood of the absorbing diffusion model, offering a theoretically exact objective for both optimization and evaluation.
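
The factorization in the first contribution can be written compactly. The following LaTeX sketch states it for a single masked position, using the cumulative noise $\bar{\sigma}(t)=\int_0^t \sigma(s)\,ds$; the notation ($[\mathrm{M}]$ for the mask token, $x_t^{\mathrm{UM}}$ for the unmasked tokens) follows the paper's setup as we read it, so treat the exact symbols as illustrative rather than a verbatim restatement.

```latex
% Concrete score of absorbing diffusion at a masked position i:
% the ratio of marginals reduces to a clean-data conditional times an analytic time scalar.
\begin{equation}
  \underbrace{\frac{p_t(\hat{x}_t)}{p_t(x_t)}}_{\text{concrete score}}
  \;=\;
  \underbrace{\frac{e^{-\bar{\sigma}(t)}}{1 - e^{-\bar{\sigma}(t)}}}_{\text{analytic time scalar}}
  \cdot
  \underbrace{p_0\!\left(\hat{x}^{\,i} \,\middle|\, x_t^{\mathrm{UM}}\right)}_{\text{time-independent conditional}},
  \qquad
  x_t^{\,i} = [\mathrm{M}],\;\;
  \hat{x}_t \text{ differs from } x_t \text{ only at } i,\;\;
  \hat{x}^{\,i} \neq [\mathrm{M}].
\end{equation}
```

Because the conditional factor does not depend on $t$, the network can drop its time input and the scalar can be applied analytically, which is why the "scaling trick" used in prior work turns out to be principled.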

Empirical Validation

The empirical results of RADD demonstrate substantial efficiency gains and performance improvements:

  • Sampling Efficiency: By caching the time-independent network output, RADD achieves significant reductions in NFEs; the measured expected number of function evaluations (E-NFEs) aligns well with the theoretical prediction, particularly under a log-linear noise schedule (a minimal sketch of this caching strategy follows this list).
  • Zero-Shot Language Modeling: On zero-shot language modeling benchmarks such as LAMBADA and WikiText2, RADD, particularly when trained with the DCE loss, shows consistent improvements over the strongest baseline (SEDD with the scaling trick). Combined with the exact likelihood evaluation introduced in the paper, this advances the state of the art for discrete diffusion models at the GPT-2 scale.
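
As a concrete illustration of the caching idea, here is a minimal Python sketch of a sampling loop that reuses the time-independent network output whenever the noisy sequence did not change in the previous interval. The names (`sample_with_cache`, `step_fn`) and the update interface are hypothetical placeholders rather than the paper's implementation; the point is only that an unchanged input lets the realized NFE fall below the number of sampling steps.

```python
import torch

def sample_with_cache(model, x, num_steps, step_fn):
    """Sampling loop that reuses the network output when x is unchanged.

    model:   time-independent network mapping token ids -> per-position
             conditional distributions over clean tokens (RADD-style).
    x:       (batch, length) tensor of token ids, initially all mask tokens.
    step_fn: one reverse-diffusion update (e.g., a tau-leaping step) that
             takes (x, probs, step index) and returns the next x.
    """
    cached_probs = None   # last network output
    prev_x = None         # input that produced it
    nfe = 0               # number of function evaluations actually spent

    for step in range(num_steps):
        # Only re-run the network if the sequence changed since the last call.
        if prev_x is None or not torch.equal(x, prev_x):
            cached_probs = model(x)   # no time argument: output is time-independent
            prev_x = x.clone()
            nfe += 1
        x = step_fn(x, cached_probs, step)

    return x, nfe
```

With an absorbing process, many intervals unmask no token, so `x` is frequently unchanged between steps and the expected NFE can be far below `num_steps`, consistent with the E-NFE behavior referenced above.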

Theoretical Implications

The theoretical implications of this research are manifold. By reducing score matching to estimating conditional distributions of clean data, the authors provide a framework that is simpler to parameterize and optimize. Furthermore, the DCE loss opens new avenues for exact likelihood training and evaluation in discrete diffusion models, which may inspire subsequent work to refine and apply these ideas to larger and more diverse datasets.
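
To make the DCE idea concrete, the following schematic shows the loss as we read it: a cross-entropy over the masked positions of the noisy sample, averaged over time and noise with a schedule-dependent weight. The symbol $w(t)$ stands in for the exact analytic weight derived in the paper (determined by the noise schedule $\sigma$), so this is a hedged sketch rather than the paper's precise statement.

```latex
% Denoising cross-entropy (schematic): the likelihood is an expectation, over time t
% and noisy samples x_t ~ q(x_t | x_0), of a weighted cross-entropy on masked
% positions, and is therefore estimable by plain Monte Carlo sampling.
\begin{equation}
  -\log p_0(x_0)
  \;=\;
  \mathbb{E}_{t}\,\mathbb{E}_{x_t \sim q(x_t \mid x_0)}
  \Bigg[
    w(t) \sum_{i \,:\, x_t^{\,i} = [\mathrm{M}]}
    -\log p_\theta\!\left(x_0^{\,i} \,\middle|\, x_t\right)
  \Bigg].
\end{equation}
```

Because every term is a conditional probability the network already outputs, the expectation can be estimated with Monte Carlo samples of $(t, x_t)$, which is how the paper obtains exact likelihood estimates rather than variational bounds.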

Future Developments

Future developments based on this research are promising. Potential areas of exploration include:

  • Scaling: Investigate the performance and efficacy of RADD and related models at larger scales, possibly leveraging more extensive training datasets and compute resources.
  • Integration with Transformer Architectures: Enhance the integration of RADD models with current state-of-the-art transformer architectures for further improvements in both generative quality and computational efficiency.
  • Variable-Length Outputs: Address current limitations around generating variable-length outputs to match the flexibility observed in autoregressive models.

Conclusion

The paper "Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data" represents a crucial step forward in maximizing the efficiency and theoretical robustness of discrete diffusion models. By grounding the model in the conditional distributions of clean data and introducing a streamlined and exact approach to likelihood estimation, the authors have provided a foundation upon which future work can build. The empirical success of RADD in terms of both sampling speed and language modeling performance signals significant potential for widespread application and further refinement in the context of advanced AI and machine learning research.
