
Simple and Effective Masked Diffusion Language Models (2406.07524v2)

Published 11 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We provide the code, along with a blog post and video tutorial on the project page: https://s-sahoo.com/mdlm
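
To make the "mixture of classical masked language modeling losses" concrete, the following is a minimal sketch of a weighted masked cross-entropy objective in that spirit. It is not the paper's released code: it assumes a log-linear noise schedule alpha_t = 1 - t (under which the per-sequence weight reduces to 1/t), and the MASK_ID constant and the model(x) -> logits interface are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103  # hypothetical [MASK] token id; depends on the tokenizer


def masked_diffusion_loss(model, x0, eps=1e-4):
    """Sketch of a weighted masked-LM loss in the spirit of the paper's objective.

    Assumes a log-linear schedule alpha_t = 1 - t, so P(mask) = t and the
    weight -alpha_t' / (1 - alpha_t) simplifies to 1 / t.

    x0: LongTensor (batch, seq_len) of clean token ids.
    model: any encoder mapping token ids -> logits of shape (batch, seq_len, vocab).
    """
    b, l = x0.shape
    # Sample a diffusion time t ~ U(eps, 1) per sequence.
    t = torch.rand(b, 1, device=x0.device) * (1 - eps) + eps
    # Mask each token independently with probability 1 - alpha_t = t.
    is_masked = torch.rand(b, l, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)

    logits = model(xt)  # (b, l, vocab)
    nll = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (b, l)
    # Only masked positions contribute; weight each sequence by 1 / t.
    loss = ((nll * is_masked).sum(dim=1) / t.squeeze(1)).mean()
    return loss
```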


Summary

  • The paper introduces a masked diffusion language model built around the SUBS parameterization, reported both when trained from scratch and when fine-tuned from a pre-trained model.
  • It relies on a masking-based corruption process, so the model learns to reconstruct masked tokens from the surrounding unmasked context, a setup that maps naturally onto masked language modeling.
  • Experimental results show that the fine-tuned model outperforms competing methods on genomic sequence classification, with notable accuracy gains on datasets such as Mouse Enhancers and Human NonTATA Promoters.

Simple and Effective Masked Diffusion Language Models

This paper introduces a simplified approach to masked diffusion language models, aiming for strong performance with minimal complexity. The central component is the SUBS parameterization of the denoising model; in the genomic sequence classification experiments summarized here, the resulting models are evaluated both when trained from scratch and when fine-tuned from a pre-trained backbone.

Methodology

The paper diverges from Gaussian diffusion by using an absorbing-state forward process that selectively corrupts the input sequence, replacing tokens with a mask token according to a noise schedule. The denoising model then predicts the original tokens from the remaining unmasked context, so training reduces to a weighted form of masked language modeling. For the genomic experiments, two regimes are compared: training the masked diffusion model from scratch and fine-tuning it from a pre-trained model. Combining general pre-training with task-specific fine-tuning lets the model learn broad sequence patterns and then adapt to each classification task, as illustrated by the sampling sketch below.
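
Because unmasked tokens are never re-corrupted, generation can proceed by iteratively revealing masked positions. The following is a minimal sketch of such an unmasking sampler, not the paper's implementation: the linear reveal schedule, the MASK_ID constant, and the model(x) -> logits interface are illustrative assumptions carried over from the earlier sketch.

```python
import torch

MASK_ID = 103  # hypothetical [MASK] token id, as in the loss sketch above


@torch.no_grad()
def unmasking_sampler(model, seq_len, num_steps=64, device="cpu"):
    """Sketch of an iterative unmasking sampler for a masked diffusion model.

    Starts from an all-[MASK] sequence and, over num_steps rounds, reveals a
    growing fraction of positions by sampling from the model's predictions.
    Once a position is revealed it is never re-masked.
    """
    x = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = model(x)                                  # (1, seq_len, vocab)
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.distributions.Categorical(probs=probs).sample()
        still_masked = (x == MASK_ID)
        # Reveal roughly 1/k of the remaining masked positions this round.
        reveal_prob = 1.0 / (num_steps - step)
        reveal = still_masked & (torch.rand_like(x, dtype=torch.float) < reveal_prob)
        x = torch.where(reveal, sampled, x)                # revealed tokens stay fixed
    return x
```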

Experimental Results

The efficacy of the proposed method is demonstrated through a series of experiments on genomic sequence classification tasks. The results indicate that SUBS achieves competitive performance compared to existing methods such as Mamba, SEDD, Caduceus, Plaid, and D3PM. Notably, the fine-tuned SUBS model often outperforms the alternatives, highlighting the benefit of fine-tuning from a pre-trained model. For instance, on the "Mouse Enhancers" dataset, SUBS (fine-tuned) achieves an accuracy of 0.795, outperforming Mamba (0.763) and D3PM (0.787). Similarly, on the "Human NonTATA Promoters" dataset, SUBS (fine-tuned) reaches an accuracy of 0.938, surpassing Mamba (0.926) and SEDD (0.935).

Implications and Future Directions

The presented approach offers a practical and efficient method for training masked diffusion language models. The simplicity of the masking strategy and the effectiveness of the SUBS training regime make it an attractive option for researchers and practitioners. Future work could explore the application of this method to other language modeling tasks and investigate the impact of different masking strategies and fine-tuning techniques. Additionally, it would be interesting to analyze the computational efficiency and scalability of the proposed method in more detail.

Conclusion

The paper introduces a refined approach to masked diffusion language models, achieving commendable results on genomic sequence classification tasks. The simplicity and efficacy of the proposed method make it a valuable addition to the field, offering a strong baseline for future research.
