
ELF: Continuous Diffusion for Language Without Token-Level Supervision

This talk introduces Embedded Language Flows (ELF), a continuous diffusion model that generates language by denoising entirely in embedding space and discretizing only at the final step. By operating without intermediate token supervision or separate decoder architectures, ELF achieves competitive generation quality and data efficiency compared to discrete diffusion models while maintaining the theoretical elegance of continuous-time flow matching.

Script

Diffusion models have revolutionized image generation by operating in continuous spaces, but language models have stubbornly remained discrete. ELF breaks that pattern by denoising entirely in embedding space and postponing token selection to the very last moment.

The training process is surprisingly minimalist. A frozen pretrained encoder maps tokens to clean embeddings, the embeddings are corrupted with noise, and a single network is trained to predict the clean embeddings back. No decoder runs during training; the same network learns both denoising and final discretization with shared weights.
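To make the loop concrete, here is a minimal NumPy sketch of one training step. It is an illustration under stated assumptions, not the authors' implementation: the linear interpolation schedule, the mean-squared-error loss, the random lookup table standing in for the frozen encoder, and the one-layer `denoiser` are all hypothetical, and the discretization part of the objective is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, SEQ, BATCH = 50, 8, 6, 4

# Assumption: a fixed random lookup table stands in for the frozen encoder.
frozen_table = rng.normal(size=(VOCAB, DIM))

def encode(tokens):
    """Map token ids to clean embeddings; the table is never updated."""
    return frozen_table[tokens]

def corrupt(x_clean, t, noise):
    """Assumed linear schedule: t=0 is pure noise, t=1 is the clean embedding."""
    return t * x_clean + (1.0 - t) * noise

def denoiser(z_t, t, weights):
    """Hypothetical one-layer network that predicts the clean embeddings."""
    t_feat = np.broadcast_to(t, z_t.shape[:-1] + (1,))
    return np.concatenate([z_t, t_feat], axis=-1) @ weights

def training_loss(tokens, weights, rng):
    x_clean = encode(tokens)                        # (batch, seq, dim)
    t = rng.uniform(size=(tokens.shape[0], 1, 1))   # one time per sequence
    noise = rng.normal(size=x_clean.shape)
    z_t = corrupt(x_clean, t, noise)
    x_pred = denoiser(z_t, t, weights)
    # Assumed loss: mean squared error against the clean embeddings.
    return np.mean((x_pred - x_clean) ** 2)

tokens = rng.integers(0, VOCAB, size=(BATCH, SEQ))
weights = rng.normal(scale=0.1, size=(DIM + 1, DIM))
loss = training_loss(tokens, weights, rng)
```

Note that nothing in the loop decodes back to tokens: the only supervision in this sketch is the clean embedding itself.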

A critical design choice emerges in what the network predicts. The authors compare three targets: clean embeddings, velocity, and noise. Clean embedding prediction remains stable even as embedding dimension scales from 512 to 1024, while velocity and noise predictions degrade or collapse entirely, supporting the hypothesis that language data lives on low-dimensional manifolds.
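The three targets are algebraically interchangeable once the noising schedule is fixed, which is why the choice is about what the network must output rather than what information it has. The sketch below assumes a linear schedule, z_t = t * x + (1 - t) * eps, so that velocity is x - eps; the schedule and the helper names are assumptions for illustration and may differ from the paper's parameterization.

```python
import numpy as np

def velocity_from_clean(z_t, x_pred, t):
    """Velocity implied by a clean-embedding prediction (valid for t < 1)."""
    return (x_pred - z_t) / (1.0 - t)

def noise_from_clean(z_t, x_pred, t):
    """Noise implied by a clean-embedding prediction (valid for t < 1)."""
    return (z_t - t * x_pred) / (1.0 - t)

def clean_from_velocity(z_t, v_pred, t):
    """Clean embedding implied by a velocity prediction."""
    return z_t + (1.0 - t) * v_pred

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))      # clean embeddings
eps = rng.normal(size=(3, 4))    # Gaussian noise
t = 0.3
z_t = t * x + (1.0 - t) * eps    # assumed linear interpolation
```

Under this schedule, a network that outputs noise or velocity has to reproduce a full-dimensional Gaussian component, whereas a clean-embedding output only has to land near the data, which is one way to read the low-dimensional manifold argument.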

On OpenWebText, ELF with just 105 million parameters reaches a generative perplexity of 24 in 32 sampling steps, outperforming both discrete diffusion models and prior continuous approaches. Remarkably, it does so with substantially fewer training tokens than competing methods, and it even rivals distilled baselines that require additional training rounds.

The denoising trajectory reveals how language emerges from noise. As the diffusion time parameter increases from 0 to 1, the model turns ungrammatical fragments into fluent, coherent sentences, refining syntax and semantics at each step while staying in continuous embedding space.
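A trajectory like this can be sketched as Euler integration from t = 0 to t = 1, with tokens chosen only after the last step. Everything here is an assumption for illustration: the linear schedule, the Euler solver, the `oracle` denoiser that always returns the target embeddings, and especially the nearest-neighbor lookup, which is a stand-in for the discretization that ELF performs with the network itself.

```python
import numpy as np

def sample(denoiser, table, seq_len, num_steps=32, seed=0):
    """Integrate from noise (t=0) to data (t=1), then discretize once."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(seq_len, table.shape[1]))  # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                                  # t never reaches 1 here
        x_pred = denoiser(z, t)                     # predicted clean embeddings
        velocity = (x_pred - z) / (1.0 - t)         # assumed linear schedule
        z = z + dt * velocity                       # Euler step
    # Stand-in discretization: nearest row of the embedding table.
    dists = ((z[:, None, :] - table[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)

rng = np.random.default_rng(1)
table = rng.normal(size=(50, 8))                    # hypothetical embeddings
target_tokens = np.array([3, 14, 15, 9, 26])

def oracle(z, t):
    """Hypothetical perfect denoiser used only to exercise the loop."""
    return table[target_tokens]

decoded = sample(oracle, table, seq_len=len(target_tokens))
```

With a perfect denoiser the final Euler step lands on the predicted clean embeddings, so the lookup recovers the target tokens; a trained network would replace `oracle`.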

ELF demonstrates that language modeling can adopt the continuous-time paradigms that revolutionized vision models, achieving strong results without intermediate token supervision. If you're intrigued by this shift from discrete to continuous generation, explore the full paper and create your own explainer videos at EmergentMind.com.