
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

(2310.05737)
Published Oct 9, 2023 in cs.CV, cs.AI, and cs.MM

Abstract

While LLMs are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

Overview

  • MAGVIT-v2 introduces a refined tokenizer with lookup-free quantization (LFQ) for improved video and image processing within the VQ-VAE framework, enabling superior language model performance over diffusion models.

  • The model outperforms diffusion models on standard image and video generation benchmarks such as ImageNet and Kinetics, and shows promise in video compression and action recognition.

  • Innovations in MAGVIT-v2, chiefly LFQ, let the tokenizer handle much larger vocabularies efficiently, expanding the model's capacity for high-quality visual generation.

  • Empirical validations reveal MAGVIT-v2’s notable advancements in image and video generation, with significant implications for media processing applications and generative model research.

Leveraging Language Model Innovations for Enhanced Visual Generation with MAGVIT-v2

Introduction to MAGVIT-v2

The paper presents MAGVIT-v2, a refined video and image tokenizer that builds on the original MAGVIT within the Vector Quantized Variational AutoEncoder (VQ-VAE) framework. The work introduces a novel quantization method, termed lookup-free quantization (LFQ), along with architectural adaptations that together improve tokenization for both video and image inputs. The refined tokenizer boosts language models' performance, enabling them to surpass diffusion models on image and video generation benchmarks such as ImageNet and Kinetics, a notable step for visual media generation.
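To make the tokenize-then-model pipeline concrete, the sketch below shows how a grid of discrete video tokens becomes an ordinary 1D sequence for a language model. It is a minimal illustration: the shapes, downsampling rates, and the `tokenizer.encode`/`tokenizer.decode` names referenced in the comments are assumptions made for exposition, not the paper's API.

```python
import torch

# Illustrative shapes only; the real tokenizer is a learned encoder-decoder,
# and these downsampling rates are assumptions made for this example.
B = 1                                # batch size
vocab = 2 ** 18                      # vocabulary size (the LFQ scale reported in the paper)
t, h, w = 5, 16, 16                  # assumed latent grid after spatiotemporal downsampling

# Stand-in for tokenizer.encode(video): a grid of discrete token ids.
tokens = torch.randint(vocab, (B, t, h, w))

# Flattening the grid yields exactly what an autoregressive LLM consumes:
# a sequence of integer tokens over a fixed vocabulary.
seq = tokens.flatten(start_dim=1)    # shape (B, t*h*w) = (1, 1280)
print(seq.shape)                     # torch.Size([1, 1280])

# A language model then performs ordinary next-token prediction over `seq`,
# and generated tokens are mapped back to pixels by tokenizer.decode.
```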

Key Contributions

Several findings and contributions stand out in this work:

  • Enhanced Visual Tokenization: MAGVIT-v2 improves video and image tokenization, particularly through LFQ, which enables efficient handling of the larger vocabularies essential for high-quality generation.
  • Superior Performance Over Diffusion Models: Empirical results show that, with the proposed tokenizer, language models outperform state-of-the-art diffusion models on standard video and image generation benchmarks such as ImageNet and Kinetics.
  • Advancements in Video Compression: Beyond generation, MAGVIT-v2 shows promise in video compression, with quality better than or comparable to contemporary standards such as HEVC and VVC in human evaluations, pointing toward efficient digital media transmission.
  • Improvement in Action Recognition Tasks: The paper also shows that the tokenizer's learned representations transfer effectively to video action recognition, suggesting applicability to broader video understanding tasks.

Architectural and Methodological Innovations

The introduction of LFQ is the pivotal innovation in MAGVIT-v2. By eliminating the embedding lookup from the quantization step, LFQ lets the model handle significantly larger vocabularies without compromising generation quality, a critical improvement over traditional VQ methods. The paper also describes key modifications to the MAGVIT architecture, notably a causal 3D CNN design that lets a single tokenizer with a shared vocabulary process both videos and still images. These technical advances collectively account for the model's improved performance.
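To make LFQ concrete, here is a minimal PyTorch sketch of the core operation: each latent channel is binarized to ±1, and the resulting sign pattern is read directly as an integer token index, so no codebook-embedding table is consulted. The straight-through gradient trick shown is a standard choice for such quantizers rather than a detail confirmed from the paper, and the entropy penalty the paper adds during training is omitted.

```python
import torch

def lfq_quantize(z):
    """Minimal lookup-free quantization (LFQ) sketch.

    Each latent channel is quantized independently to {-1, +1}, so a
    d-dimensional latent addresses a vocabulary of 2**d tokens without
    any embedding lookup. Omits the paper's training-time entropy penalty.

    Args:
        z: latent tensor of shape (..., d), where d = log2(vocab size).
    Returns:
        q:   quantized latents in {-1, +1}, same shape as z.
        idx: integer token ids of shape (...,), each in [0, 2**d).
    """
    sign = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    # Straight-through estimator (a common choice): forward uses sign(z),
    # backward passes gradients through to z unchanged.
    q = z + (sign - z).detach()

    # Read the sign pattern as a binary number to obtain the token id.
    bits = (z > 0).long()                                     # (..., d)
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)  # (d,)
    idx = (bits * powers).sum(dim=-1)                         # (...,)
    return q, idx

# Example: an 18-bit latent gives a 2**18 = 262,144-token vocabulary.
z = torch.randn(4, 18, requires_grad=True)
q, idx = lfq_quantize(z)
print(q.shape, idx.shape)  # torch.Size([4, 18]) torch.Size([4])
```

Because the token id is just the binary reading of the sign pattern, growing the vocabulary only means adding latent channels, which is what makes the very large vocabularies described above practical.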

Empirical Validation

The paper substantiates its claims through extensive empirical validation. On ImageNet image generation, MAGVIT-v2 achieves noteworthy FID improvements over leading diffusion models. On video generation benchmarks, the model likewise achieves superior FVD scores, underscoring the efficacy of the proposed tokenizer and the language model's capacity to handle complex visual generation tasks.

Implications and Future Prospects

The findings of this study have significant implications for both practical applications in media processing and theoretical advancements in generative model research. The success of MAGVIT-v2 in surpassing diffusion models in key benchmarks encourages further exploration of language models' potential in visual tasks. Moreover, the advancements in video compression suggest possible applications in reducing bandwidth and storage requirements for video content, which is of particular interest in the era of high-resolution digital media. Future research could explore the integration of these tokenization techniques across diverse modalities and the continued refinement of language models for even more challenging generative tasks.

Conclusion

MAGVIT-v2 represents a significant stride in the realm of visual tokenization, enabling language models to excel in image and video generation tasks traditionally dominated by diffusion models. Through technical innovations such as LFQ and targeted architectural adjustments, this work opens new avenues for research and application in visual media processing, underlining the versatility and potential of language models in understanding and generating visual content.
