
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

(2310.05737)
Published Oct 9, 2023 in cs.CV, cs.AI, and cs.MM

Abstract

While LLMs are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

Overview

  • MAGVIT-v2 introduces a refined tokenizer with lookup-free quantization (LFQ) for improved video and image processing within the VQ-VAE framework, enabling superior language model performance over diffusion models.

  • The model outperforms diffusion models on standard image and video generation benchmarks such as ImageNet and Kinetics, and shows promise in video compression and action recognition.

  • Innovations in MAGVIT-v2, chiefly LFQ, let the tokenizer handle much larger vocabularies efficiently, expanding the model's capacity for high-quality visual generation.

  • Empirical validations reveal MAGVIT-v2’s notable advancements in image and video generation, with significant implications for media processing applications and generative model research.

Leveraging Language Model Innovations for Enhanced Visual Generation with MAGVIT-v2

Introduction to MAGVIT-v2

The paper presents MAGVIT-v2, a refined video and image tokenizer that builds on the original MAGVIT within the Vector Quantized Variational AutoEncoder (VQ-VAE) framework. The work introduces a novel quantization method, termed lookup-free quantization (LFQ), along with architectural adaptations that together improve tokenization for both video and image inputs. The refined tokenizer boosts language models' performance, enabling them to surpass diffusion models on image and video generation benchmarks such as ImageNet and Kinetics, a notable step for visual media generation.
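To make the tokenize-then-model pipeline concrete, the sketch below shows how a grid of discrete video tokens becomes an ordinary 1D sequence for a language model. It is a minimal illustration: the shapes, downsampling rates, and the `tokenizer.encode`/`tokenizer.decode` names referenced in the comments are assumptions made for exposition, not the paper's API.

```python
import torch

# Illustrative shapes only; the real tokenizer is a learned encoder-decoder,
# and these downsampling rates are assumptions made for this example.
B = 1                                # batch size
vocab = 2 ** 18                      # vocabulary size (the LFQ scale reported in the paper)
t, h, w = 5, 16, 16                  # assumed latent grid after spatiotemporal downsampling

# Stand-in for tokenizer.encode(video): a grid of discrete token ids.
tokens = torch.randint(vocab, (B, t, h, w))

# Flattening the grid yields exactly what an autoregressive LLM consumes:
# a sequence of integer tokens over a fixed vocabulary.
seq = tokens.flatten(start_dim=1)    # shape (B, t*h*w) = (1, 1280)
print(seq.shape)                     # torch.Size([1, 1280])

# A language model then performs ordinary next-token prediction over `seq`,
# and generated tokens are mapped back to pixels by tokenizer.decode.
```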

Key Contributions

Several findings and contributions stand out in this work:

  • Enhanced Visual Tokenization: MAGVIT-v2 improves video and image tokenization, particularly through LFQ, which enables efficient handling of the larger vocabularies essential for high-quality generation.
  • Superior Performance Over Diffusion Models: Empirical results show that, with the proposed tokenizer, language models outperform state-of-the-art diffusion models on standard video and image generation benchmarks such as ImageNet and Kinetics.
  • Advancements in Video Compression: Beyond generation, MAGVIT-v2 shows promise in video compression, with quality better than or comparable to contemporary standards such as HEVC and VVC in human evaluations, pointing toward efficient digital media transmission.
  • Improvement in Action Recognition Tasks: The paper also shows that the tokenizer's learned representations transfer effectively to video action recognition, suggesting applicability to broader video understanding tasks.

Architectural and Methodological Innovations

The introduction of LFQ is the pivotal innovation in MAGVIT-v2. By eliminating the embedding lookup from the quantization step, LFQ lets the model handle significantly larger vocabularies without compromising generation quality, a critical improvement over traditional VQ methods. The paper also describes key modifications to the MAGVIT architecture, notably a causal 3D CNN design that lets a single tokenizer with a shared vocabulary process both videos and still images. These technical advances collectively account for the model's improved performance.
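To make LFQ concrete, here is a minimal PyTorch sketch of the core operation: each latent channel is binarized to ±1, and the resulting sign pattern is read directly as an integer token index, so no codebook-embedding table is consulted. The straight-through gradient trick shown is a standard choice for such quantizers rather than a detail confirmed from the paper, and the entropy penalty the paper adds during training is omitted.

```python
import torch

def lfq_quantize(z):
    """Minimal lookup-free quantization (LFQ) sketch.

    Each latent channel is quantized independently to {-1, +1}, so a
    d-dimensional latent addresses a vocabulary of 2**d tokens without
    any embedding lookup. Omits the paper's training-time entropy penalty.

    Args:
        z: latent tensor of shape (..., d), where d = log2(vocab size).
    Returns:
        q:   quantized latents in {-1, +1}, same shape as z.
        idx: integer token ids of shape (...,), each in [0, 2**d).
    """
    sign = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    # Straight-through estimator (a common choice): forward uses sign(z),
    # backward passes gradients through to z unchanged.
    q = z + (sign - z).detach()

    # Read the sign pattern as a binary number to obtain the token id.
    bits = (z > 0).long()                                     # (..., d)
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)  # (d,)
    idx = (bits * powers).sum(dim=-1)                         # (...,)
    return q, idx

# Example: an 18-bit latent gives a 2**18 = 262,144-token vocabulary.
z = torch.randn(4, 18, requires_grad=True)
q, idx = lfq_quantize(z)
print(q.shape, idx.shape)  # torch.Size([4, 18]) torch.Size([4])
```

Because the token id is just the binary reading of the sign pattern, growing the vocabulary only means adding latent channels, which is what makes the very large vocabularies described above practical.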

Empirical Validation

The paper substantiates its claims through extensive empirical validation. On ImageNet image generation, MAGVIT-v2 achieves noteworthy FID improvements over leading diffusion models. On video generation benchmarks, the model likewise achieves superior FVD scores, underscoring the efficacy of the proposed tokenizer and the language model's capacity to handle complex visual generation tasks.

Implications and Future Prospects

The findings of this study have significant implications for both practical applications in media processing and theoretical advancements in generative model research. The success of MAGVIT-v2 in surpassing diffusion models in key benchmarks encourages further exploration of language models' potential in visual tasks. Moreover, the advancements in video compression suggest possible applications in reducing bandwidth and storage requirements for video content, which is of particular interest in the era of high-resolution digital media. Future research could explore the integration of these tokenization techniques across diverse modalities and the continued refinement of language models for even more challenging generative tasks.

Conclusion

MAGVIT-v2 represents a significant stride in the realm of visual tokenization, enabling language models to excel in image and video generation tasks traditionally dominated by diffusion models. Through technical innovations such as LFQ and targeted architectural adjustments, this work opens new avenues for research and application in visual media processing, underlining the versatility and potential of language models in understanding and generating visual content.
