Abstract

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Omnipotent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.

Figure: Lumina-mGPT's versatility in handling diverse multimodal tasks.

Overview

  • Lumina-mGPT is a series of multimodal autoregressive models capable of generating high-quality photorealistic images from textual descriptions using a pretrained decoder-only transformer initialized with multimodal Generative PreTraining.

  • The paper introduces two novel finetuning strategies: Flexible Progressive Supervised Finetuning (FP-SFT) and Omnipotent Supervised Finetuning (Omni-SFT), which enhance image quality and extend the model's capabilities to diverse vision and language tasks.

  • Lumina-mGPT demonstrates significant improvements in photorealistic image generation, outperforming contemporary models and challenging the superiority of diffusion models for image generation, with practical implications across industries like media and e-commerce.

Overview of Lumina-mGPT: A Context-Aware Multimodal Generative Model

The paper "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining" by Liu et al. introduces Lumina-mGPT, a series of multimodal autoregressive models notable for their ability to execute a vast array of vision and language tasks efficiently. These models are particularly adept at generating high-quality photorealistic images from textual descriptions. Unlike previous autoregressive image generation models, Lumina-mGPT leverages a pretrained decoder-only transformer initialized with multimodal Generative PreTraining (mGPT). This approach capitalizes on next-token prediction across extensive text-image sequences to acquire broad multimodal competencies.

Key Contributions

  1. Pretrained Decoder-Only Transformer:

    • Lumina-mGPT employs a pretrained decoder-only transformer as its core architecture. The transformer is initialized from mGPT, which is trained on large-scale interleaved text-image data with a next-token prediction objective. This gives Lumina-mGPT versatile, generalizable multimodal capabilities from the start, a significant departure from the randomly initialized transformers traditionally used in autoregressive image generation. A minimal sketch of this objective appears after this list.
  2. Flexible Progressive Supervised Finetuning (FP-SFT):

    • The authors introduce FP-SFT, a finetuning strategy in which the model is progressively finetuned on high-quality image-text pairs at increasing resolutions. Gradually exposing the model to higher-resolution data improves image quality without compromising its general multimodal capabilities; a sketch of such a schedule also follows the list.
  3. Omnipotent Supervised Finetuning (Omni-SFT):

    • Omni-SFT extends Lumina-mGPT's capabilities beyond text-to-image generation. The finetuning mix incorporates diverse tasks such as segmentation, depth estimation, and multi-turn visual question answering, transforming Lumina-mGPT into a foundation model for task unification; a sketch of this unified sequence format closes out the examples below.
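
To make the first ingredient concrete, here is a minimal sketch (not the authors' code) of the mGPT objective: next-token prediction over an interleaved text-image token stream with a small decoder-only transformer. The vocabulary sizes, model dimensions, and sequence layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192            # illustrative vocabulary sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB                 # one shared token space
D_MODEL, N_HEAD, N_LAYER, MAX_LEN = 256, 4, 2, 512

class TinyDecoderOnly(nn.Module):
    """Decoder-only transformer: causal self-attention over a single token stream."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, 4 * D_MODEL,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids):                      # ids: (batch, seq_len)
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        causal = torch.full((seq, seq), float("-inf"), device=ids.device).triu(1)
        return self.head(self.blocks(x, mask=causal))

# Interleaved sequence: text tokens followed by discrete image tokens (as would
# come from a VQ-style image tokenizer); both are just ids in one vocabulary.
text = torch.randint(0, TEXT_VOCAB, (2, 32))
image = torch.randint(TEXT_VOCAB, VOCAB, (2, 96))
ids = torch.cat([text, image], dim=1)

model = TinyDecoderOnly()
logits = model(ids)
# Next-token prediction: predict token t+1 from tokens up to t, at every position.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
loss.backward()
print(f"loss: {loss.item():.3f}")
```

Because text and image ids share one vocabulary and one loss, nothing in the architecture distinguishes the modalities; in the real system the image ids come from a discrete image tokenizer rather than random sampling.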
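For the second ingredient, the loop below sketches what a progressive-resolution schedule could look like. The stage resolutions, step counts, assumed tokenizer downsampling factor, and stand-in model are all invented for illustration, not the paper's settings.

```python
import torch

PATCH = 16                                       # assumed tokenizer downsampling
STAGES = [(512, 3), (768, 2), (1024, 1)]         # (image resolution, demo steps)
model = torch.nn.Linear(8, 8)                    # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for resolution, steps in STAGES:
    tokens_per_image = (resolution // PATCH) ** 2    # sequences grow per stage
    for _ in range(steps):
        x = torch.randn(tokens_per_image, 8)         # placeholder batch
        loss = model(x).pow(2).mean()                # placeholder objective
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"{resolution}px stage: {tokens_per_image} image tokens per sample")
```

The point of the staged design is that most optimization happens on short, cheap low-resolution sequences, and only the final stages pay the quadratic cost of long high-resolution sequences.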
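For the third ingredient, task unification under a single next-token objective only requires that every task be serialized into one token sequence. The sketch below shows one hypothetical serialization; the task tags and token names are invented, not the paper's exact format.

```python
def build_example(task, prompt, target):
    # Every task is serialized as: task tag, prompt tokens, separator, target tokens.
    return [f"<task:{task}>", *prompt, "<sep>", *target, "<eos>"]

mixture = [
    build_example("t2i", ["a", "red", "fox"], ["<img_101>", "<img_102>"]),
    build_example("depth", ["<img_101>", "<img_102>"], ["<dep_7>", "<dep_9>"]),
    build_example("vqa", ["<img_101>", "what", "animal", "?"], ["a", "fox"]),
]
for seq in mixture:
    print(seq)   # one next-token objective trains on all of these uniformly
```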

Strong Numerical Results and Claims

The paper presents strong results and claims highlighting Lumina-mGPT's capabilities:

Photorealistic Image Generation:

- Lumina-mGPT can generate images at arbitrary resolutions, a notable achievement given the challenges traditional autoregressive models face in scalable high-resolution image synthesis; one way a token stream can encode arbitrary image shapes is sketched after this list.
- Visual comparisons demonstrate Lumina-mGPT's superiority over contemporary models like LlamaGen and Parti, showcasing higher aesthetic quality and finer visual details.
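
To illustrate how a 1-D autoregressive model can handle arbitrary resolutions, the sketch below makes the image shape explicit in the token stream via height/width indicator tokens and a row-boundary token. This specific special-token scheme is an assumption for illustration, not necessarily the paper's exact vocabulary.

```python
SOI, EOI, EOL = "<soi>", "<eoi>", "<eol>"        # hypothetical special tokens

def flatten_image_tokens(grid, height, width):
    """grid: row-major list of height * width discrete image-token ids."""
    assert len(grid) == height * width
    seq = [SOI, f"<h{height}>", f"<w{width}>"]   # shape declared up front
    for r in range(height):
        seq.extend(grid[r * width:(r + 1) * width])
        seq.append(EOL)                          # explicit row boundary
    seq.append(EOI)
    return seq

# A 2x3 token grid: a decoder can recover the image shape from the stream alone.
print(flatten_image_tokens([11, 12, 13, 21, 22, 23], height=2, width=3))
```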

Architectural Simplicity and Versatility:

- The paper presents a compelling argument for the efficacy of a simple decoder-only architecture complemented by multimodal generative pretraining. Lumina-mGPT's unified framework simplifies the generation pipeline, in contrast to the more complex encoder-decoder architectures employed by other models.

Theoretical and Practical Implications

The introduction of Lumina-mGPT has significant implications for both theoretical research and practical applications in AI:

  1. Theoretical Implications:

    • This work challenges the prevailing notion that diffusion models are superior for photorealistic image generation by presenting an autoregressive model that achieves comparable, if not superior, results.
    • The study suggests a promising path forward where multimodal generative pretraining can be an effective initialization strategy for large-scale autoregressive models, potentially leading to more efficient and capable models.
  2. Practical Applications:

    • Lumina-mGPT's ability to generate diverse, high-quality images from textual descriptions can be transformative for industries such as media, entertainment, and e-commerce, where visual content generation is crucial.
    • The model's task unification capability opens new avenues for integrating multiple vision-language tasks into a single framework, simplifying the deployment of AI systems that need to perform a range of image-related and language-related tasks.

Future Developments in AI

Considering the advancements brought by Lumina-mGPT, several future research directions and practical enhancements can be envisaged:

Scaling Data and Computational Resources:

- As noted in the paper, larger datasets and increased computational resources could further enhance Lumina-mGPT's performance, especially in multilingual understanding and complex multi-turn interactions.

Inference Time Optimization:

- Although effective, Lumina-mGPT's autoregressive decoding poses challenges for inference speed. Efficient sampling techniques and inference-time optimizations could significantly reduce generation latency, making the model more practical for real-time applications; a sketch of one standard optimization, key-value caching, follows.
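
As one example of such an optimization, the sketch below shows key-value caching, a standard decoding speedup for decoder-only transformers rather than anything specific to Lumina-mGPT: keys and values of already-generated tokens are stored, so each new token attends to the prefix without re-encoding it.

```python
import torch

D = 64
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
k_cache, v_cache = [], []                        # keys/values of past tokens

def decode_step(x):                              # x: (1, D) new-token embedding
    q = x @ Wq
    k_cache.append(x @ Wk)                       # append once, never recompute
    v_cache.append(x @ Wv)
    K, V = torch.cat(k_cache), torch.cat(v_cache)
    attn = torch.softmax(q @ K.T / D ** 0.5, dim=-1)
    return attn @ V                              # attention output for new token

for _ in range(5):                               # each step is O(prefix) work,
    out = decode_step(torch.randn(1, D))         # not a full prefix re-encode
print(out.shape)                                 # torch.Size([1, D])
```

Further speedups such as speculative decoding build on this same cached decoding loop.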

Enhancing Base Representations:

- Incorporating larger and more diversified datasets during pretraining could improve the model’s understanding of various languages and complex visual concepts, thus expanding its applicability and robustness.

In conclusion, Lumina-mGPT introduces a robust framework for multimodal text-to-image generation and task unification by leveraging multimodal generative pretraining within a simple decoder-only architecture. This work paves the way for future research to further refine and optimize such models, thereby pushing the boundaries of what is achievable in the fields of image generation and multimodal AI.
