
Abstract

Sora unveils the potential of scaling Diffusion Transformers for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family, a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.

Table: Variants of Lumina-T2X, showing text encoder options, Flag-DiT sizes, prediction targets, VAE sizes, and other configurations.

Overview

  • Lumina-T2X introduces Flow-based Large Diffusion Transformers (Flag-DiT) to effectively transform textual instructions into diverse digital formats like images, videos, 3D models, and audio clips.

  • The system includes enhancements in tokenization and architectural stability, supporting high-resolution outputs and flexible handling of multiple data modalities with features like RoPE, RMSNorm, and flow matching.

  • Lumina-T2X operates more efficiently during training than previous models, showcasing exceptional capabilities in multimodal content generation that can significantly impact fields like digital media and virtual reality.

Exploring Lumina-T2X: Scaling Diffusion Transformers for Multimodal Content Generation

Introduction to Lumina-T2X

Lumina-T2X represents a significant advancement in multimodal content generation with AI. The system employs a family of models known collectively as Flow-based Large Diffusion Transformers (Flag-DiT), designed to generate images, videos, multi-view 3D objects, and audio clips from textual instructions. Leveraging modifications such as zero-initialized attention and a unified tokenization strategy for handling various data modalities, Lumina-T2X aims to seamlessly convert textual descriptions into high-quality multimodal outputs.
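To make the zero-initialized attention idea concrete, here is a minimal PyTorch sketch. The class name, the use of a separate gated cross-attention branch, and the tanh gating are illustrative assumptions rather than the paper's exact formulation; the common thread is a gate initialized to zero, so the block starts as an identity mapping and learns to inject text conditioning gradually.

```python
import torch
import torch.nn as nn

class ZeroInitAttention(nn.Module):
    """Hypothetical sketch of zero-initialized (gated) cross-attention.

    Text conditioning enters through an attention branch whose output
    is scaled by a learnable gate initialized to zero, so at the start
    of training the block contributes nothing and conditioning is
    learned gradually without destabilizing the base model.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at zero: the text branch is silent at initialization.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) latent tokens; text: (B, L, D) caption embeddings.
        attn_out, _ = self.attn(query=x, key=text, value=text)
        return x + torch.tanh(self.gate) * attn_out
```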

Core Components of Lumina-T2X

Unified Framework and Tokenization

A standout feature of Lumina-T2X is its unified approach to handling different data modalities. This is achieved by tokenizing the latent spatial-temporal space. The system uses learnable placeholders such as [nextline] and [nextframe] tokens, enabling it to handle various resolutions, aspect ratios, and durations within one sequence format. This flexibility proves especially useful during inference, allowing output resolution and length to be adjusted to the desired specifications.
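A minimal sketch of how such placeholder-based tokenization might work (the function name and the exact delimiter placement are illustrative assumptions): flatten the latent grid row by row and frame by frame, inserting a learnable [nextline] token after each row and a [nextframe] token after each frame, so a single 1-D sequence covers any resolution, aspect ratio, and duration.

```python
import torch

def flatten_latents(latent: torch.Tensor,
                    nextline_tok: torch.Tensor,
                    nextframe_tok: torch.Tensor) -> torch.Tensor:
    """Flatten a (T, H, W, D) latent grid into a 1-D token sequence,
    appending [nextline] after each row and [nextframe] after each
    frame, yielding (T * (H * (W + 1) + 1), D) tokens."""
    T, H, W, D = latent.shape
    seq = []
    for t in range(T):
        for h in range(H):
            seq.append(latent[t, h])             # W tokens for this row
            seq.append(nextline_tok.view(1, D))  # row delimiter
        seq.append(nextframe_tok.view(1, D))     # frame delimiter
    return torch.cat(seq, dim=0)

# Usage: 2 frames of a 4x4 latent grid with dim 64.
lat = torch.randn(2, 4, 4, 64)
nl = torch.nn.Parameter(torch.zeros(64))  # learnable [nextline]
nf = torch.nn.Parameter(torch.zeros(64))  # learnable [nextframe]
tokens = flatten_latents(lat, nl, nf)     # shape (42, 64)
```

Because the delimiters, not the sequence length, encode the grid layout, the same model can consume a portrait image, a widescreen image, or a video clip without architecture changes.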

Architectural Enhancements

Lumina-T2X implements several architectural improvements to enhance stability, flexibility, and scalability. Noteworthy among these are Rotary Position Embedding (RoPE), RMSNorm, and flow matching. These techniques not only stabilize the training process but also allow Flag-DiT to scale up to 7 billion parameters with a context window of up to 128K tokens, facilitating the generation of ultra-high-definition images and long 720p videos.
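The sketches below illustrate two of these components in PyTorch: an RMSNorm layer, and a rectified-flow-style training step as one common instantiation of flow matching. The model's call signature `model(xt, t, cond)` is an assumption for illustration.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: normalize by the root mean square (no mean subtraction
    or bias), which tends to be more stable than LayerNorm at scale."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def flow_matching_loss(model, x1: torch.Tensor, cond) -> torch.Tensor:
    """Rectified-flow-style step: interpolate linearly between noise x0
    and data x1, and regress the predicted velocity onto (x1 - x0)."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform timestep
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcastable t
    xt = (1 - t_) * x0 + t_ * x1                   # point on linear path
    v_target = x1 - x0                             # constant velocity
    v_pred = model(xt, t, cond)                    # assumed signature
    return (v_pred - v_target).pow(2).mean()
```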

Practical Implications and Performance

Efficiency in Training

Lumina-T2X demonstrates a notable reduction in computational cost during training. Lumina-T2I, a member of the Lumina-T2X family powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training compute of the 600-million-parameter naive DiT (PixArt-α) while achieving similar or superior results. This efficiency is largely attributed to the larger model size and the optimized training objective, which together accelerate convergence significantly.

Multi-Modality and High Resolution

One of the groundbreaking capabilities of Lumina-T2X is its proficiency in handling different modalities within a single framework. The ability to generate not only images and videos but also multi-view 3D objects and audio clips from textual descriptions marks a significant step forward for generative AI systems. Moreover, its ability to generate at resolutions beyond those seen during training (resolution extrapolation) broadens its potential in applications such as digital media, entertainment, and virtual reality.
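One widely used recipe for such resolution extrapolation is NTK-aware rescaling of the RoPE base, sketched below. The function name is hypothetical and the exact scheme Lumina-T2X uses may differ; the idea is that enlarging the base stretches the rotary wavelengths, so positions beyond the training range remain in-distribution.

```python
import torch

def rope_frequencies(dim: int, positions: torch.Tensor,
                     base: float = 10000.0,
                     scale: float = 1.0) -> torch.Tensor:
    """RoPE angle table with NTK-aware base rescaling. A `scale` > 1
    stretches the wavelengths so longer sequences (higher resolutions)
    stay within the position range seen during training."""
    ntk_base = base * scale ** (dim / (dim - 2))  # NTK-aware adjustment
    inv_freq = 1.0 / (ntk_base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)  # (num_pos, dim/2)

# E.g. a model trained on 32x32 latent grids, sampled at 64x64:
angles = rope_frequencies(dim=64, positions=torch.arange(64 * 64), scale=2.0)
```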

Future Outlook

While Lumina-T2X has paved the way for advanced multimodal content generation, the journey doesn't end here. Possible future developments include refining the model's ability to understand and execute more complex textual instructions, improving the quality of generated videos, and extending the framework to other emerging formats and modalities. Additionally, the ongoing release of code and checkpoints will further support the research community, encouraging more innovation and improvement in this exciting field.

Conclusion

Lumina-T2X represents a sophisticated blend of AI technologies designed to transform textual descriptions into a variety of digital formats. Its advanced capabilities in generating high-quality, resolution-flexible content across different modalities set a new benchmark in the field of generative AI. As we move forward, the potential applications of such technology continue to expand, promising to revolutionize the way digital content is created.
