Emergent Mind

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

Figure: example generations illustrating how training duration and model size affect PartiPrompts outcomes across varied step counts and model scales.

Overview

  • Rectified Flow (RF) models provide a new approach for generative tasks with a focus on making training and sampling more efficient, especially for high-resolution image generation.

  • This study introduces innovations in noise sampling for RF models, emphasizing perceptually relevant scales, which enhances the quality of text-to-image synthesis.

  • The research proposes a novel transformer-based architecture that effectively integrates text and image modalities for improved synthesis outcomes.

  • Extensive evaluations indicate that the new RF models surpass existing models in high-resolution text-to-image generation, setting new benchmarks and promising directions for future research.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Introduction to Rectified Flow Models

Rectified Flow (RF) models have recently emerged as a potent approach for generative tasks, distinguishing themselves with their conceptual elegance and promising theoretical properties. These models formulate the generative process as traversing a straight path from data to noise, which, in theory, should streamline training and enhance sampling efficiency. However, despite their potential, RF models have not fully realized widespread application and performance validation in large-scale settings, particularly within the realm of text-to-image synthesis. This paper addresses this gap by introducing novel techniques aimed at leveraging the full capabilities of RF models for high-resolution image generation tasks, in conjunction with cutting-edge architecture and data preprocessing methods.
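The straight-path formulation is simple enough to express in a few lines. The sketch below (a minimal, illustrative NumPy rendering of the standard RF training target; the function name and variable names are ours, not the paper's) shows the linear interpolant between data and noise and the constant velocity the network regresses:

```python
import numpy as np

def rf_training_example(x, rng):
    """One rectified-flow training example for a batch of data samples x.

    The forward path is a straight line between data and noise:
        z_t = (1 - t) * x + t * eps,
    and the network is trained to regress the constant velocity
    v = eps - x that carries z_t from data (t = 0) to noise (t = 1).
    """
    eps = rng.standard_normal(x.shape)       # Gaussian noise endpoint
    t = rng.uniform(size=(x.shape[0], 1))    # one timestep per sample
    z_t = (1.0 - t) * x + t * eps            # straight-line interpolant
    v_target = eps - x                       # velocity the model must predict
    return z_t, t, v_target
```

Because the velocity target is constant along each path, a perfectly trained model could in principle map noise to data in very few integration steps, which is the efficiency argument behind RF.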

Enhanced Noise Sampling in RF Models

The study innovates in the domain of noise sampling for RF models by introducing a bias towards perceptually relevant scales. Through extensive experimentation, it is demonstrated that this re-weighted approach significantly outperforms traditional diffusion model formulations in the context of text-to-image synthesis. By optimizing noise sampling, the work showcases superior performance in generating high-fidelity images, marking a step forward in the practical application of RF models.
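One concrete way to bias timestep sampling toward intermediate, perceptually important noise levels is a logit-normal density over t, one of the re-weighted samplers studied in the paper. The sketch below assumes this variant; the parameter names `loc` and `scale` are ours:

```python
import numpy as np

def sample_timesteps_logit_normal(n, loc=0.0, scale=1.0, rng=None):
    """Sample n training timesteps t in (0, 1) from a logit-normal density.

    Compared to uniform sampling, this concentrates probability mass at
    intermediate noise levels, where the denoising task is hardest and
    most perceptually relevant, and rarely samples the near-clean or
    near-pure-noise extremes.
    """
    rng = rng or np.random.default_rng()
    u = rng.normal(loc, scale, size=n)    # Gaussian in logit space
    return 1.0 / (1.0 + np.exp(-u))       # sigmoid maps to (0, 1)
```

Shifting `loc` moves the emphasis toward noisier (`loc > 0`) or cleaner (`loc < 0`) timesteps, which is the knob such a scheme tunes empirically.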

Novel Architectural Contributions

A novel architectural contribution of this research is the development of a transformer-based model that integrates separate weight streams for text and image modalities. This architecture facilitates a bidirectional exchange of information between text and imagery, enhancing the model's understanding and rendering of textual descriptions into images. The architecture's design allows for a predictable scaling behavior, correlating directly with improvements in text-to-image synthesis quality as assessed through a variety of metrics and human evaluations.
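The core idea of the dual-stream design can be sketched as joint attention over the concatenated token sequences: each modality keeps its own projection weights, but queries from one modality attend to keys and values from both. This is a simplified single-head NumPy illustration of that mechanism (omitting normalization, multi-head splitting, and conditioning; names are ours):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(txt, img, W_txt, W_img):
    """Joint attention over concatenated text and image tokens.

    Each modality has its own Q/K/V projection weights (separate weight
    streams), but attention runs over the concatenated sequence, so
    information flows in both directions between the modalities.
    """
    q = np.concatenate([txt @ W_txt["q"], img @ W_img["q"]])
    k = np.concatenate([txt @ W_txt["k"], img @ W_img["k"]])
    v = np.concatenate([txt @ W_txt["v"], img @ W_img["v"]])
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))     # every token attends to all tokens
    out = attn @ v
    n_txt = txt.shape[0]
    return out[:n_txt], out[n_txt:]          # split back into per-modality streams
```

Keeping separate weights lets each stream specialize for its modality's statistics, while the shared attention step is what enables the bidirectional text-image exchange the summary describes.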

Large-Scale Evaluation and Findings

In a comprehensive study, the performance of the proposed methods is extensively evaluated against state-of-the-art models. The findings indicate that the new RF models set new benchmarks in high-resolution text-to-image generation, outperforming existing models in quantitative evaluations and human preference ratings. The research provides a systematic exploration of different diffusion model and RF formulations, identifying the most effective strategies for text-to-image synthesis.

Moreover, the work explores simulation-free training methodologies for RF models, presenting practical and reliable objectives. It also addresses the challenge of building a generative model that operates efficiently across varying resolutions and aspect ratios, proposing an adaptable approach to positional encoding and resolution-dependent timestep adjustments.
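The resolution-dependent timestep adjustment can be illustrated with the shifting rule used for this purpose, where a schedule tuned at a base resolution is remapped for images with more tokens. The sketch below assumes a shift factor of sqrt(n_target / n_base); function and parameter names are ours:

```python
import math

def shift_timestep(t, n_base, n_target):
    """Remap a timestep t in [0, 1] from a base resolution to a target one.

    With alpha = sqrt(n_target / n_base), higher-resolution images
    (more tokens, n_target > n_base) get their schedule pushed toward
    higher noise levels, since more pixels must be corrupted before the
    image's content is destroyed. alpha = 1 leaves t unchanged, and the
    endpoints t = 0 and t = 1 are preserved for any alpha.
    """
    alpha = math.sqrt(n_target / n_base)
    return alpha * t / (1.0 + (alpha - 1.0) * t)
```

For example, going from a 1024-token to a 4096-token latent grid (alpha = 2) maps t = 0.5 to roughly 0.67, sampling noticeably noisier intermediate states.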

Implications and Future Prospects

This research holds significant implications for the advancement of generative models, reinforcing the viability of RF models for complex, high-dimensional tasks like text-to-image synthesis. By pushing the boundaries of RF model performance and scalability, the paper sets a foundation for future explorations that could further unlock the potential of these models.

The exploration of model scaling opens new avenues for generating images and videos with increasing fidelity and complexity, suggesting that further scaling and methodological refinements could yield even more impressive outcomes. Additionally, the flexible use of text encoders offers practical insights into managing computational resources while maintaining high performance, a critical consideration for deploying AI models at scale.

In conclusion, this study not only advances our understanding of RF models and their application to text-to-image synthesis but also prompts a reevaluation of current generative model benchmarks. By addressing both theoretical and practical challenges, the research paves the way for future developments in AI-driven, high-resolution image synthesis.
