Abstract

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a novel training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster

Figure: Overview of the RPG framework for converting text descriptions into images.

Overview

  • The paper presents RPG, a framework using MLLMs for improved text-to-image generation and editing.

  • RPG decomposes complex prompts into sub-tasks, enabling detailed compositional image generation and editing.

  • The RPG framework enables semantic alignment, precise editing, and iterative refinement in text-to-image tasks without additional training.

  • Extensive experiments show RPG's superiority over models like DALL-E 3 and SDXL in prompt-image content alignment.

  • Future work will extend RPG to more complex modalities, underscoring its potential in creative and design applications.

Introduction

The paper introduces a new text-to-image generation and editing framework called Recaption, Plan and Generate (RPG), which leverages the chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of diffusion models. The RPG framework employs the MLLM as a global planner that decomposes complex image generation tasks into simpler sub-tasks, each linked to a distinct subregion of the image. It introduces complementary regional diffusion to enable region-wise compositional image generation and proposes a unified, closed-loop approach for both image generation and image editing. Experiments show that RPG outperforms established models such as DALL-E 3 and SDXL, particularly in handling complex prompts with multiple object categories and in text-image semantic alignment.

Methodology Overview

The RPG framework operates without additional training, employing a three-stage strategy: Multimodal Recaptioning, Chain-of-Thought (CoT) Planning, and Complementary Regional Diffusion. The MLLM first recaptions the text prompt into more descriptive subprompts, improving detail and semantic alignment during the diffusion process. CoT planning then allocates the subprompts to complementary regions of the image, treating the complex generation task as a collection of simpler regional ones. Finally, complementary regional diffusion performs regional generation and spatial merging, sidestepping content conflicts in overlapping image components.
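The Python sketch below illustrates the shape of this three-stage pipeline under stated assumptions. The functions `query_mllm` and `denoise_region` are hypothetical stand-ins for the MLLM recaptioning/planning call and a per-region diffusion denoiser; the single-channel latents and the weighted resize-and-concatenate merge are deliberate simplifications of the paper's procedure, shown only to make the control flow concrete.

```python
from typing import Callable, List, Tuple
import numpy as np


def rpg_generate(
    prompt: str,
    query_mllm: Callable[[str], dict],                        # hypothetical: returns {"subprompts": [...], "regions": [...]}
    denoise_region: Callable[[np.ndarray, str], np.ndarray],  # hypothetical per-region diffusion denoiser
    latent_hw: Tuple[int, int] = (64, 64),
    base_weight: float = 0.3,
) -> np.ndarray:
    """Sketch of RPG's recaption -> plan -> complementary regional diffusion flow."""
    # 1) Multimodal recaptioning + CoT planning: the MLLM splits the complex prompt
    #    into descriptive subprompts and assigns each one a horizontal region ratio.
    plan = query_mllm(prompt)
    subprompts: List[str] = plan["subprompts"]   # e.g. ["a red hat on the left", "a green scarf on the right"]
    ratios: List[float] = plan["regions"]        # e.g. [0.5, 0.5], summing to 1

    h, w = latent_hw
    widths = [int(round(r * w)) for r in ratios]
    widths[-1] = w - sum(widths[:-1])            # make the columns cover the full latent width

    # 2) Regional generation: denoise each subregion with its own subprompt.
    regions = []
    for sub, rw in zip(subprompts, widths):
        region_latent = np.random.randn(h, rw)   # simplified single-channel latent
        regions.append(denoise_region(region_latent, sub))

    # 3) Spatial merging: concatenate the regional latents and blend with a globally
    #    denoised latent so the composition stays coherent across region borders.
    composed = np.concatenate(regions, axis=1)
    global_latent = denoise_region(np.random.randn(h, w), prompt)
    return base_weight * global_latent + (1.0 - base_weight) * composed
```

In practice the regional denoising would run inside the sampling loop of a latent diffusion model rather than on standalone latents, but the division of labor, MLLM for recaptioning and planning, diffusion for per-region synthesis and merging, follows the structure described above.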

Compositional Generation and Editing

The RPG framework demonstrates versatility in handling both generation and editing tasks. For editing, it employs MLLMs to provide feedback identifying semantic discrepancies between generated images and target prompts, leverages CoT planning to delineate editing instructions, and utilizes contour-based diffusion for precise region modification. The framework showcases an ability to refine the generation process iteratively through a closed-loop implementation that incorporates feedback from earlier rounds of editing.
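A minimal sketch of this closed-loop editing procedure is shown below, assuming two hypothetical callables: `check_alignment`, standing in for the MLLM feedback step that reports semantic discrepancies and region-wise edit instructions, and `apply_regional_edit`, standing in for the contour-guided regional diffusion edit. It is an illustration of the loop structure, not the paper's exact implementation.

```python
from typing import Callable
import numpy as np


def rpg_edit_loop(
    image: np.ndarray,
    target_prompt: str,
    check_alignment: Callable[[np.ndarray, str], dict],            # hypothetical MLLM feedback call
    apply_regional_edit: Callable[[np.ndarray, dict], np.ndarray], # hypothetical contour-guided regional edit
    max_rounds: int = 3,
) -> np.ndarray:
    """Sketch of RPG-style closed-loop editing: check, plan, edit, repeat."""
    for _ in range(max_rounds):
        # MLLM feedback: compare the current image with the target prompt and
        # return the remaining discrepancies plus region-wise edit instructions.
        feedback = check_alignment(image, target_prompt)
        if feedback.get("aligned", False):
            break  # stop once the MLLM reports no remaining discrepancies
        # Contour-based regional diffusion applies the planned edits only inside
        # the affected regions, leaving the rest of the image untouched.
        image = apply_regional_edit(image, feedback)
    return image
```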

Experiments and Findings

RPG is evaluated extensively: qualitative figures and quantitative benchmarks, including T2I-CompBench, show that the framework aligns generated image content with complex textual prompts more faithfully than competing models. RPG also adapts to different MLLM architectures and diffusion backbones, demonstrating its flexibility and potential for wide application. In image editing comparisons, RPG produces more precise and semantically aligned edits than other state-of-the-art methods, and iterative refinement further improves alignment.

Conclusion and Outlook

The RPG framework sets a new bar in handling complex and compositional text-to-image tasks, effectively leveraging the reasoning capabilities of MLLMs to plan image compositions for diffusion models. It presents a training-free, versatile approach and is compatible with various architecture types. Future research will aim at expanding the RPG framework to accommodate even more complex modalities and apply it to a broader spectrum of practical scenarios, solidifying text-to-image generation's position as a key technology in creative and design applications.
