Abstract

Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlap, and attribute bindings. In this paper, we develop a training-free Multimodal-LLM agent (MuLan) to address these challenges through progressive multi-object generation with planning and feedback control, like a human painter. MuLan harnesses an LLM to decompose a prompt into a sequence of sub-tasks, each generating only one object by stable diffusion, conditioned on previously generated objects. Unlike existing LLM-grounded methods, MuLan only produces a high-level plan at the beginning, while the exact size and location of each object are determined by an LLM and attention guidance at each sub-task. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model at every step of MuLan only needs to address an easy sub-task it is specialized for. We collect 200 prompts containing multiple objects with spatial relationships and attribute bindings from different benchmarks to evaluate MuLan. The results demonstrate MuLan's superiority over baselines in generating multiple objects. The code is available at https://github.com/measure-infinity/mulan-code.

Overview

  • MuLan is introduced as a training-free Multimodal-LLM Agent for progressive multi-object generation, addressing limitations in existing text-to-image synthesis models.

  • The framework employs LLMs for task decomposition and vision-language models for iterative feedback control, enabling precise object positioning and attribute adherence.

  • MuLan outperforms baseline models in generating complex images with multiple objects, showcasing significant improvements in object completeness, attribute accuracy, and spatial relationship fidelity.

  • MuLan represents a shift towards more nuanced T2I generation, combining the strengths of LLMs and VLMs, despite potential challenges related to computational efficiency and the ethics of AI-generated content.

Progressive Multi-Object Generation with a Multimodal Large Language Model (MuLan)

Introduction

The development and refinement of diffusion models have been a cornerstone of progress in generative AI, particularly within the domain of text-to-image (T2I) synthesis. Despite notable achievements, existing state-of-the-art models such as Stable Diffusion and DALL-E struggle with generating images from prompts involving intricate object relations—be it spatial positioning, relative sizes, or attribute consistency. To bridge this gap, we introduce MuLan, a training-free, Multimodal-LLM Agent geared towards progressive multi-object generation, leveraging LLMs for task decomposition and vision-language models (VLMs) for iterative feedback control.

Related Work

The emergence of diffusion models has catalyzed breakthroughs in T2I generation, with models like Stable Diffusion XL showcasing near-commercial-grade performance. However, their limitations become evident when generating complex images with multiple objects. Previous efforts to improve T2I controllability have produced approaches that use LLMs for layout generation and optimization, but these techniques often fall short on spatial reasoning and layout precision.

The MuLan Framework

MuLan addresses the aforementioned limitations by employing a sequential generation strategy, akin to how a human artist might approach a complex drawing. The process begins with an LLM decomposing a given prompt into manageable object-centric sub-tasks, guiding the generation of one object at a time while considering previously generated content. Each object's generation benefits from attention-guided diffusion, ensuring accurate positioning and attribute adherence. Critically, MuLan introduces a VLM-based feedback loop to correct any deviations from the initial prompt during the generative process. This innovative architecture allows for precise control over the composition of multiple objects, a notable advancement over existing methods.
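
To make the workflow concrete, the sketch below outlines the plan-generate-verify loop in Python. The callables llm_plan, propose_region, generate_object, and vlm_check, along with their signatures, are assumptions made for illustration rather than the paper's actual API; the official repository linked in the abstract contains the real implementation.

# A minimal sketch of MuLan's plan -> generate -> verify loop.
# The callables `llm_plan`, `propose_region`, `generate_object`, and
# `vlm_check` are illustrative stand-ins, not the paper's actual API.

def mulan_generate(prompt, llm_plan, propose_region, generate_object,
                   vlm_check, max_retries=3):
    """Compose an image progressively, one object per sub-task.

    llm_plan(prompt)                     -> list of object descriptions
    propose_region(desc, image)          -> normalized (x0, y0, x1, y1) box
    generate_object(desc, region, image) -> image with the object added
                                            (attention-guided diffusion)
    vlm_check(desc, image)               -> (ok: bool, feedback: str)
    """
    # High-level planning: decompose the prompt into one sub-task per object.
    sub_tasks = llm_plan(prompt)

    image = None  # start from an empty canvas
    for desc in sub_tasks:
        # Decide this object's exact size and location only now,
        # conditioned on everything generated so far.
        region = propose_region(desc, image)
        condition = desc

        # Generate, then let the VLM verify; regenerate with the VLM's
        # feedback folded into the condition if the result violates the prompt.
        for _ in range(max_retries):
            candidate = generate_object(condition, region, image)
            ok, feedback = vlm_check(desc, candidate)
            if ok:
                break
            condition = f"{desc}. Correction: {feedback}"
        image = candidate  # keep the last candidate even if retries run out
    return image

The design choice reflected here is late binding of layout: rather than fixing every bounding box upfront, each object's region is proposed only after the previous objects exist, which is what allows the VLM feedback loop to influence the evolving composition.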

Experimental Validation

To assess MuLan's efficacy, we compiled a test suite of 200 complex prompts from various benchmarks, analyzing performance across dimensions such as object completeness, attribute binding accuracy, and spatial relationship fidelity. Our findings demonstrate that MuLan significantly outperforms baseline models in these areas, as indicated by both quantitative results and human evaluations. This success underscores the potential of MuLan to redefine the standards for T2I generation, especially in scenarios demanding high degrees of compositional control.
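
As an illustration of how such per-dimension checks could be scripted, the snippet below scores one generated image against its prompt's objects, attribute bindings, and spatial relations via yes/no VLM queries. The ask_vlm helper and the question templates are assumptions for the sketch, not the paper's exact evaluation protocol.

# Hypothetical per-dimension scoring for one image; `ask_vlm(image, q) -> bool`
# is an assumed helper, and the question templates are illustrative only.
def score_image(image, objects, attributes, relations, ask_vlm):
    """objects:    ["apple", ...]
    attributes: [("apple", "red"), ...]
    relations:  ["the apple is to the left of the cup", ...]"""
    completeness = sum(ask_vlm(image, f"Is there a {o} in the image?")
                       for o in objects) / len(objects)
    binding = (sum(ask_vlm(image, f"Is the {o} {a}?") for o, a in attributes)
               / max(len(attributes), 1))
    spatial = (sum(ask_vlm(image, f"Is it true that {r}?") for r in relations)
               / max(len(relations), 1))
    return {"completeness": completeness,
            "attribute_binding": binding,
            "spatial": spatial}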

Discussion and Future Directions

The introduction of MuLan represents a pivotal shift towards a more nuanced and capable form of T2I generation. By meticulously combining the strengths of LLMs and VLMs, MuLan not only surmounts the challenges posed by complex prompts but also showcases the untapped potential of multimodal AI collaboration. Looking forward, our work lays the foundational groundwork for further explorations into the synergistic integration of language and visual models, heralding a new era of generative AI that is both more creative and more controlled.

Limitations and Ethical Considerations

While MuLan advances the field of generative AI, its reliance on sequential, per-object generation for complex scenes introduces higher computational demands, potentially impacting scalability and efficiency. Additionally, its dependence on an LLM for prompt decomposition makes it vulnerable to planning errors when the LLM misunderstands or mishandles a complex prompt. As with all AI research, it is imperative to remain vigilant about ethical implications, especially concerning the generation of misleading or harmful content. Continuous scrutiny and refinement of models like MuLan are essential to ensure their benefits are realized without unintended negative consequences.

In conclusion, MuLan's ability to navigate the challenges of multi-object T2I generation, backed by empirical validation, not only enhances our understanding of the field but also paves the way for more sophisticated and reliable generative models. Recognizing its potential and limitations will be pivotal in driving future AI research and applications toward more beneficial and ethical outcomes.
