L-MAGIC: Language Model Assisted Generation of Images with Coherence

Published 3 Jun 2024 in cs.CV | (2406.01843v1)

Abstract: In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging LLMs for guidance while diffusing multiple coherent views of 360 degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and LLMs without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with >70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at https://github.com/IntelLabs/MMPano. The video presentation is available at https://youtu.be/XDMNEzH4-Ec?list=PLG9Zyvu7iBa0-a7ccNLO8LjcVRAoMn57s.


Summary

The paper introduces L-MAGIC, an innovative framework that uses large language models to guide diffusion for coherent panoramic scene generation.
It employs iterative warping-and-inpainting with both positive and negative prompts to prevent object duplication and ensure seamless scene extension.
L-MAGIC outperforms state-of-the-art methods, achieving over 70% human preference and superior Inception Scores in image-to-panorama and text-to-panorama tasks.

L-MAGIC: Enhanced Panoramic Scene Generation With LLM Guidance

The paper "L-MAGIC: Language Model Assisted Generation of Images with Coherence" introduces a method for generating 360-degree panoramic scenes from a single input image. Producing coherent, realistic panoramas remains an open problem in computer vision, and it matters for applications such as architectural design, movie scene creation, and virtual reality.

Methodological Contributions

The method, L-MAGIC, uses pre-trained language and vision-language models, such as ChatGPT and BLIP-2, to guide diffusion-based multi-view image generation. These models supply the global scene layout priors that diffusion models lack, so local scene content can be extended to a full 360-degree panorama without fine-tuning any of the models. This addresses two weaknesses of earlier methods: objects duplicated across views, and the need for a human to write a text prompt for each view.

The pipeline is built on iterative warping-and-inpainting: existing content is warped to the next camera pose, and a diffusion model such as Stable Diffusion v2 inpaints the missing region. Prompts derived from the LLM's output steer each inpainting step, and L-MAGIC passes both positive and negative prompts to the diffusion model so that objects are not duplicated across views. To raise output quality and resolution, the paper also applies super-resolution and smoothing strategies for blending multiple views.
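As a rough illustration of how per-view positive and negative prompts might be organized, the sketch below builds one prompt pair per camera yaw. It is not the paper's implementation: the names `ViewPrompt` and `build_view_prompts`, the evenly spaced yaw schedule, and the rule that the negative prompt applies to every view except the input view are all assumptions made for this example. The view descriptions stand in for text an LLM would produce.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ViewPrompt:
    yaw_deg: float  # camera rotation for this view, in degrees
    positive: str   # content the diffusion model should draw
    negative: str   # content the diffusion model should avoid


def build_view_prompts(view_descriptions: List[str],
                       no_repeat_objects: List[str]) -> List[ViewPrompt]:
    """Pair each view description with a negative prompt (hypothetical helper).

    Assumption: views are evenly spaced around 360 degrees, and view 0 is
    the input view, which already contains the objects that must not repeat.
    """
    if not view_descriptions:
        raise ValueError("need at least one view description")
    step = 360.0 / len(view_descriptions)
    negative = ", ".join(no_repeat_objects)
    prompts = []
    for index, description in enumerate(view_descriptions):
        prompts.append(ViewPrompt(
            yaw_deg=index * step,
            positive=description,
            # The input view keeps its objects; later views must not copy them.
            negative="" if index == 0 else negative,
        ))
    return prompts


# Example descriptions; in practice these would come from the LLM.
plan = build_view_prompts(
    ["a bed by the wall", "a window with curtains", "a wardrobe", "a door"],
    ["bed"],
)
for view in plan:
    print(view)
```

Each `ViewPrompt` could then be handed to an inpainting call that accepts a prompt and a negative prompt, one view at a time.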

Experimental Evaluation

The paper evaluates L-MAGIC against state-of-the-art methods on both image-to-panorama and text-to-panorama tasks. In human evaluations, L-MAGIC's scenes were preferred more than 70% of the time over baselines such as Text2Room and MVDiffusion, indicating better output quality and scene layout coherence. The Inception Score results point the same way, with L-MAGIC consistently outperforming the baselines.

Implications and Future Directions

From an application standpoint, L-MAGIC has practical uses in virtual reality and design simulation. Combined with conditional diffusion models, it accepts other input modalities, including text, sketches, and depth maps. Applying depth estimation to the generated panorama further enables 3D point clouds and immersive scene fly-throughs. Open directions include fine-grained control over scene elements and extension to dynamic scenes, which could benefit interactive applications.
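To make the step from a panorama plus depth to a 3D point cloud concrete, here is a minimal sketch that back-projects an equirectangular depth map onto 3D points. It assumes a particular convention that the source does not specify: columns span longitude from -180 to 180 degrees, rows span latitude from +90 to -90 degrees, and each depth value is a radial distance from the camera center. The function name `panorama_to_point_cloud` is hypothetical.

```python
import math
from typing import List, Tuple


def panorama_to_point_cloud(
        depth: List[List[float]]) -> List[Tuple[float, float, float]]:
    """Back-project an H x W equirectangular depth map to 3D points.

    Assumed convention: y points up, z points toward the panorama center,
    and depth[row][col] is the distance along the viewing ray.
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for row in range(height):
        # Latitude at the pixel center: +pi/2 at the top, -pi/2 at the bottom.
        lat = math.pi / 2 - (row + 0.5) / height * math.pi
        for col in range(width):
            # Longitude at the pixel center: -pi at the left, +pi at the right.
            lon = (col + 0.5) / width * 2 * math.pi - math.pi
            d = depth[row][col]
            x = d * math.cos(lat) * math.sin(lon)
            y = d * math.sin(lat)
            z = d * math.cos(lat) * math.cos(lon)
            points.append((x, y, z))
    return points


# A constant-depth panorama maps to points on a sphere of that radius.
cloud = panorama_to_point_cloud([[2.0] * 8 for _ in range(4)])
print(len(cloud))
```

With per-pixel colors attached to each point, such a cloud could be rendered from new camera positions, which is the basis for the fly-through use case mentioned above.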

In conclusion, L-MAGIC demonstrates the power of integrating LLMs into multi-view image generation workflows, leading to innovative solutions for long-standing challenges in computer vision. Future research could benefit from further refinement of scene layout mechanisms and exploration into the automation of layout encoding, thereby enhancing both the realism and applicability of AI-generated environments.
