UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion (2401.13388v3)
Abstract: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, and demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal LLM (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We use a two-stage training strategy to train the framework effectively: first, pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct the multimodal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective at generating high-fidelity images from complex multimodal prompts involving multiple image entities.
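The two-component design described in the abstract can be summarized with a minimal sketch: an encoder maps an interleaved text/image prompt to one conditioning sequence, and a denoising network predicts the noise added to an image latent while cross-attending to that sequence, trained with the standard noise-prediction objective. All names, layer sizes, the toy encoder/denoiser, and the noise schedule below are illustrative assumptions for exposition, not the actual UNIMO-G implementation or its MLLM.

```python
# Minimal sketch (assumed, not the paper's code) of multimodal-conditional diffusion:
# a toy "MLLM-like" encoder fuses text tokens and reference-image patches into one
# conditioning sequence; a toy denoiser predicts noise from a noisy latent, a timestep,
# and that sequence via cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMultimodalEncoder(nn.Module):
    """Stand-in for the MLLM: embeds interleaved text tokens and image patches
    into a single conditioning sequence (hypothetical sizes)."""
    def __init__(self, vocab=1000, dim=256, patch=8):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.img_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

    def forward(self, text_ids, ref_image):
        txt = self.text_emb(text_ids)                               # (B, T, D)
        img = self.img_proj(ref_image).flatten(2).transpose(1, 2)   # (B, P, D)
        return self.mixer(torch.cat([txt, img], dim=1))             # (B, T+P, D)


class ToyConditionalDenoiser(nn.Module):
    """Stand-in for the denoising network: predicts the added noise, conditioned
    on the multimodal sequence through cross-attention."""
    def __init__(self, latent_dim=4, dim=256):
        super().__init__()
        self.in_proj = nn.Conv2d(latent_dim, dim, 3, padding=1)
        self.t_emb = nn.Embedding(1000, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out_proj = nn.Conv2d(dim, latent_dim, 3, padding=1)

    def forward(self, noisy_latent, t, cond):
        h = self.in_proj(noisy_latent)                               # (B, D, H, W)
        b, d, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2) + self.t_emb(t)[:, None, :]
        attended, _ = self.cross_attn(tokens, cond, cond)            # condition on prompt
        h = (tokens + attended).transpose(1, 2).reshape(b, d, hh, ww)
        return self.out_proj(h)


# One denoising-diffusion training step with the usual noise-prediction (MSE) loss.
encoder, denoiser = ToyMultimodalEncoder(), ToyConditionalDenoiser()
text_ids = torch.randint(0, 1000, (2, 12))            # toy text part of the prompt
ref_image = torch.randn(2, 3, 64, 64)                 # toy reference image entity
latent = torch.randn(2, 4, 16, 16)                    # toy image latent
t = torch.randint(0, 1000, (2,))
alpha_bar = torch.linspace(0.999, 0.01, 1000)[t].view(-1, 1, 1, 1)  # toy schedule
noise = torch.randn_like(latent)
noisy = alpha_bar.sqrt() * latent + (1 - alpha_bar).sqrt() * noise
cond = encoder(text_ids, ref_image)
loss = F.mse_loss(denoiser(noisy, t, cond), noise)
loss.backward()
print(float(loss))
```

The point the sketch captures is that text tokens and reference-image features share one conditioning sequence, so text-driven and subject-driven generation flow through the same cross-attention path in the denoiser.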