
Abstract

Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multimodal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.

Figure: Overview of UNIMO-G, which generates images from multimodal prompts; trainable and frozen modules are highlighted in different colors.

Overview

  • UNIMO-G is a novel multimodal conditional diffusion framework that uses both textual and visual inputs for image generation.

  • It integrates a 7-billion-parameter Multimodal Large Language Model (MLLM) with a conditional denoising diffusion network to create text-aligned and visually intricate images (a minimal inference sketch follows this list).

  • The model undergoes two-stage fine-tuning: text-image pair pre-training and multimodal instruction tuning for complex prompts.

  • UNIMO-G's pre-training employs a Latent Diffusion-based technique, using perceptual compression for a computationally efficient diffusion process.

  • The model demonstrates superior performance in generating images from complex prompts on the MS-COCO and MultiBench datasets, outpacing other models.
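
As a rough mental model of how such a framework could run at inference time, the sketch below shows a frozen MLLM encoding an interleaved text-and-image prompt into embeddings, and a conditional denoising network iteratively refining a noisy latent under that conditioning. All class and method names (the pipeline class, `mllm.encode`, `scheduler.step`, and so on) are illustrative assumptions for this summary, not UNIMO-G's actual API.

    import torch

    class UnimoGStylePipeline:
        """Hypothetical pipeline wiring, not UNIMO-G's released code."""

        def __init__(self, mllm, unet, vae_decoder, scheduler):
            self.mllm = mllm                # frozen multimodal LLM encoder
            self.unet = unet                # conditional denoising network
            self.vae_decoder = vae_decoder  # maps latents back to pixel space
            self.scheduler = scheduler      # noise schedule (e.g. DDIM-style)

        @torch.no_grad()
        def generate(self, prompt_segments, steps=50, latent_shape=(1, 4, 64, 64)):
            # prompt_segments: interleaved list of text strings and image tensors.
            # 1. Encode the interleaved multimodal prompt into conditioning embeddings.
            cond = self.mllm.encode(prompt_segments)          # (1, seq_len, dim)
            # 2. Start from Gaussian noise and iteratively denoise, conditioning each
            #    step on the multimodal embeddings via cross-attention inside the UNet.
            latents = torch.randn(latent_shape)
            for t in self.scheduler.timesteps(steps):
                noise_pred = self.unet(latents, t, encoder_hidden_states=cond)
                latents = self.scheduler.step(noise_pred, t, latents)
            # 3. Decode the final latent into an image.
            return self.vae_decoder(latents)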

Introduction

Advancements in text-to-image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have made it possible to translate textual prompts into images, yet these models encounter limitations when detailed, context-specific renditions are required. UNIMO-G, a multimodal conditional diffusion framework, addresses this by leveraging both textual and visual inputs within prompts for image generation. It integrates a Multimodal Large Language Model (MLLM) with a conditional denoising diffusion network to produce images that are not only text-aligned but also visually intricate and contextually specific.

Architecture and Training

UNIMO-G builds on an in-house Chinese MLLM, similar in design to MiniGPT-4, with 7 billion parameters, pre-trained on a large corpus of text-image pairs. The framework is trained in two stages: first, text-image pair pre-training to acquire conditional image generation capability, and then multimodal instruction tuning on multimodal prompts. The second stage is pivotal for translating complex prompts containing multiple entities into high-fidelity images. Notably, the MLLM parameters remain frozen throughout training, preserving the model's perception capabilities.
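
A minimal sketch of this setup in PyTorch follows; the module names `mllm` and `unet` are illustrative stand-ins rather than UNIMO-G's actual code. The point is simply that the MLLM encoder is frozen in both stages while only the denoising network is optimized.

    import torch

    def build_optimizer(mllm: torch.nn.Module, unet: torch.nn.Module, lr: float = 1e-4):
        # Freeze the MLLM so its multimodal perception ability is preserved.
        for p in mllm.parameters():
            p.requires_grad_(False)
        mllm.eval()
        # Only the conditional diffusion network receives gradient updates.
        trainable = [p for p in unet.parameters() if p.requires_grad]
        return torch.optim.AdamW(trainable, lr=lr)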

Text-to-Image Pre-training

UNIMO-G's pre-training employs a technique akin to Latent Diffusion, relying on a perceptual compression framework that encodes images into latents for a computationally efficient diffusion process. Pre-training uses a large corpus of Chinese text-image pairs under an incremental regimen: foundational training begins on a smaller corpus, continues on a large-scale, diverse dataset, and concludes with refinement on a highly curated selection.
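
The sketch below shows a generic latent-diffusion training step consistent with this description: images are compressed into latents, noise is added at a random timestep, and the network is trained to predict that noise. The `vae`, `text_encoder`, `unet`, and `scheduler` interfaces are assumptions for illustration, not UNIMO-G internals.

    import torch
    import torch.nn.functional as F

    def latent_diffusion_step(vae, text_encoder, unet, scheduler, images, prompts):
        # 1. Perceptual compression: map pixels into a lower-dimensional latent space.
        with torch.no_grad():
            latents = vae.encode(images)        # (B, C, h, w) with h, w much smaller than the image
            cond = text_encoder(prompts)        # (B, seq_len, dim) conditioning embeddings
        # 2. Sample timesteps and add the matching amount of Gaussian noise.
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.num_train_timesteps, (latents.size(0),), device=latents.device)
        noisy_latents = scheduler.add_noise(latents, noise, t)
        # 3. Predict the added noise from the noisy latent and regress it with MSE.
        noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond)
        return F.mse_loss(noise_pred, noise)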

Multimodal Instruction Tuning

This phase adapts UNIMO-G to render images from multimodal inputs. An enhanced cross-attention mechanism allows the model to exploit the visual features of the multimodal input, so that generated images accurately reflect the specific visual content in the prompts. A novel loss component further encourages the model to emphasize the image regions corresponding to the represented objects, enforcing visual congruence between the prompt's visual elements and the synthesized image.
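
One plausible form of such a region-emphasis objective is sketched below, under the assumption that a segmentation mask is available for each entity referenced in the prompt: attention mass that falls outside an entity's mask is penalized. The exact objective used by UNIMO-G may differ; this illustrates the idea only.

    import torch

    def region_attention_loss(attn_maps, masks, eps=1e-6):
        # attn_maps: (B, N, H, W) cross-attention maps, one per referenced entity.
        # masks:     (B, N, H, W) binary segmentation masks for the same entities.
        # Returns a scalar that shrinks as each entity's attention mass concentrates
        # inside its own masked region.
        attn = attn_maps / (attn_maps.sum(dim=(-2, -1), keepdim=True) + eps)
        inside = (attn * masks).sum(dim=(-2, -1))   # attention mass inside each mask
        return (1.0 - inside).mean()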

Evaluation and Results

UNIMO-G's proficiency extends to both text-to-image and zero-shot subject-driven image generation, as substantiated by evaluations on the MS-COCO and MultiBench datasets. Its performance is notably superior in scenarios involving subject-driven synthesis, proving its ability to process complex prompts involving multiple entities while maintaining both visual fidelity and textual relevance. Comparative human evaluation further underscores UNIMO-G's enhanced image quality against its contemporaries, spotlighting its strength in multi-entity scenarios.

Conclusion

In summary, UNIMO-G sets a benchmark in controlled image generation, adeptly managing the intricacies of multimodal prompts and demonstrating significant improvements over other models in fidelity and relevance. Pairing an MLLM for encoding with a diffusion network for generation paves the way for more sophisticated text-to-image models capable of handling nuanced, composite visual prompts.
