Abstract

Despite the recent advances in personalized text-to-image (P-T2I) generative models, it remains challenging to perform finetuning-free multi-subject-driven T2I in a resource-efficient manner. Predominantly, contemporary approaches, involving the training of Hypernetworks and Multimodal LLMs (MLLMs), require heavy computing resources that range from 600 to 12300 GPU hours of training. These subject-driven T2I methods hinge on Latent Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the latent space of these diffusion models significantly escalates resource demands, leading to inconsistent results and necessitating numerous iterations for a single desired image. In this paper, we present λ-ECLIPSE, an alternative prior-training strategy that works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models. λ-ECLIPSE leverages the image-text interleaved pre-training for fast and effective multi-subject-driven P-T2I. Through extensive experiments, we establish that λ-ECLIPSE surpasses existing baselines in composition alignment while preserving concept alignment performance, even with significantly lower resource utilization. λ-ECLIPSE performs multi-subject driven P-T2I with just 34M parameters and is trained on a mere 74 GPU hours. Additionally, λ-ECLIPSE demonstrates the unique ability to perform multi-concept interpolations.

Figure: Qualitative comparison of λ-ECLIPSE with state-of-the-art methods on single-subject T2I using concepts from the Dreambench dataset.

Overview

  • Introduces λ-ECLIPSE, a highly efficient model for personalized text-to-image generation that leverages the CLIP latent space for multi-concept image synthesis.

  • Cuts computational cost to just 34M parameters and 74 GPU hours of training while matching or surpassing benchmark performance.

  • Employs a novel pre-training strategy on image-text interleaved data to improve composition and concept alignment.

  • Generates realistic and diverse images, supporting single-subject, multi-subject, and edge-guided personalization.

The paper introduces λ-ECLIPSE, a groundbreaking approach to personalized text-to-image (P-T2I) generation that leverages the CLIP latent space for efficient, multi-concept image synthesis. Traditional P-T2I models encounter challenges such as high resource demands and inconsistent outputs due to their dependence on the latent space of diffusion models. λ-ECLIPSE bypasses these hurdles by operating within a pre-trained CLIP latent space, enabling single-subject, multi-subject, and edge-guided T2I personalization with substantially lower resource requirements.
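
To make this concrete, below is a minimal, hypothetical PyTorch sketch of how such an unCLIP-style prior could be wired: a small transformer maps an interleaved CLIP conditioning sequence to a single CLIP image embedding, which a frozen pre-trained diffusion image decoder (not shown) would then render into pixels. Module and variable names such as `LambdaPriorSketch` are illustrative assumptions, not the paper's implementation.

```python
# A minimal, hypothetical sketch of an unCLIP-style prior in PyTorch: a small
# transformer maps an interleaved CLIP conditioning sequence to one CLIP image
# embedding; a frozen diffusion decoder (not shown) would render pixels from it.
import torch
import torch.nn as nn

class LambdaPriorSketch(nn.Module):
    def __init__(self, dim=768, depth=8, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned read-out token
        self.proj = nn.Linear(dim, dim)

    def forward(self, interleaved_tokens):
        # interleaved_tokens: (B, L, dim) CLIP text features with some positions
        # replaced by CLIP image embeddings of the reference subjects.
        b = interleaved_tokens.size(0)
        seq = torch.cat([self.query.expand(b, -1, -1), interleaved_tokens], dim=1)
        out = self.encoder(seq)
        return self.proj(out[:, 0])  # (B, dim): predicted CLIP image embedding

prior = LambdaPriorSketch()
fake_tokens = torch.randn(2, 77, 768)  # stand-in for an interleaved conditioning sequence
image_embed = prior(fake_tokens)       # would be passed to a frozen diffusion decoder
print(image_embed.shape)               # torch.Size([2, 768])
```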

Key Innovations

Efficiency and Resource Reduction

One of the most significant contributions of λ-ECLIPSE is its efficiency. The model operates with just 34M parameters and requires only 74 GPU hours of training on 1.6M image-text pairs. This is a considerable reduction compared to existing methods, which often rely on billion-parameter models and anywhere from 600 to 12,300 GPU hours of training.

Enhanced Performance

Despite its efficiency, λ-ECLIPSE does not trade away quality: across extensive experiments, it outperforms existing baselines in composition alignment while maintaining competitive concept alignment.

Novel Training Approach

The paper outlines a novel pre-training strategy involving image-text interleaved data. By substituting text tokens with image embeddings, λ-ECLIPSE is trained to estimate image embeddings that harmonize with text semantics, encapsulating subject representations in the process. This method allows the model to generate images that not only adhere to textual prompts but also accurately represent subjects in varied contexts.
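
The sketch below illustrates how such an interleaved conditioning sequence might be assembled: per-token CLIP text features are copied, and the positions of subject words are overwritten with projected CLIP image embeddings of the reference images. The word-to-token matching rule, the projection layer, and all helper names are simplifying assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of assembling an image-text interleaved conditioning
# sequence: text-token features at subject-word positions are replaced with
# projected CLIP image embeddings of the reference subjects.
import torch

def interleave(text_tokens, token_strings, subject_images, subject_words, project):
    """
    text_tokens:    (L, D) per-token CLIP text features for the prompt
    token_strings:  list of L decoded token strings aligned with text_tokens
    subject_images: dict mapping subject word -> CLIP image embedding
    subject_words:  set of prompt words naming the personalized subjects
    project:        module mapping the image-embedding dim to D (assumption)
    """
    tokens = text_tokens.clone()
    for i, tok in enumerate(token_strings):
        word = tok.replace("</w>", "").lower()          # drop CLIP BPE end-of-word marker
        if word in subject_words:
            tokens[i] = project(subject_images[word])   # swap text token for image feature
    return tokens

# toy usage with random features (dimensions are illustrative)
D = 768
proj = torch.nn.Linear(512, D)
text_feats = torch.randn(6, D)
strings = ["a</w>", "photo</w>", "of</w>", "my</w>", "dog</w>", "hiking</w>"]
subjects = {"dog": torch.randn(512)}
with torch.no_grad():
    cond = interleave(text_feats, strings, subjects, {"dog"}, proj)
print(cond.shape)  # torch.Size([6, 768])
```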

Edge-Guided Personalization

An extension of λ-ECLIPSE incorporates Canny edge maps as auxiliary guides, further enhancing its control over subject-driven T2I generation. This addition enables more precise structural control, allowing the model to generate images that respect the provided edge information while still adhering to the text prompt.
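
As a rough illustration, a Canny edge map can be extracted with OpenCV and treated as an additional conditioning image. How λ-ECLIPSE actually injects this signal into the prior is not reproduced here; the snippet only covers the edge-extraction step, and the thresholds are typical defaults rather than values from the paper.

```python
# Sketch of deriving a Canny edge map as an auxiliary guide with OpenCV.
import cv2
import numpy as np

def edge_guide(image_path, low=100, high=200):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, low, high)        # binary edge map, same H x W as the input
    return np.stack([edges] * 3, axis=-1)    # 3-channel map, ready for an image encoder

# guide = edge_guide("subject.jpg")  # could then be embedded and appended to
#                                    # the interleaved conditioning sequence
```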

Analyzing the Results

The paper presents both qualitative and quantitative results to underscore λ-ECLIPSE's effectiveness. Quantitative analyses on the Dreambench and ConceptBed benchmarks show the model's strong performance in concept and composition alignment. Qualitative comparisons further reveal its ability to generate highly realistic and diverse images that closely adhere to the provided text prompts.
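
Composition alignment of this kind is commonly measured with a CLIP-based text-image similarity (often called CLIP-T). The following generic sketch, using the Hugging Face `transformers` CLIP model, shows how such a score could be computed; it is not the benchmarks' official implementation.

```python
# Generic sketch of a CLIP-based text-image similarity score (CLIP-T style):
# cosine similarity between the CLIP embeddings of the prompt and the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_t(image_path, prompt):
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()   # cosine similarity in [-1, 1]

# score = clip_t("generated.png", "a dog wearing sunglasses on a beach")
```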

Additionally, λ-ECLIPSE achieves remarkable results in multi-subject interpolation. By leveraging the smooth latent space inherited from CLIP, the model is capable of creating seamless transitions between disparate concepts, a feature that amplifies its utility for personalized T2I tasks.
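
A simple way to picture this is spherical interpolation (slerp) between two concept embeddings in CLIP space, with each intermediate embedding decoded by the frozen diffusion decoder. The sketch below shows only the interpolation step and uses random tensors as stand-ins for real embeddings; it illustrates the idea, not the paper's exact procedure.

```python
# Sketch of multi-concept interpolation via spherical interpolation (slerp)
# between two CLIP embeddings; each intermediate embedding would be decoded
# by the frozen diffusion decoder.
import torch

def slerp(a, b, t, eps=1e-7):
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.arccos((a_n * b_n).sum().clamp(-1 + eps, 1 - eps))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

emb_a, emb_b = torch.randn(768), torch.randn(768)   # stand-ins for two concept embeddings
steps = [slerp(emb_a, emb_b, t) for t in torch.linspace(0, 1, 5)]
```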

Conclusion and Forward Look

λ-ECLIPSE represents a significant advancement in P-T2I generation, offering a resource-efficient yet highly effective solution for creating personalized images from textual descriptions. Its ability to perform single- and multi-subject generation, coupled with edge-guided personalization and smooth concept interpolation, sets a new benchmark in the field. The paper's insights into maximizing the utility of pre-trained models without extensive supervision pave the way for future research in generative AI, particularly in optimizing resource use while enhancing output fidelity and diversity.
