
IMAGDressing-v1: Customizable Virtual Dressing

(2407.12705)
Published Jul 17, 2024 in cs.CV

Abstract

Latest advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience. However, existing VTON technologies neglect the need for merchants to showcase garments comprehensively, including flexible control over garments, optional faces, poses, and scenes. To address this issue, we define a virtual dressing (VD) task focused on generating freely editable human images with fixed garments and optional conditions. Meanwhile, we design a comprehensive affinity metric index (CAMI) to evaluate the consistency between generated images and reference garments. Then, we propose IMAGDressing-v1, which incorporates a garment UNet that captures semantic features from CLIP and texture features from VAE. We present a hybrid attention module, including a frozen self-attention and a trainable cross-attention, to integrate garment features from the garment UNet into a frozen denoising UNet, ensuring users can control different scenes through text. IMAGDressing-v1 can be combined with other extension plugins, such as ControlNet and IP-Adapter, to enhance the diversity and controllability of generated images. Furthermore, to address the lack of data, we release the interactive garment pairing (IGPair) dataset, containing over 300,000 pairs of clothing and dressed images, and establish a standard pipeline for data assembly. Extensive experiments demonstrate that our IMAGDressing-v1 achieves state-of-the-art human image synthesis performance under various controlled conditions. The code and model will be available at https://github.com/muzishen/IMAGDressing.

Figure: The proposed IMAGDressing-v1 framework: the garment UNet extracts garment features, and the denoising UNet balances them with text prompts.

Overview

  • The IMAGDressing-v1 paper introduces a customizable virtual dressing model that allows merchants to generate editable human images with fixed garments, improving flexibility and control compared to existing virtual try-on technologies.

  • Key contributions include the IMAGDressing-v1 model, which combines a garment UNet and hybrid attention mechanism, the development of the Comprehensive Affinity Metric Index (CAMI), and the release of the IGPair dataset for training and evaluation.

  • The IMAGDressing-v1 model achieves superior quantitative and qualitative performance in preserving garment details and adhering to textual prompts, with significant implications for e-commerce and for future research on generative models.

Overview of IMAGDressing-v1: Customizable Virtual Dressing

The paper "IMAGDressing-v1: Customizable Virtual Dressing" introduces a novel approach to virtual dressing (VD) that aims to provide comprehensive and personalized clothing displays for merchants. This approach addresses limitations in existing virtual try-on (VTON) technologies that predominantly target consumer scenarios with fixed human conditions, lacking flexibility and editability. Specifically, IMAGDressing-v1 facilitates the generation of freely editable human images featuring fixed garments under optional conditions, including faces, poses, and scenes. The key contributions of this research include the design and implementation of the IMAGDressing-v1 model, the development of the comprehensive affinity metric index (CAMI) for evaluation, and the release of the IGPair dataset for the VD task.

Key Contributions

The primary contributions of this paper can be summarized as follows:

  1. Virtual Dressing (VD) Task: The authors define a new VD task that focuses on generating editable human images with a fixed garment and optional conditions, providing greater flexibility and control for merchants to showcase clothing items.
  2. IMAGDressing-v1 Model: The proposed model combines a garment UNet and a hybrid attention mechanism to capture fine-grained garment features and integrate them with text-based scene control, leveraging latent diffusion models (LDMs) for enhanced image synthesis.
  3. Comprehensive Affinity Metric Index (CAMI): The authors design CAMI to evaluate the consistency between generated images and reference garments, with two components—CAMI-U (unspecified conditions) and CAMI-S (specified conditions).
  4. IGPair Dataset: The release of the IGPair dataset, consisting of over 300,000 pairs of clothing and dressed images, provides a rich resource for training and evaluation in the VD task.

Methodology

The IMAGDressing-v1 model leverages a VAE to compress images into latent space, a CLIP text encoder for conditional text embedding, and a denoising UNet for iterative denoising. Its distinguishing components are a garment UNet, which extracts semantic and texture features from clothing images, and a hybrid attention module, consisting of a frozen self-attention and a trainable cross-attention, that injects the garment features into the denoising UNet. This hybrid attention ensures that image generation respects both garment details and text prompts.
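
To make the hybrid attention mechanism concrete, below is a minimal PyTorch sketch of the idea, not the released implementation: a frozen self-attention branch stands in for the pretrained denoising-UNet block, while a trainable cross-attention branch attends to garment features. The learnable balancing scale and all module and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn


class HybridAttention(nn.Module):
    """Sketch of a hybrid attention block: frozen self-attention + trainable cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Frozen self-attention, standing in for the pretrained denoising-UNet block.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in self.self_attn.parameters():
            p.requires_grad = False
        # Trainable cross-attention over garment features from the garment UNet.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Scale balancing garment features against the base branch (an assumption here).
        self.scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, hidden_states: torch.Tensor, garment_feats: torch.Tensor) -> torch.Tensor:
        # Self-attention over the denoising UNet's own hidden states.
        self_out, _ = self.self_attn(hidden_states, hidden_states, hidden_states)
        # Cross-attention: queries from the UNet, keys/values from the garment features.
        cross_out, _ = self.cross_attn(hidden_states, garment_feats, garment_feats)
        return self_out + self.scale * cross_out


# Example shapes: batch of 2, 64 latent tokens, feature dim 320.
attn = HybridAttention(dim=320)
x = torch.randn(2, 64, 320)   # denoising-UNet hidden states
g = torch.randn(2, 64, 320)   # garment-UNet features at the same resolution
out = attn(x, g)              # -> (2, 64, 320)
```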

IMAGDressing-v1 Architecture:

  1. Garment UNet: This component extracts fine-grained semantic and texture features from garment images, using a frozen VAE encoder for texture features and a CLIP image encoder for semantic features.
  2. Denoising UNet: A frozen UNet similar to the one in Stable Diffusion v1.5 is used for latent space denoising while integrating garment features through hybrid attention.
  3. Hybrid Attention Module: Replaces the self-attention modules in the denoising UNet, pairing the frozen self-attention with a trainable cross-attention over the garment UNet's output so that garment features are integrated while text prompts retain control over the scene (see the wiring sketch after this list).
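
To show how the pieces above fit together at inference time, here is a hedged, schematic denoising loop with toy stand-ins for the two UNets and a standard DDIM scheduler from diffusers; the function signatures and tensor shapes are illustrative assumptions, not the authors' API.

```python
import torch
from diffusers import DDIMScheduler

# Toy stand-ins so the sketch executes; the real model uses two full UNets.
def garment_unet(garment_latent, t, clip_image_emb):
    # Would return per-resolution semantic/texture features (its attention outputs).
    return garment_latent

def denoise_unet(latents, t, text_emb, garment_feats):
    # Would predict noise, with hybrid attention mixing text and garment features.
    return torch.randn_like(latents)

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)

latents = torch.randn(1, 4, 64, 64)          # noisy latent (512x512 image / 8)
garment_latent = torch.randn(1, 4, 64, 64)   # VAE-encoded garment image
text_emb = torch.randn(1, 77, 768)           # CLIP text embedding (SD v1.5 dims)
clip_image_emb = torch.randn(1, 77, 768)     # CLIP image features of the garment (shape illustrative)

for t in scheduler.timesteps:
    feats = garment_unet(garment_latent, t, clip_image_emb)       # garment features per step
    noise_pred = denoise_unet(latents, t, text_emb, feats)        # frozen UNet + hybrid attention
    latents = scheduler.step(noise_pred, t, latents).prev_sample  # DDIM update
```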

The training process involves optimizing the hybrid attention and image encoder branch while keeping the denoising UNet’s primary weights fixed. This approach maximizes the model’s capacity to generate high-fidelity images that retain garment details and adhere to optional conditions such as poses, faces, and text descriptions.
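
A minimal sketch of this freezing scheme, assuming PyTorch-style modules with illustrative names: all pretrained denoising-UNet weights are frozen, and only the newly added cross-attention layers and the image-projection layer are handed to the optimizer.

```python
import itertools
import torch
import torch.nn as nn

# Stand-in modules so the sketch runs; in the real model these would be the
# pretrained denoising UNet and the CLIP image-projection layer.
denoising_unet = nn.ModuleDict({
    "self_attn": nn.Linear(320, 320),   # pretrained, stays frozen
    "cross_attn": nn.Linear(320, 320),  # newly added, trainable
})
image_proj = nn.Linear(1024, 768)

# Freeze all denoising-UNet weights, then re-enable only the new cross-attention.
for p in denoising_unet.parameters():
    p.requires_grad = False
for p in denoising_unet["cross_attn"].parameters():
    p.requires_grad = True

# Optimize only the trainable subset.
params = itertools.chain(
    denoising_unet["cross_attn"].parameters(), image_proj.parameters()
)
optimizer = torch.optim.AdamW(params, lr=1e-5)
```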

Quantitative and Qualitative Results

The paper reports that IMAGDressing-v1 excels across several quantitative metrics, including ImageReward and MP-LPIPS, surpassing other state-of-the-art methods. The introduction of CAMI further provides a structured way to evaluate generated images under both unspecified (CAMI-U) and specified (CAMI-S) conditions.

Qualitatively, the model demonstrates superior performance in preserving garment details and adhering to textual prompts compared to existing methods such as BLIP-Diffusion, Versatile Diffusion, IP-Adapter, and MagicClothing. For specific conditions, such as generating images with given poses or faces, IMAGDressing-v1 shows strong compatibility with community resources like ControlNet and IP-Adapter.

Implications and Future Directions

The implications of this research are significant for both practical applications and future theoretical developments in AI. For e-commerce, IMAGDressing-v1 offers merchants a powerful tool to create customizable and high-quality visual content for online platforms, potentially boosting consumer engagement and sales. The model’s flexibility in combining multiple conditions for image generation offers a compelling advantage for personalized marketing.

From a theoretical perspective, the introduction of hybrid attention mechanisms and the integration of multiple feature extraction methods pave the way for more advanced generative models. Future work could explore further enhancements in conditional control, scalability to higher resolutions, and broader applicability of the model to other domains requiring detailed image synthesis.

In conclusion, IMAGDressing-v1 represents a substantial advancement in the field of virtual dressing, addressing critical limitations of previous methods and setting a new benchmark for customizable garment-centric image generation. The release of the IGPair dataset further contributes to the community, providing a robust foundation for future research in this area.
