
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

(2407.16982)
Published Jul 24, 2024 in cs.CV and cs.AI

Abstract

This paper addresses the important problem of adding objects to images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve background consistency or require cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, a carefully constructed synthetic dataset built by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and an object description. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from only text. Extensive experiments demonstrate that Diffree excels in adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.

Iterative inpainting adds text-guided objects to images while maintaining background consistency.

Overview

  • Diffree presents a new approach for text-guided object addition in images using a diffusion model to maintain visual consistency in lighting, texture, and spatial location without disrupting the image's background.

  • The method incorporates an Object Mask Predictor module and a dataset called Object Addition Benchmark (OABench), facilitating the addition of new objects based on textual descriptions without the need for hand-drawn masks.

  • Evaluations show that Diffree significantly outperforms previous methods in success rate, background consistency, location reasonableness, and the quality of generated objects, making it well suited to practical use cases such as advertising and content creation.

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

In "Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model," the authors present a novel approach to object addition in images using solely text guidance. This paper addresses the challenge of integrating new objects into existing images in a way that maintains consistency in visual context, such as lighting, texture, and spatial location. Existing methods either disrupt the image's background or require cumbersome human-drawn masks. Diffree introduces a diffusion model that predicts the new object's position and integrates it into the image without altering the background undesirably.

Methodology Overview

Diffree extends the capabilities of Text-to-Image (T2I) models by incorporating an object mask predictor module. The authors curated a new dataset, Object Addition Benchmark (OABench), which contains 74K real-world tuples, including original images, inpainted images with objects removed, object masks, and object descriptions. OABench is generated using advanced image inpainting techniques to remove objects from images, capturing the relationship between objects and their context effectively.
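As a concrete illustration, a minimal PyTorch-style loader for OABench-like tuples might look like the sketch below. The directory layout, the `metadata.json` index, and the field names are assumptions for illustration, not the released dataset format.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class OABenchLikeDataset(Dataset):
    """Yields (object-removed image, original image, object mask, caption) tuples.

    Image transforms and caption tokenization are omitted for brevity.
    """

    def __init__(self, root: str):
        self.root = Path(root)
        # Hypothetical index file mapping image files to object descriptions.
        with open(self.root / "metadata.json") as f:
            self.records = json.load(f)

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        return {
            "source": Image.open(self.root / "inpainted" / rec["file"]).convert("RGB"),
            "target": Image.open(self.root / "original" / rec["file"]).convert("RGB"),
            "mask": Image.open(self.root / "mask" / rec["file"]).convert("L"),
            "caption": rec["description"],
        }
```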

The Diffree framework augments a Stable Diffusion model with an Object Mask Predictor (OMP) module to achieve text-guided object addition without explicit masks. The diffusion model iteratively denoises latents to generate the edited image, while the OMP module predicts the mask of the newly added object so that the corresponding region is inpainted according to the textual description. The model is trained with loss functions tailored to the diffusion and OMP modules, optimizing for contextually appropriate and visually consistent outputs.
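A hedged sketch of what such a combined training step could look like is given below, pairing the standard latent-diffusion noise-prediction loss with a binary mask loss for the OMP head. The loss weighting `lambda_mask`, the `omp_head` interface, and the channel-concatenation conditioning are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def training_step(unet, omp_head, vae, text_encoder, noise_scheduler, batch, lambda_mask=1.0):
    """One combined optimization step: diffusion loss + mask-prediction loss (sketch)."""
    # Batch tensors are assumed preprocessed: images in [-1, 1], captions tokenized,
    # and the object mask downsampled to the latent resolution ("mask_latent").
    cond_latents = vae.encode(batch["source"]).latent_dist.sample() * 0.18215
    target_latents = vae.encode(batch["target"]).latent_dist.sample() * 0.18215
    text_emb = text_encoder(batch["caption_ids"])[0]

    noise = torch.randn_like(target_latents)
    t = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (target_latents.shape[0],), device=target_latents.device,
    )
    noisy = noise_scheduler.add_noise(target_latents, noise, t)

    # Condition by concatenating clean source latents to the noisy target latents;
    # the UNet's input conv is assumed widened to accept the extra channels.
    unet_in = torch.cat([noisy, cond_latents], dim=1)
    noise_pred = unet(unet_in, t, encoder_hidden_states=text_emb).sample

    loss_diff = F.mse_loss(noise_pred, noise)                    # predict the added noise
    mask_logits = omp_head(unet_in, t, text_emb)                 # assumed OMP interface
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, batch["mask_latent"])

    return loss_diff + lambda_mask * loss_mask
```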

Dataset and Training

OABench's construction leverages existing instance segmentation datasets (e.g., COCO), applying rules to filter and synthesize high-quality training tuples. The process ensures the generated inpainted images retain high background consistency, critical for training the Diffree model.
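The synthesis loop can be pictured roughly as follows. The filtering thresholds and the choice of inpainting model (here a stock Stable Diffusion inpainting pipeline from diffusers) are illustrative assumptions, not the paper's exact rules.

```python
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Off-the-shelf inpainting pipeline, used here purely for illustration.
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")


def keep_instance(mask: np.ndarray, img_area: int) -> bool:
    """Heuristic filter: drop objects too small to matter or too large to remove cleanly."""
    ratio = mask.sum() / img_area
    return 0.01 < ratio < 0.3  # illustrative thresholds


def synthesize_tuple(image: Image.Image, mask: np.ndarray, description: str):
    """Turn one COCO-style instance into an OABench-style tuple, or None if filtered out."""
    if not keep_instance(mask, image.width * image.height):
        return None
    mask_img = Image.fromarray((mask * 255).astype(np.uint8))
    removed = pipe(prompt="background", image=image, mask_image=mask_img).images[0]
    # (original, object-removed, object mask, object description)
    return image, removed, mask_img, description
```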

The model is initialized from Stable Diffusion 1.5 weights, with batch sizes and learning rates tuned to the demands of text-guided object addition. Training randomly drops the conditioning so that, at sampling time, classifier-free guidance can blend conditional and unconditional predictions to balance sample quality and diversity.
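For reference, classifier-free guidance at sampling time amounts to blending the two noise predictions with a guidance scale, as in the single-scale sketch below; since Diffree conditions on both text and the source image, its actual guidance scheme may use separate scales for each condition.

```python
import torch


@torch.no_grad()
def cfg_noise(unet, latents, t, text_emb, null_emb, guidance_scale=7.5):
    """Blend unconditional and conditional noise predictions (single-scale sketch)."""
    # `latents` may already include concatenated image-condition channels.
    eps_uncond = unet(latents, t, encoder_hidden_states=null_emb).sample
    eps_cond = unet(latents, t, encoder_hidden_states=text_emb).sample
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```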

Evaluation Metrics

Traditional metrics are insufficient to evaluate this task comprehensively. Instead, the authors introduce a set of evaluation rules leveraging LPIPS for background consistency, GPT4V scores for the reasonableness of object location, Local CLIP Score for text-image correlation, and Local FID for object quality and diversity. A unified metric aggregates these scores with success rates to assess overall performance comprehensively.
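Two of these metrics can be approximated as in the sketch below: LPIPS computed only on the unchanged background, and a CLIP score between a crop around the added object and its description. The masking and cropping conventions are assumptions; the paper's exact protocol may differ.

```python
import lpips
import torch
from transformers import CLIPModel, CLIPProcessor

lpips_fn = lpips.LPIPS(net="alex")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def background_lpips(original, edited, mask):
    """LPIPS between original and edited images with the edited region blanked out.

    `original`/`edited`: float tensors (N, 3, H, W) in [-1, 1]; `mask`: (N, 1, H, W) in {0, 1}.
    """
    keep = 1.0 - mask
    return lpips_fn(original * keep, edited * keep).mean().item()


def local_clip_score(object_crop, text):
    """Cosine similarity between a crop around the added object and its text description."""
    inputs = clip_proc(text=[text], images=[object_crop], return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```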

Numerical Results

Diffree markedly outperforms prior works in several critical metrics:

  • Success Rate: Achieves over 98% success in adding objects, substantially higher than the 17-19% success rates of InstructPix2Pix.
  • Background Consistency: Shows performance comparable to mask-guided methods (LPIPS ≈ 0.07), highlighting its ability to preserve the original context without explicit masks.
  • Location Reasonableness: Achieves higher GPT4V scores (~3.47), indicating better spatial appropriateness of added objects.
  • Correlation and Quality: Diffree maintains superior Local FID scores (~57-60) and competitive Local CLIP Scores, affirming the generated objects' quality and relevance.

Implications and Future Work

Diffree's method has significant practical and theoretical implications. Practically, it eliminates the need for labor-intensive mask creation, broadening its accessibility and usability in fields such as advertising, content creation, and virtual staging. Theoretically, it advances the understanding of combining diffusion models with auxiliary prediction modules to enhance image editing tasks.

Future developments could explore integrating Diffree with other methods, such as combining it with AnyDoor for specific object addition or with GPT4V for planning object placements in images. Moreover, continuous improvements in image inpainting techniques could further refine Diffree's outputs, ensuring higher fidelity and contextual relevance.

In conclusion, Diffree represents a substantial advancement in text-guided image editing, marrying diffusion models with innovative object mask prediction techniques to achieve high success in object addition while maintaining visual coherence and quality. The paper lays a robust foundation for future enhancements and applications in AI-driven image editing.
