Emergent Mind

Magic-Me: Identity-Specific Video Customized Diffusion

(2402.09368)
Published Feb 14, 2024 in cs.CV and cs.AI

Abstract

Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven creation has achieved great progress with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective subject identity controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion, trained with the cropped identity to disentangle the ID information from the background; 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD is able to generate stable videos with better identity preservation than the baselines. Besides, thanks to the transferability of the encoded identity in the ID module, VCD also works well with publicly available personalized text-to-image models. The code is available at https://github.com/Zhen-Dong/Magic-Me.

Overview

  • VCD introduces a framework for generating videos that maintain the subject's identity across varied scenarios by integrating three novel components: an ID module, a 3D Gaussian Noise Prior, and video-to-video (V2V) modules.

  • The ID module focuses on capturing compact identity features from cropped images into text tokens to ensure identity preservation and consistency across video frames.

  • The 3D Gaussian Noise Prior improves inter-frame consistency, while the V2V modules (Face VCD and Tiled VCD) restore facial detail and raise output resolution, addressing the common difficulty of keeping faces sharp in generated video.

  • Experimental validation demonstrates VCD's superiority in generating stable, high-quality videos with preserved identities against strong baselines, suggesting its potential in personalized content creation and digital marketing.

Video Custom Diffusion for Identity-Specific Video Generation

Introduction to Video Custom Diffusion (VCD)

The paper presents Video Custom Diffusion (VCD), a framework for identity-specific video generation that improves the preservation and alignment of subject identities across video frames. Built on three novel components, an ID module, a 3D Gaussian Noise Prior for enhanced frame consistency, and video-to-video (V2V) modules for quality enhancement, VCD generates high-quality videos that faithfully maintain the predefined subject identity through dynamic scenarios and motions.

Key Components of VCD

The architecture of VCD integrates several key innovations to address challenges in identity-specific video generation:

  • ID Module: Based on extended Textual Inversion, this module is trained on images cropped to contain only the subject, disentangling identity features from the background and encoding them into compact text tokens. These tokens carry the identity reliably across video frames, which is central to VCD's identity preservation and consistency.
  • 3D Gaussian Noise Prior: To improve inter-frame consistency, VCD initializes all frames with correlated noise, establishing correlation between frames from the outset so that denoising starts from a temporally coherent state and the identity is depicted stably throughout the video.
  • V2V Modules: Face VCD re-denoises face regions to recover identity detail, and Tiled VCD upscales the video to higher resolution while preserving the identity's features. Together they compensate for the resolution limits of diffusion models, which otherwise blur facial features at varying distances within a shot.
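The 3D Gaussian Noise Prior is only described at a high level here. Below is a minimal sketch of one way to realize frame-correlated initialization noise; the AR(1) construction and the mixing coefficient `alpha` are illustrative assumptions, not the paper's exact covariance formulation:

```python
import numpy as np

def correlated_noise(num_frames, shape, alpha=0.8, seed=0):
    """Sample initialization noise whose frames are correlated.

    Each frame is marginally unit Gaussian, but neighboring frames
    share structure: eps_t = alpha * eps_{t-1} + sqrt(1 - alpha^2) * z_t.
    This AR(1) chain is a stand-in for the paper's 3D Gaussian prior,
    which specifies a covariance across frames.
    """
    rng = np.random.default_rng(seed)
    frames = [rng.standard_normal(shape)]
    scale = np.sqrt(1.0 - alpha ** 2)
    for _ in range(num_frames - 1):
        z = rng.standard_normal(shape)
        frames.append(alpha * frames[-1] + scale * z)
    return np.stack(frames)  # shape: (num_frames, *shape)

# 16 frames of 4x32x32 latent noise; adjacent frames correlate ~alpha,
# so the denoiser starts from a temporally coherent state.
noise = correlated_noise(16, (4, 32, 32), alpha=0.8)
```

Because each frame remains a valid unit-Gaussian sample, such noise can be fed to a standard diffusion sampler unchanged; only the inter-frame correlation differs from independent initialization.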

Experimental Validation and Results

In extensive experiments, VCD is validated against strong baselines and shows superior capability in generating stable, high-quality videos with accurately preserved identities. Because the identity is encoded in the ID module's tokens, VCD also integrates seamlessly with publicly available personalized text-to-image models, further broadening the framework's applicability.
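That transferability follows from the identity being stored as token embeddings rather than model weights: a learned embedding can be spliced into any compatible text encoder's input sequence. A toy illustration of the splicing step (the placeholder token `<id>`, the tiny vocabulary, and the embedding dimension are all made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"a": 0, "photo": 1, "of": 2, "<id>": 3, "surfing": 4}
embed_dim = 8
embedding_table = rng.standard_normal((len(vocab), embed_dim))

# The learned identity embedding produced by (extended) Textual Inversion;
# here a random vector stands in for the trained one.
id_embedding = rng.standard_normal(embed_dim)

def encode_prompt(tokens):
    """Look up embeddings, substituting the learned vector for <id>."""
    rows = []
    for tok in tokens:
        if tok == "<id>":
            rows.append(id_embedding)
        else:
            rows.append(embedding_table[vocab[tok]])
    return np.stack(rows)  # shape: (seq_len, embed_dim)

seq = encode_prompt(["a", "photo", "of", "<id>", "surfing"])
# seq feeds any text encoder that shares this embedding space, which is
# why the same identity token can transfer across T2I checkpoints.
```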

The fusion of the proposed components allows VCD to effectively mitigate common issues in video generation, such as inconsistent identity portrayal and fluctuating video backgrounds, which have been persistent obstacles in prior research efforts.
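Among those components, Tiled VCD follows a partition-process-stitch pattern to raise resolution within the memory limits of a diffusion model. A minimal sketch of the pattern, with nearest-neighbor upsampling standing in for the per-tile diffusion pass (tile size and scale factor are illustrative):

```python
import numpy as np

def upscale_tiled(image, tile=8, scale=2):
    """Upscale an image tile by tile (partition -> process -> stitch).

    Nearest-neighbor repeat is a placeholder for the per-tile diffusion
    upscaling pass; in Tiled VCD each tile is re-denoised so identity
    detail survives the resolution increase.
    """
    h, w = image.shape
    out = np.zeros((h * scale, w * scale), dtype=image.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            up = patch.repeat(scale, axis=0).repeat(scale, axis=1)
            out[y * scale:(y + patch.shape[0]) * scale,
                x * scale:(x + patch.shape[1]) * scale] = up
    return out

img = np.arange(16.0).reshape(4, 4)
big = upscale_tiled(img, tile=2, scale=2)  # shape: (8, 8)
```

Processing tiles independently keeps peak memory bounded by the tile size rather than the full output resolution, which is the practical motivation for tiled upscaling.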

Implications and Future Developments

The introduction of VCD represents a significant step forward in generative AI, particularly for applications demanding high fidelity in identity preservation across videos, ranging from personalized content creation to digital marketing. The framework not only raises the standard for identity-specific video generation but also opens avenues for future research, such as multi-identity interaction within videos and extending video duration without compromising quality or consistency.

Conclusion

VCD emerges as a comprehensive and effective solution for identity-specific video generation, backed by its novel components and extensive experimental validation. Its ability to produce high-quality, identity-consistent video content efficiently positions it as a valuable tool for both research and practical applications in generative AI. As the field evolves, the principles and methodologies introduced by VCD are likely to inform future advances in video generation.
