Emergent Mind

Magic-Me: Identity-Specific Video Customized Diffusion

(2402.09368)
Published Feb 14, 2024 in cs.CV and cs.AI

Abstract

Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven creation has achieved great progress with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective subject identity controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion, trained with the cropped identity to disentangle the ID information from the background; 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD is able to generate stable videos with better identity preservation than the baselines. Besides, thanks to the transferability of the encoded identity in the ID module, VCD also works well with publicly available personalized text-to-image models. The code is available at https://github.com/Zhen-Dong/Magic-Me.

Overview

  • VCD introduces a framework for generating videos that maintain the subject's identity across varied scenarios by integrating three novel components: an ID module, a 3D Gaussian Noise Prior, and video-to-video (V2V) modules.

  • The ID module focuses on capturing compact identity features from cropped images into text tokens to ensure identity preservation and consistency across video frames.

  • The 3D Gaussian Noise Prior improves inter-frame consistency, while the V2V modules (Face VCD and Tiled VCD) restore facial detail and raise output resolution, addressing the common difficulty of keeping faces sharp in generated video.

  • Experimental validation demonstrates VCD's superiority in generating stable, high-quality videos with preserved identities against strong baselines, suggesting its potential in personalized content creation and digital marketing.

Video Custom Diffusion for Identity-Specific Video Generation

Introduction to Video Custom Diffusion (VCD)

The paper presents Video Custom Diffusion (VCD), a framework for identity-specific video generation that improves the preservation and alignment of subject identities across video frames. Built on three novel components, an ID module, a 3D Gaussian Noise Prior for enhanced frame consistency, and video-to-video (V2V) modules for quality enhancement, VCD generates high-quality videos that faithfully maintain the predefined subject identity through dynamic scenarios and motions.

Key Components of VCD

The architecture of VCD integrates several key innovations to address challenges in identity-specific video generation:

  • ID Module: Based on extended Textual Inversion, this module is trained on images cropped to contain only the subject, disentangling identity features from the background and encoding them into compact text tokens. These tokens carry the identity reliably across video frames, which is central to VCD's identity preservation and consistency.
  • 3D Gaussian Noise Prior: To improve inter-frame consistency, VCD initializes all frames with correlated noise, establishing correlation between frames from the outset so that denoising starts from a temporally coherent state and the identity is depicted stably throughout the video.
  • V2V Modules: Face VCD re-denoises face regions to recover identity detail, and Tiled VCD upscales the video to higher resolution while preserving the identity's features. Together they compensate for the resolution limits of diffusion models, which otherwise blur facial features at varying distances within a shot.
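The 3D Gaussian Noise Prior is only described at a high level here. Below is a minimal sketch of one way to realize frame-correlated initialization noise; the AR(1) construction and the mixing coefficient `alpha` are illustrative assumptions, not the paper's exact covariance formulation:

```python
import numpy as np

def correlated_noise(num_frames, shape, alpha=0.8, seed=0):
    """Sample initialization noise whose frames are correlated.

    Each frame is marginally unit Gaussian, but neighboring frames
    share structure: eps_t = alpha * eps_{t-1} + sqrt(1 - alpha^2) * z_t.
    This AR(1) chain is a stand-in for the paper's 3D Gaussian prior,
    which specifies a covariance across frames.
    """
    rng = np.random.default_rng(seed)
    frames = [rng.standard_normal(shape)]
    scale = np.sqrt(1.0 - alpha ** 2)
    for _ in range(num_frames - 1):
        z = rng.standard_normal(shape)
        frames.append(alpha * frames[-1] + scale * z)
    return np.stack(frames)  # shape: (num_frames, *shape)

# 16 frames of 4x32x32 latent noise; adjacent frames correlate ~alpha,
# so the denoiser starts from a temporally coherent state.
noise = correlated_noise(16, (4, 32, 32), alpha=0.8)
```

Because each frame remains a valid unit-Gaussian sample, such noise can be fed to a standard diffusion sampler unchanged; only the inter-frame correlation differs from independent initialization.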

Experimental Validation and Results

In extensive experiments, VCD is validated against strong baselines and shows superior capability in generating stable, high-quality videos with accurately preserved identities. Because the identity is encoded in the ID module's tokens, VCD also integrates seamlessly with publicly available personalized text-to-image models, further broadening the framework's applicability.
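That transferability follows from the identity being stored as token embeddings rather than model weights: a learned embedding can be spliced into any compatible text encoder's input sequence. A toy illustration of the splicing step (the placeholder token `<id>`, the tiny vocabulary, and the embedding dimension are all made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"a": 0, "photo": 1, "of": 2, "<id>": 3, "surfing": 4}
embed_dim = 8
embedding_table = rng.standard_normal((len(vocab), embed_dim))

# The learned identity embedding produced by (extended) Textual Inversion;
# here a random vector stands in for the trained one.
id_embedding = rng.standard_normal(embed_dim)

def encode_prompt(tokens):
    """Look up embeddings, substituting the learned vector for <id>."""
    rows = []
    for tok in tokens:
        if tok == "<id>":
            rows.append(id_embedding)
        else:
            rows.append(embedding_table[vocab[tok]])
    return np.stack(rows)  # shape: (seq_len, embed_dim)

seq = encode_prompt(["a", "photo", "of", "<id>", "surfing"])
# seq feeds any text encoder that shares this embedding space, which is
# why the same identity token can transfer across T2I checkpoints.
```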

The fusion of the proposed components allows VCD to effectively mitigate common issues in video generation, such as inconsistent identity portrayal and fluctuating video backgrounds, which have been persistent obstacles in prior research efforts.
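Among those components, Tiled VCD follows a partition-process-stitch pattern to raise resolution within the memory limits of a diffusion model. A minimal sketch of the pattern, with nearest-neighbor upsampling standing in for the per-tile diffusion pass (tile size and scale factor are illustrative):

```python
import numpy as np

def upscale_tiled(image, tile=8, scale=2):
    """Upscale an image tile by tile (partition -> process -> stitch).

    Nearest-neighbor repeat is a placeholder for the per-tile diffusion
    upscaling pass; in Tiled VCD each tile is re-denoised so identity
    detail survives the resolution increase.
    """
    h, w = image.shape
    out = np.zeros((h * scale, w * scale), dtype=image.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            up = patch.repeat(scale, axis=0).repeat(scale, axis=1)
            out[y * scale:(y + patch.shape[0]) * scale,
                x * scale:(x + patch.shape[1]) * scale] = up
    return out

img = np.arange(16.0).reshape(4, 4)
big = upscale_tiled(img, tile=2, scale=2)  # shape: (8, 8)
```

Processing tiles independently keeps peak memory bounded by the tile size rather than the full output resolution, which is the practical motivation for tiled upscaling.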

Implications and Future Developments

The introduction of VCD represents a significant step forward in generative AI, particularly for applications demanding high fidelity in identity preservation across videos, ranging from personalized content creation to digital marketing. The framework not only raises the standard for identity-specific video generation but also opens avenues for future research, such as multi-identity interaction within videos and extending video duration without compromising quality or consistency.

Conclusion

VCD emerges as a comprehensive and effective solution for identity-specific video generation, backed by its novel components and extensive experimental validation. Its ability to produce high-quality, identity-consistent video content efficiently positions it as a valuable tool for both research and practical applications in generative AI. As the field evolves, the principles and methodologies introduced by VCD are likely to inform future advances in video generation.
