ID-Animator: Zero-Shot Identity-Preserving Human Video Generation (2404.15275v3)
Abstract: Generating high-fidelity human video with specified identities has attracted significant attention in the content-generation community. However, existing techniques struggle to balance training efficiency and identity preservation: they either require tedious case-by-case fine-tuning or lose identity details during generation. In this study, we present ID-Animator, a zero-shot human-video generation approach that performs personalized video generation from a single reference facial image without further training. ID-Animator builds on existing diffusion-based video generation backbones, adding a face adapter that encodes ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information for video generation, we introduce an ID-oriented dataset construction pipeline that applies unified human-attribute and action captioning to a constructed facial image pool. On top of this pipeline, we devise a random reference training strategy that, together with an ID-preserving loss, precisely captures ID-relevant embeddings, improving the fidelity and generalization of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator over previous models for personalized human video generation. Moreover, our method is highly compatible with popular pre-trained T2V models such as AnimateDiff and various community backbone models, showing strong extensibility in real-world applications where identity preservation is desired. Our code and checkpoints are released at https://github.com/ID-Animator/ID-Animator.
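To make the adapter design concrete, below is a minimal PyTorch sketch of the query-based face encoder the abstract describes: a small set of learnable facial latent queries cross-attends to features from a frozen image encoder and emits compact ID tokens that a video backbone can consume through added cross-attention layers. All module names, dimensions, and the ID-weighted loss helper are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class FacialLatentQueryEncoder(nn.Module):
    """Distill ID-relevant embeddings from a reference face image.

    A fixed number of learnable latent queries attend to patch-level
    face features (e.g., from a frozen CLIP image encoder), producing
    a compact set of identity tokens. Dimensions are illustrative.
    """

    def __init__(self, face_feat_dim=1024, token_dim=768,
                 num_queries=16, num_heads=8):
        super().__init__()
        # Learnable facial latent queries, shared across all inputs.
        self.queries = nn.Parameter(torch.randn(num_queries, token_dim) * 0.02)
        self.proj = nn.Linear(face_feat_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, face_features):
        # face_features: (B, N_patches, face_feat_dim) from a frozen encoder.
        kv = self.proj(face_features)
        q = self.queries.unsqueeze(0).expand(face_features.size(0), -1, -1)
        id_tokens, _ = self.attn(q, kv, kv)  # (B, num_queries, token_dim)
        return self.norm(id_tokens)          # ID tokens for cross-attention


def id_weighted_diffusion_loss(noise_pred, noise, face_mask=None, id_weight=1.0):
    """Hypothetical ID-preserving variant of the diffusion loss:
    the standard noise-prediction MSE, with errors inside a detected
    face region up-weighted by `id_weight`."""
    per_element = (noise_pred - noise) ** 2
    if face_mask is not None:
        per_element = per_element * (1.0 + id_weight * face_mask)
    return per_element.mean()
```

Under this reading, the abstract's random reference strategy would correspond to sampling the reference face from the constructed image pool of the same identity, distinct from the frames in the training clip, so the adapter learns identity rather than the pose, lighting, or background of any one frame.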
- Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a.
- Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023b.
- Learning temporal coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics (TOG), 39(4), Article 75, 2020.
- Civitai. https://civitai.com/. Accessed: April 21, 2024.
- Animateanything: Fine-grained open domain image animation with motion guidance. arXiv preprint arXiv:2311.12886, 2023.
- An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
- Long video generation with time-agnostic vqgan and time-sensitive transformer. In European Conference on Computer Vision, pages 102–118. Springer, 2022.
- Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023a.
- Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023b.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
- Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
- Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3397–3406, 2022.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.
- Make it move: controllable image-to-video generation with text descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18219–18228, 2022.
- Videobooth: Diffusion-based video generation with image prompts. arXiv preprint arXiv:2312.00777, 2023a.
- Text2performer: Text-driven human video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22747–22757, 2023b.
- Photomaker: Customizing realistic human photos via stacked id embedding. arXiv preprint arXiv:2312.04461, 2023.
- Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
- Stylecrafter: Enhancing stylized text-to-video generation with style adapter. arXiv preprint arXiv:2312.00330, 2023.
- Magic-me: Identity-specific video customized diffusion. arXiv preprint arXiv:2402.09368, 2024.
- Strait: Non-autoregressive generation with stratified image transformer. arXiv preprint arXiv:2303.00750, 2023.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
- Face0: Instantaneously conditioning a text-to-image model on a face. In SIGGRAPH Asia 2023 Conference Papers, pages 1–10, 2023.
- Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
- Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
- Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
- Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
- Do you guys want to dance: Zero-shot compositional human dance generation with multiple persons. arXiv preprint arXiv:2401.13363, 2024.
- Facestudio: Put your face everywhere in seconds. arXiv preprint arXiv:2312.02663, 2023.
- H. Ye. IP-Adapter Plus Face. https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.bin, 2024a. Accessed: April 19, 2024.
- H. Ye. IP-Adapter FaceID Portrait V11 SD15. https://huggingface.co/h94/IP-Adapter-FaceID/blob/main/ip-adapter-faceid-portrait-v11_sd15.bin, 2024b. Accessed: April 19, 2024.
- Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- Celebv-text: A large-scale facial text-video dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14805–14814, 2023.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18697–18709, 2022.