- The paper introduces a diffusion-based model that uses raw video to animate portraits while preventing identity leakage and background interference.
- It employs synthetic data generation with face-swapping and stylization, plus background segmentation and CLIP-based encoding, to preserve subject identity and keep backgrounds stable.
- Empirical results show that MegActor achieves animation quality comparable to state-of-the-art commercial models while being trained solely on publicly available datasets.
An Exposition of MegActor: Using Raw Video for Portrait Animation
This essay explores the paper "MegActor: Harness the Power of Raw Video for Vivid Portrait Animation" by Shurong Yang et al., which addresses portrait animation: transferring facial expressions and motion from a driving video to a static reference portrait without compromising the identity or background of the reference.
Introduction and Challenges
Portrait animation is increasingly significant in applications such as digital avatars and AI-driven conversations. Previous methods often relied on Generative Adversarial Networks (GANs) or Neural Radiance Fields (NeRFs), but these approaches suffered from unrealistic renderings and artifacts such as blurring and flickering. Using raw driving videos directly raises two critical problems: identity leakage, where the model copies appearance details of the driving subject into the output, and interference from irrelevant background or facial details such as wrinkles.
Methodological Innovations
MegActor introduces a diffusion-based model explicitly designed to exploit raw video inputs while mitigating the aforementioned challenges. It incorporates a synthetic data generation framework that produces videos in which motion and expressions stay fixed but identities vary, which curbs identity leakage. The framework leverages:
- Face-swapping and stylization: these techniques generate identity variations while preserving motion dynamics, using tools such as Face-Fusion and the SDXL model (a minimal sketch follows this list).
- Background and detail management: by segmenting the background and encoding it with the Contrastive Language-Image Pre-training (CLIP) model, MegActor keeps the background stable and filters out extraneous details from the driving video (see the second sketch below).
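To make the identity-variation idea concrete, here is a minimal sketch of such a data generation step. It assumes generic `face_swap` and `stylize` callables standing in for tools like Face-Fusion and an SDXL-based stylizer; the names and structure are illustrative, not the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

# A frame is an HxWx3 uint8 array.
Frame = np.ndarray


@dataclass
class Clip:
    frames: List[Frame]   # video frames sharing one motion/expression track
    identity: str         # label for the subject appearing in the clip


def vary_identity(
    clip: Clip,
    face_swap: Callable[[Frame, Frame], Frame],  # (driving_frame, target_face) -> swapped frame
    stylize: Callable[[Frame], Frame],           # e.g. a diffusion-based style transfer
    target_faces: List[Frame],
) -> List[Clip]:
    """Build training clips that keep the original motion and expressions
    but change who appears in them, so the animation model cannot learn to
    copy the driving subject's appearance (identity leakage)."""
    out = []
    for i, face in enumerate(target_faces):
        swapped = [face_swap(frame, face) for frame in clip.frames]
        out.append(Clip(frames=swapped, identity=f"swap_{i}"))
    # Stylized copies further decouple appearance from motion.
    out.append(Clip(frames=[stylize(frame) for frame in clip.frames],
                    identity=f"{clip.identity}_stylized"))
    return out
```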
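And a minimal sketch of the background-conditioning idea, assuming a foreground mask produced by any off-the-shelf portrait segmentation model and the Hugging Face CLIP vision encoder; how MegActor actually injects this representation into its diffusion backbone is not shown here.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# CLIP vision encoder used to embed the background of the reference portrait.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")


def encode_background(reference: Image.Image, foreground_mask: np.ndarray) -> torch.Tensor:
    """Return a CLIP embedding of the background region only, which can be
    fed to the generator as a background-stability condition."""
    img = np.array(reference).copy()
    img[foreground_mask] = 0  # blank out the subject so only background is encoded
    inputs = processor(images=Image.fromarray(img), return_tensors="pt")
    with torch.no_grad():
        features = vision(**inputs).pooler_output  # shape (1, hidden_dim)
    return features
```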
Numerical Results and Observations
Empirical evidence shows that MegActor yields animations with natural expressions and motion dynamics, achieving results comparable to state-of-the-art (SOTA) commercial models despite being trained solely on public datasets. According to the paper's experiments, the produced animations remain robust across different identities and video sources.
Implications and Future Challenges
The implications of MegActor are twofold. Practically, it opens pathways to deploying high-quality, identity-preserving animation in commercial and open-source settings. Theoretically, it reinforces the value of pairing conditional diffusion models with synthetic data generation.
Nonetheless, future work remains: reducing artifacts in delicate facial regions such as hairlines and mouth dynamics, and exploring how facial attribute variation interacts with overall video quality. Integrating a stronger generation backbone, such as SDXL, also holds promise for further gains.
Conclusion
The MegActor framework presents a methodical approach to portrait animation, introducing a conditional diffusion model that exploits raw driving video inputs to synthesize expressive, vivid animations. It stands as an influential milestone in portrait animation and AI-driven visual generation, setting a precedent for reproducibility and quality with publicly available datasets.
This exploration of MegActor thus highlights not only the capabilities of current techniques in the field but also significant opportunities for further research and development in AI-generated animation and video synthesis.