
Abstract

Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation. Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability. Building upon this, we develop a video-driven portrait animation framework named LivePortrait with a focus on better generalization, controllability, and efficiency for practical usage. To enhance the generation quality and generalization ability, we scale up the training data to about 69 million high-quality frames, adopt a mixed image-video training strategy, upgrade the network architecture, and design better motion transformation and optimization objectives. Additionally, we discover that compact implicit keypoints can effectively represent a kind of blendshapes and meticulously propose a stitching and two retargeting modules, which utilize a small MLP with negligible computational overhead, to enhance the controllability. Experimental results demonstrate the efficacy of our framework even compared to diffusion-based methods. The generation speed remarkably reaches 12.8ms on an RTX 4090 GPU with PyTorch. The inference code and models are available at https://github.com/KwaiVGI/LivePortrait

Figure: Optimization of the stitching and retargeting modules, performed after freezing the appearance and motion extractors, the warping module, and the decoder.

Overview

  • The paper introduces LivePortrait, a framework for animating static portrait images with a focus on realism and computational efficiency, diverging from traditional diffusion-based methods.

  • Core contributions include an implicit-keypoint-based framework, scaled-up training data of roughly 69 million high-quality frames combined with a mixed image-video training strategy, and an upgraded network architecture with improved motion transformation and optimization objectives.

  • Further innovations include stitching and retargeting modules for fine-grained control of eye and lip movements; the full system performs strongly in both self-reenactment and cross-reenactment scenarios and runs in real time on a high-end GPU (about 12.8 ms per frame on an RTX 4090).

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

The paper "LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control" by Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, et al. introduces an innovative framework for animating static portrait images, prioritizing both realism and computational efficiency. The proposed method diverges from mainstream diffusion-based approaches, instead extending the capabilities of the implicit-keypoint-based framework. This paper makes significant strides in enhancing the generalization, controllability, and efficiency of portrait animation systems.

Key Contributions

The core contributions of the paper include:

  1. Implicit-Keypoint-Based Framework: Leveraging compact implicit keypoints as the motion representation to balance computational efficiency and precise control.
  2. Scalable Training Data: Utilizing a large-scale dataset of approximately 69 million high-quality frames and adopting a mixed image-video training strategy.
  3. Network Architecture Improvements: Enhancing the network components and proposing improved motion transformation and optimization objectives.
  4. Stitching and Retargeting Modules: Introducing low-overhead modules for stitching and precise control of eye and lip movements.
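To make the overall data flow concrete, here is a minimal sketch of how these pieces fit together at inference time. The function and variable names (extract_appearance, extract_motion, warp_and_decode, the keypoint count K = 21) are placeholders of my own, not the paper's API, and the toy implementations only mirror the structure: appearance features come from the source once, motion comes from each driving frame, and a keypoint-driven warp plus decoder produce the output.

```python
import torch

# Toy stand-ins for the learned networks (appearance extractor, motion extractor,
# warping module, decoder). Only the data flow mirrors the paper; the real
# components are deep CNNs with trained weights.
def extract_appearance(source):             # source image -> appearance feature volume
    return source.mean(dim=1, keepdim=True).expand(-1, 32, -1, -1)

def extract_motion(frame):                  # frame -> canonical kps, pose, expression, scale, translation
    B = frame.shape[0]
    return {
        "x_c": torch.randn(B, 21, 3),       # canonical implicit keypoints (K = 21 assumed)
        "R": torch.eye(3).expand(B, 3, 3),  # head rotation
        "t": torch.zeros(B, 3),             # translation
        "s": torch.ones(B, 1, 1),           # scale factor
        "delta": torch.zeros(B, 21, 3),     # expression deformation
    }

def transform_keypoints(m):                 # x = s * (x_c @ R + delta) + t
    return m["s"] * (m["x_c"] @ m["R"] + m["delta"]) + m["t"].unsqueeze(1)

def warp_and_decode(feat, x_s, x_d):        # warp source features by keypoint motion, then decode
    return feat[:, :3]                      # placeholder output image

source = torch.rand(1, 3, 256, 256)                         # single portrait image
driving = [torch.rand(1, 3, 256, 256) for _ in range(4)]    # driving video frames

feat_s = extract_appearance(source)
x_s = transform_keypoints(extract_motion(source))           # source keypoints, computed once

for frame in driving:
    x_d = transform_keypoints(extract_motion(frame))        # driving keypoints per frame
    out = warp_and_decode(feat_s, x_s, x_d)                 # animated frame: source appearance, driving motion
```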

Methodology

The paper's methodology is rooted in several impactful enhancements to the traditional implicit-keypoint-based framework:

Data Curation and Mixed Training:

  • The authors curated a vast and diverse training dataset comprising public video datasets, proprietary 4K resolution portrait clips, and styled portrait images.
  • A novel mixed training strategy allows the model to leverage both static images and dynamic videos, enhancing generalization capabilities to various portrait styles.
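A minimal sketch of what such mixed sampling could look like; the sampling ratio, pairing policy, and the idea of treating a still image as a one-frame clip are assumptions on my part, not details given in the paper.

```python
import random

def sample_training_pair(videos, images, p_image=0.3):
    """Mixed image-video sampling (sketch; p_image and the pairing policy are assumptions).

    videos: list of frame lists (one list per clip); images: list of single styled portraits.
    Returns a (source_frame, driving_frame) pair drawn from the same identity.
    """
    if images and random.random() < p_image:
        frame = random.choice(images)
        return frame, frame                         # a still image acts as a one-frame clip
    clip = random.choice(videos)
    if len(clip) > 1:
        src, drv = random.sample(range(len(clip)), 2)
    else:
        src, drv = 0, 0
    return clip[src], clip[drv]
```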

Network Upgrades:

  • Integration of the canonical implicit keypoint detection, head pose estimation, and expression deformation networks into a single motion extractor using ConvNeXt-V2-Tiny as the backbone.
  • Incorporation of a SPADE decoder as the generator to improve the quality and resolution of the animated images.
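A sketch of what a unified motion extractor with multiple prediction heads might look like. The class name, head dimensions, keypoint count, and the toy convolutional stack standing in for ConvNeXt-V2-Tiny are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MotionExtractor(nn.Module):
    """Single-network motion extractor (sketch; head sizes and K = 21 are assumptions)."""
    def __init__(self, num_kp=21, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(                     # stand-in for ConvNeXt-V2-Tiny
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.kp_head    = nn.Linear(feat_dim, num_kp * 3)  # canonical keypoints x_c
        self.pose_head  = nn.Linear(feat_dim, 3)           # yaw, pitch, roll
        self.trans_head = nn.Linear(feat_dim, 3)           # translation t
        self.scale_head = nn.Linear(feat_dim, 1)           # scale s
        self.exp_head   = nn.Linear(feat_dim, num_kp * 3)  # expression deformation delta

    def forward(self, img):
        f, B = self.backbone(img), img.shape[0]
        return {
            "x_c": self.kp_head(f).view(B, -1, 3),
            "angles": self.pose_head(f),
            "t": self.trans_head(f),
            "s": self.scale_head(f),
            "delta": self.exp_head(f).view(B, -1, 3),
        }

motion = MotionExtractor()(torch.rand(2, 3, 256, 256))
print({k: v.shape for k, v in motion.items()})
```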

Scalable Motion Transformation:

  • Inclusion of a scaling factor in motion transformation, balancing the flexibility and stability of expression deformations.
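Based on the paper's description of folding a scale factor into the keypoint transformation, a plausible form of the driving transformation (the exact notation is an assumption) is

    x_d = s_d · (x_{c,s} R_d + δ_d) + t_d,

where x_{c,s} are the canonical implicit keypoints of the source, R_d the driving head rotation, δ_d the expression deformation, t_d the translation, and s_d the scale factor. Placing s_d outside the parentheses lets a single factor jointly modulate the pose and expression deformations, which is what trades flexibility against stability.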

Landmark-Guided Optimization:

  • Introduction of a landmark-guided loss to refine the learning of implicit keypoints, focusing particularly on subtle facial movements like eye gaze adjustments.
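A minimal sketch of one way such a loss could be formed, assuming the implicit keypoints are projected to 2D and compared against landmarks from an off-the-shelf detector restricted to the eye and lip regions; the projection, matching, and L1 form are my assumptions.

```python
import torch

def landmark_guided_loss(implicit_kp_3d, detected_landmarks_2d):
    """Landmark-guided term (sketch; matching and projection details are assumptions).

    implicit_kp_3d: (B, K, 3) implicit keypoints for the current frame.
    detected_landmarks_2d: (B, K, 2) 2D landmarks (e.g., eyes/lips) matched to those keypoints.
    """
    pred_2d = implicit_kp_3d[..., :2]                    # orthographic projection: drop depth
    return (pred_2d - detected_landmarks_2d).abs().mean()  # L1 distance
```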

Cascaded Loss Terms:

  • Implementation of multi-region perceptual and GAN losses, alongside a face-id loss and the landmark-guided loss to improve both identity preservation and animation quality.
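Illustratively, the total objective would be a weighted sum of these terms; the weights below and the exact formulations of each term (region crops, GAN variant, face-ID network) are assumptions, since the summary does not specify them.

```python
def total_loss(l_perceptual, l_gan, l_faceid, l_landmark,
               w_p=1.0, w_g=0.1, w_id=0.5, w_lmk=1.0):
    # Weighted combination of the cascaded loss terms (weights are illustrative).
    return w_p * l_perceptual + w_g * l_gan + w_id * l_faceid + w_lmk * l_landmark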

Stitching and Retargeting

The framework includes sophisticated modules for stitching and retargeting that allow for enhanced controllability with minimal computational overhead:

Stitching Module:

  • The stitching module mitigates pixel misalignment, enabling accurate reconstruction of the animated region onto the original image space.
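Consistent with the abstract's description of a small MLP with negligible overhead, one plausible sketch is an MLP that maps the source and driving implicit keypoints to per-keypoint offsets; the class name, layer sizes, and input layout are assumptions.

```python
import torch
import torch.nn as nn

class StitchingModule(nn.Module):
    """Small MLP predicting keypoint offsets (sketch; sizes and inputs are assumptions)."""
    def __init__(self, num_kp=21, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_kp * 3 * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),
        )

    def forward(self, x_s, x_d):                    # source and driving keypoints, (B, K, 3) each
        B = x_s.shape[0]
        inp = torch.cat([x_s, x_d], dim=1).reshape(B, -1)
        delta = self.mlp(inp).view(B, -1, 3)
        return x_d + delta                          # stitched driving keypoints, aligned for pasting back
```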

Eyes and Lip Retargeting:

  • Two MLP-based modules allow the extent of eye and lip movements to be controlled independently, promoting realistic and expressive animations.
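A sketch of how such a retargeting module might be conditioned on a scalar control (e.g., a desired eye- or lip-openness ratio); the class name, the single-scalar conditioning, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class RetargetingModule(nn.Module):
    """Eye/lip retargeting as a small conditional MLP (sketch; sizes and conditioning are assumptions)."""
    def __init__(self, num_kp=21, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_kp * 3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),
        )

    def forward(self, x_s, ratio):                  # x_s: (B, K, 3), ratio: (B, 1) openness control
        inp = torch.cat([x_s.reshape(x_s.shape[0], -1), ratio], dim=1)
        return self.mlp(inp).view(x_s.shape)        # offsets added to the driving keypoints

# One module per region (eyes, lips) lets each be adjusted without touching the other.
eye_offsets = RetargetingModule()(torch.randn(1, 21, 3), torch.tensor([[0.4]]))
```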

Experimental Results

Self-Reenactment:

  • The model exhibits superior performance in self-reenactment tasks, preserving appearance details and effectively transferring facial motions.

Cross-Reenactment:

  • In cross-reenactment scenarios, LivePortrait demonstrates commendable capabilities in maintaining identity and transferring subtle facial expressions, outperforming existing diffusion-based models in efficiency and, in some cases, quality metrics.

Quantitative Metrics:

  • The paper details extensive quantitative evaluations where LivePortrait excels across multiple benchmarks, including PSNR, SSIM, LPIPS, FID, AED, and APD.
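For reference, the pixel-level metrics in self-reenactment (where the driving video doubles as ground truth) can be computed as below; LPIPS, FID, AED, and APD additionally require learned networks (a perceptual model, an Inception model, and expression/pose estimators) and are omitted from this sketch.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def self_reenactment_metrics(generated, ground_truth):
    """PSNR/SSIM between a generated frame and its ground-truth driving frame.

    generated, ground_truth: uint8 arrays of shape (H, W, 3).
    """
    psnr = peak_signal_noise_ratio(ground_truth, generated)
    ssim = structural_similarity(ground_truth, generated, channel_axis=-1)
    return psnr, ssim
```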

Implications and Future Work

The practical implications of this work are vast, potentially advancing applications in video conferencing, social media, and entertainment. By achieving real-time performance on a high-end GPU, LivePortrait sets the stage for accessible and efficient portrait animation.

However, the paper acknowledges limitations in handling large pose variations and anticipates further research to improve stability under significant motion conditions.

Conclusions

In summary, "LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control" provides a substantial advancement in portrait animation technology. By innovatively combining implicit-keypoint representations, scalable training practices, and advanced control mechanisms, the authors set a new benchmark for efficiency and quality in portrait animation systems. The research opens avenues for real-time, high-fidelity animation in a variety of practical applications.
