
Abstract

Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation. Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability. Building upon this, we develop a video-driven portrait animation framework named LivePortrait with a focus on better generalization, controllability, and efficiency for practical usage. To enhance the generation quality and generalization ability, we scale up the training data to about 69 million high-quality frames, adopt a mixed image-video training strategy, upgrade the network architecture, and design better motion transformation and optimization objectives. Additionally, we discover that compact implicit keypoints can effectively represent a kind of blendshapes and meticulously propose a stitching and two retargeting modules, which utilize a small MLP with negligible computational overhead, to enhance the controllability. Experimental results demonstrate the efficacy of our framework even compared to diffusion-based methods. The generation speed remarkably reaches 12.8ms on an RTX 4090 GPU with PyTorch. The inference code and models are available at https://github.com/KwaiVGI/LivePortrait

Figure: Optimization of the stitching and retargeting modules, performed after freezing the appearance and motion extractors, the warping module, and the decoder.

Overview

  • The paper introduces LivePortrait, a framework for animating static portrait images with a focus on realism and computational efficiency, diverging from traditional diffusion-based methods.

  • Core contributions include an implicit-keypoint-based framework, scaled-up training data of roughly 69 million high-quality frames combined with a mixed image-video training strategy, and an upgraded network architecture with improved motion transformation and optimization objectives.

  • Further innovations include stitching and retargeting modules for fine-grained control of eye and lip movements; the full system performs strongly in both self-reenactment and cross-reenactment scenarios and runs in real time on a high-end GPU (about 12.8 ms per frame on an RTX 4090).

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

The paper "LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control" by Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, et al. introduces an innovative framework for animating static portrait images, prioritizing both realism and computational efficiency. The proposed method diverges from mainstream diffusion-based approaches, instead extending the capabilities of the implicit-keypoint-based framework. This paper makes significant strides in enhancing the generalization, controllability, and efficiency of portrait animation systems.

Key Contributions

The core contributions of the paper include:

  1. Implicit-Keypoint-Based Framework: Leveraging compact implicit keypoints as the motion representation to balance computational efficiency and precise control.
  2. Scalable Training Data: Utilizing a large-scale dataset of approximately 69 million high-quality frames and adopting a mixed image-video training strategy.
  3. Network Architecture Improvements: Enhancing the network components and proposing improved motion transformation and optimization objectives.
  4. Stitching and Retargeting Modules: Introducing low-overhead modules for stitching and precise control of eye and lip movements.
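To make the overall data flow concrete, here is a minimal sketch of how these pieces fit together at inference time. The function and variable names (extract_appearance, extract_motion, warp_and_decode, the keypoint count K = 21) are placeholders of my own, not the paper's API, and the toy implementations only mirror the structure: appearance features come from the source once, motion comes from each driving frame, and a keypoint-driven warp plus decoder produce the output.

```python
import torch

# Toy stand-ins for the learned networks (appearance extractor, motion extractor,
# warping module, decoder). Only the data flow mirrors the paper; the real
# components are deep CNNs with trained weights.
def extract_appearance(source):             # source image -> appearance feature volume
    return source.mean(dim=1, keepdim=True).expand(-1, 32, -1, -1)

def extract_motion(frame):                  # frame -> canonical kps, pose, expression, scale, translation
    B = frame.shape[0]
    return {
        "x_c": torch.randn(B, 21, 3),       # canonical implicit keypoints (K = 21 assumed)
        "R": torch.eye(3).expand(B, 3, 3),  # head rotation
        "t": torch.zeros(B, 3),             # translation
        "s": torch.ones(B, 1, 1),           # scale factor
        "delta": torch.zeros(B, 21, 3),     # expression deformation
    }

def transform_keypoints(m):                 # x = s * (x_c @ R + delta) + t
    return m["s"] * (m["x_c"] @ m["R"] + m["delta"]) + m["t"].unsqueeze(1)

def warp_and_decode(feat, x_s, x_d):        # warp source features by keypoint motion, then decode
    return feat[:, :3]                      # placeholder output image

source = torch.rand(1, 3, 256, 256)                         # single portrait image
driving = [torch.rand(1, 3, 256, 256) for _ in range(4)]    # driving video frames

feat_s = extract_appearance(source)
x_s = transform_keypoints(extract_motion(source))           # source keypoints, computed once

for frame in driving:
    x_d = transform_keypoints(extract_motion(frame))        # driving keypoints per frame
    out = warp_and_decode(feat_s, x_s, x_d)                 # animated frame: source appearance, driving motion
```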

Methodology

The paper's methodology is rooted in several impactful enhancements to the traditional implicit-keypoint-based framework:

Data Curation and Mixed Training:

  • The authors curated a vast and diverse training dataset comprising public video datasets, proprietary 4K resolution portrait clips, and styled portrait images.
  • A novel mixed training strategy allows the model to leverage both static images and dynamic videos, enhancing generalization capabilities to various portrait styles.
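A minimal sketch of what such mixed sampling could look like; the sampling ratio, pairing policy, and the idea of treating a still image as a one-frame clip are assumptions on my part, not details given in the paper.

```python
import random

def sample_training_pair(videos, images, p_image=0.3):
    """Mixed image-video sampling (sketch; p_image and the pairing policy are assumptions).

    videos: list of frame lists (one list per clip); images: list of single styled portraits.
    Returns a (source_frame, driving_frame) pair drawn from the same identity.
    """
    if images and random.random() < p_image:
        frame = random.choice(images)
        return frame, frame                         # a still image acts as a one-frame clip
    clip = random.choice(videos)
    if len(clip) > 1:
        src, drv = random.sample(range(len(clip)), 2)
    else:
        src, drv = 0, 0
    return clip[src], clip[drv]
```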

Network Upgrades:

  • Integration of the canonical implicit keypoint detection, head pose estimation, and expression deformation networks into a single motion extractor using ConvNeXt-V2-Tiny as the backbone.
  • Incorporation of a SPADE decoder as the generator to improve the quality and resolution of the animated images.
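A sketch of what a unified motion extractor with multiple prediction heads might look like. The class name, head dimensions, keypoint count, and the toy convolutional stack standing in for ConvNeXt-V2-Tiny are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MotionExtractor(nn.Module):
    """Single-network motion extractor (sketch; head sizes and K = 21 are assumptions)."""
    def __init__(self, num_kp=21, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(                     # stand-in for ConvNeXt-V2-Tiny
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.kp_head    = nn.Linear(feat_dim, num_kp * 3)  # canonical keypoints x_c
        self.pose_head  = nn.Linear(feat_dim, 3)           # yaw, pitch, roll
        self.trans_head = nn.Linear(feat_dim, 3)           # translation t
        self.scale_head = nn.Linear(feat_dim, 1)           # scale s
        self.exp_head   = nn.Linear(feat_dim, num_kp * 3)  # expression deformation delta

    def forward(self, img):
        f, B = self.backbone(img), img.shape[0]
        return {
            "x_c": self.kp_head(f).view(B, -1, 3),
            "angles": self.pose_head(f),
            "t": self.trans_head(f),
            "s": self.scale_head(f),
            "delta": self.exp_head(f).view(B, -1, 3),
        }

motion = MotionExtractor()(torch.rand(2, 3, 256, 256))
print({k: v.shape for k, v in motion.items()})
```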

Scalable Motion Transformation:

  • Inclusion of a scaling factor in motion transformation, balancing the flexibility and stability of expression deformations.
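Based on the paper's description of folding a scale factor into the keypoint transformation, a plausible form of the driving transformation (the exact notation is an assumption) is

    x_d = s_d · (x_{c,s} R_d + δ_d) + t_d,

where x_{c,s} are the canonical implicit keypoints of the source, R_d the driving head rotation, δ_d the expression deformation, t_d the translation, and s_d the scale factor. Placing s_d outside the parentheses lets a single factor jointly modulate the pose and expression deformations, which is what trades flexibility against stability.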

Landmark-Guided Optimization:

  • Introduction of a landmark-guided loss to refine the learning of implicit keypoints, focusing particularly on subtle facial movements like eye gaze adjustments.
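A minimal sketch of one way such a loss could be formed, assuming the implicit keypoints are projected to 2D and compared against landmarks from an off-the-shelf detector restricted to the eye and lip regions; the projection, matching, and L1 form are my assumptions.

```python
import torch

def landmark_guided_loss(implicit_kp_3d, detected_landmarks_2d):
    """Landmark-guided term (sketch; matching and projection details are assumptions).

    implicit_kp_3d: (B, K, 3) implicit keypoints for the current frame.
    detected_landmarks_2d: (B, K, 2) 2D landmarks (e.g., eyes/lips) matched to those keypoints.
    """
    pred_2d = implicit_kp_3d[..., :2]                    # orthographic projection: drop depth
    return (pred_2d - detected_landmarks_2d).abs().mean()  # L1 distance
```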

Cascaded Loss Terms:

  • Implementation of multi-region perceptual and GAN losses, alongside a face-id loss and the landmark-guided loss to improve both identity preservation and animation quality.
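Illustratively, the total objective would be a weighted sum of these terms; the weights below and the exact formulations of each term (region crops, GAN variant, face-ID network) are assumptions, since the summary does not specify them.

```python
def total_loss(l_perceptual, l_gan, l_faceid, l_landmark,
               w_p=1.0, w_g=0.1, w_id=0.5, w_lmk=1.0):
    # Weighted combination of the cascaded loss terms (weights are illustrative).
    return w_p * l_perceptual + w_g * l_gan + w_id * l_faceid + w_lmk * l_landmark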

Stitching and Retargeting

The framework includes sophisticated modules for stitching and retargeting that allow for enhanced controllability with minimal computational overhead:

Stitching Module:

  • The stitching module mitigates pixel misalignment, enabling accurate reconstruction of the animated region onto the original image space.
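Consistent with the abstract's description of a small MLP with negligible overhead, one plausible sketch is an MLP that maps the source and driving implicit keypoints to per-keypoint offsets; the class name, layer sizes, and input layout are assumptions.

```python
import torch
import torch.nn as nn

class StitchingModule(nn.Module):
    """Small MLP predicting keypoint offsets (sketch; sizes and inputs are assumptions)."""
    def __init__(self, num_kp=21, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_kp * 3 * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),
        )

    def forward(self, x_s, x_d):                    # source and driving keypoints, (B, K, 3) each
        B = x_s.shape[0]
        inp = torch.cat([x_s, x_d], dim=1).reshape(B, -1)
        delta = self.mlp(inp).view(B, -1, 3)
        return x_d + delta                          # stitched driving keypoints, aligned for pasting back
```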

Eyes and Lip Retargeting:

  • Two MLP-based modules allow the extent of eye and lip movements to be controlled independently, promoting realistic and expressive animations.
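A sketch of how such a retargeting module might be conditioned on a scalar control (e.g., a desired eye- or lip-openness ratio); the class name, the single-scalar conditioning, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class RetargetingModule(nn.Module):
    """Eye/lip retargeting as a small conditional MLP (sketch; sizes and conditioning are assumptions)."""
    def __init__(self, num_kp=21, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_kp * 3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),
        )

    def forward(self, x_s, ratio):                  # x_s: (B, K, 3), ratio: (B, 1) openness control
        inp = torch.cat([x_s.reshape(x_s.shape[0], -1), ratio], dim=1)
        return self.mlp(inp).view(x_s.shape)        # offsets added to the driving keypoints

# One module per region (eyes, lips) lets each be adjusted without touching the other.
eye_offsets = RetargetingModule()(torch.randn(1, 21, 3), torch.tensor([[0.4]]))
```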

Experimental Results

Self-Reenactment:

  • The model exhibits superior performance in self-reenactment tasks, preserving appearance details and effectively transferring facial motions.

Cross-Reenactment:

  • In cross-reenactment scenarios, LivePortrait demonstrates commendable capabilities in maintaining identity and transferring subtle facial expressions, outperforming existing diffusion-based models in efficiency and, in some cases, quality metrics.

Quantitative Metrics:

  • The paper details extensive quantitative evaluations where LivePortrait excels across multiple benchmarks, including PSNR, SSIM, LPIPS, FID, AED, and APD.
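For reference, the pixel-level metrics in self-reenactment (where the driving video doubles as ground truth) can be computed as below; LPIPS, FID, AED, and APD additionally require learned networks (a perceptual model, an Inception model, and expression/pose estimators) and are omitted from this sketch.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def self_reenactment_metrics(generated, ground_truth):
    """PSNR/SSIM between a generated frame and its ground-truth driving frame.

    generated, ground_truth: uint8 arrays of shape (H, W, 3).
    """
    psnr = peak_signal_noise_ratio(ground_truth, generated)
    ssim = structural_similarity(ground_truth, generated, channel_axis=-1)
    return psnr, ssim
```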

Implications and Future Work

The practical implications of this work are vast, potentially advancing applications in video conferencing, social media, and entertainment. By achieving real-time performance on a high-end GPU, LivePortrait sets the stage for accessible and efficient portrait animation.

However, the paper acknowledges limitations in handling large pose variations and anticipates further research to improve stability under significant motion conditions.

Conclusions

In summary, "LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control" provides a substantial advancement in portrait animation technology. By innovatively combining implicit-keypoint representations, scalable training practices, and advanced control mechanisms, the authors set a new benchmark for efficiency and quality in portrait animation systems. The research opens avenues for real-time, high-fidelity animation in a variety of practical applications.
