AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

(2403.17694)
Published Mar 26, 2024 in cs.CV, cs.GR, and eess.IV

Abstract

In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at https://github.com/scutzzj/AniPortrait

The proposed method extracts a 3D facial mesh and head pose from audio, projects them to 2D landmarks, and renders the portrait video.

Overview

  • AniPortrait is a framework for generating photorealistic portrait animations from an audio input and a static reference image, addressing the twin challenges of visual fidelity and temporal consistency.

  • The framework uses a two-stage approach: audio features are first converted into a sequence of 2D facial landmarks, and a diffusion model integrated with a motion module then renders those landmarks into realistic animation.

  • AniPortrait produces animations with natural facial expressions and movements, advancing over existing methods in realism and visual quality.

  • Its flexibility opens applications in facial motion editing and face reenactment; identified future directions include refining the generation process and predicting video directly from audio.

Audio-Driven Synthesis of Photorealistic Portrait Animation with AniPortrait

Introduction to AniPortrait

Generating expressive and realistic portrait animations from audio inputs and static images has numerous applications in digital media, virtual reality, and gaming. The challenge lies in producing animations that are visually pleasing and temporally consistent. AniPortrait is a framework designed to tackle this problem, generating high-quality animation from an audio input and a reference portrait image. It adopts a two-stage approach: the audio is first converted into a sequence of 2D facial landmarks, and a robust diffusion model integrated with a motion module then translates those landmarks into photorealistic, temporally consistent animated portraits.

Technical Approach

In the first stage, AniPortrait extracts features from the audio and transforms them into 3D facial meshes and head poses using transformer-based models; these 3D representations are then projected into a sequence of 2D facial landmarks. The landmarks capture the detail intended for the final animation, from subtle expressions to head movements synchronized with the rhythm of the audio.
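To make this stage concrete, here is a minimal PyTorch sketch of an audio-to-landmark pipeline in this spirit: per-frame audio features (such as those from a pretrained wav2vec 2.0 encoder) pass through a small transformer that predicts mesh vertices and a head pose per frame, which are then projected to 2D. The module names, dimensions, 468-vertex mesh, and the simplified pinhole projection are illustrative assumptions rather than the authors' exact architecture.

```python
# Minimal sketch of the audio-to-landmark stage, assuming PyTorch.
import torch
import torch.nn as nn

class Audio2MeshPose(nn.Module):
    """Maps per-frame audio features (e.g. from a pretrained wav2vec 2.0
    encoder) to 3D face-mesh vertices and a head pose per frame."""
    def __init__(self, feat_dim=768, n_vertices=468, d_model=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.mesh_head = nn.Linear(d_model, n_vertices * 3)  # per-frame 3D vertices
        self.pose_head = nn.Linear(d_model, 6)                # rotation (3) + translation (3)

    def forward(self, audio_feats):                       # (B, T, feat_dim)
        h = self.temporal(self.proj(audio_feats))          # (B, T, d_model)
        mesh = self.mesh_head(h).view(*h.shape[:2], -1, 3)  # (B, T, V, 3)
        pose = self.pose_head(h)                             # (B, T, 6)
        return mesh, pose

def project_to_2d(mesh, pose, focal=1.0):
    """Pinhole projection of the posed mesh to 2D landmarks.
    The head rotation is omitted here purely to keep the sketch short."""
    trans = pose[..., 3:]                  # translation component only
    cam = mesh + trans.unsqueeze(-2)       # place the mesh in camera space
    z = cam[..., 2:3].clamp(min=1e-3)      # avoid division by zero
    return focal * cam[..., :2] / z        # (B, T, V, 2) landmark sequence

audio_feats = torch.randn(1, 25, 768)      # 25 frames of audio features
mesh, pose = Audio2MeshPose()(audio_feats)
landmarks_2d = project_to_2d(mesh, pose)
print(landmarks_2d.shape)                  # torch.Size([1, 25, 468, 2])
```

In the actual system the 2D landmarks would then be rasterized into per-frame conditioning images for the second stage.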

The second stage uses a diffusion model equipped with a motion module to turn the processed landmark sequence into fluid, lifelike animated portraits. Modifications to the network architecture, inspired by prior work, improve the realism of lip movements, an aspect that is often the hardest to get right in audio-driven animation. The framework's use of 3D intermediate representations also improves flexibility and controllability, broadening its applicability to tasks such as facial motion editing and face reenactment.
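A similarly simplified sketch of this stage is shown below: a DDPM-style denoising loop that conditions a video denoiser on the rendered landmark frames and the reference portrait. The tiny convolutional network stands in for the paper's pose-guided UNet with its appearance branch and motion module, and the scheduler constants are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the landmark-to-video stage: diffusion denoising conditioned
# on a reference portrait and rendered landmark frames (placeholder networks).
import torch
import torch.nn as nn

class TinyVideoDenoiser(nn.Module):
    """Placeholder for the pose-guided video UNet + motion module."""
    def __init__(self, channels=3):
        super().__init__()
        # Conditioning = noisy frame + landmark render + reference image.
        self.net = nn.Sequential(
            nn.Conv2d(channels * 3, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, noisy, lmk_frames, reference):
        b, t = noisy.shape[:2]
        ref = reference.unsqueeze(1).expand(-1, t, -1, -1, -1)
        x = torch.cat([noisy, lmk_frames, ref], dim=2)          # (B, T, 3C, H, W)
        return self.net(x.flatten(0, 1)).unflatten(0, (b, t))   # predicted noise

@torch.no_grad()
def generate_video(denoiser, lmk_frames, reference, steps=50):
    """Iteratively denoises a random video tensor into portrait frames."""
    x = torch.randn_like(lmk_frames)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    for i in reversed(range(steps)):
        eps = denoiser(x, lmk_frames, reference)
        # DDPM-style posterior mean update (noise term omitted at the last step).
        alpha, a_bar = 1.0 - betas[i], alphas_bar[i]
        x = (x - betas[i] / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x.clamp(-1, 1)                                       # (B, T, C, H, W)

lmks = torch.randn(1, 8, 3, 64, 64)   # 8 rendered landmark frames
ref = torch.randn(1, 3, 64, 64)       # reference portrait image
video = generate_video(TinyVideoDenoiser(), lmks, ref)
print(video.shape)                    # torch.Size([1, 8, 3, 64, 64])
```

The key design point carried over from the paper is the conditioning: every denoising step sees both the landmark frames (driving motion) and the reference portrait (preserving identity), while a temporal module enforces consistency across frames.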

Experimental Success

In experiments, AniPortrait generates animations with natural facial expressions, diverse poses, and high visual quality, surpassing existing methods in realism and visual appeal. The use of diffusion models contributes notably to the quality of the generated content, particularly its photorealism and temporal consistency.

Implications and Future Directions

Practically, AniPortrait opens promising avenues in facial motion editing and face reenactment, and its success suggests new ways to improve virtual interaction and engagement across digital platforms. Theoretically, the work deepens the understanding of how diffusion models can generate dynamic visual content from static images and audio inputs.

Looking ahead, the methodology's reliance on intermediate 3D representations highlights an area ripe for future exploration. The acquisition of large-scale, high-quality 3D data remains a significant challenge, potentially limiting the range of expressions and postures achievable within the animations. Future efforts could explore direct prediction methods from audio to video, aiming to bypass limitations related to 3D data acquisition and further push the boundaries of animation realism.

In conclusion, AniPortrait sets a new benchmark for portrait animation by combining audio-driven synthesis with advanced diffusion models. As the community continues to refine these techniques, the resulting animated content should only become more lifelike and expressive.
