Abstract

We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.

Overview

  • Researchers have developed a framework for creating photorealistic, full-bodied avatars that gesture in response to conversational speech audio.

  • The avatars exhibit a wide range of gestures and facial expressions, combining vector quantization and diffusion models to produce motion synchronized with the spoken dialogue.

  • A first-of-its-kind multi-view conversational dataset enables photorealistic reconstruction and supports training and evaluation of the gesture-generation models.

  • The technology enhances virtual interactions and has applications in virtual meetings, online education, and social VR, encouraging further research.

  • Limitations include difficulty translating long-range conversational context into gestures and a dataset restricted to a small set of consenting subjects due to privacy concerns.

Overview of Synthesizing Full-Bodied Photorealistic Avatars

Researchers have developed a framework that creates full-bodied, photorealistic avatars which gesture in response to the dynamics of a dyadic (two-person) conversation, driven solely by speech audio. This technology has the potential to improve the realism and expressiveness of digital human avatars, particularly in virtual communication scenarios.

The Science Behind Generating Dynamic Gestures

The methodology combines the sample diversity obtained from vector quantization with the high-frequency detail afforded by diffusion models. This allows the avatars to exhibit a wide range of gestures and nuanced facial expressions (like subtle sneers or smirks) that are synchronized with the spoken dialogue. The generated motion spans not only the body but also the face and hands, produced at a high frame rate to convey intricate movements, as sketched below.
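
The following PyTorch sketch is a minimal illustration of this two-stage idea, not the authors' released code: a VQ codebook supplies coarse, diverse "guide" poses, and a diffusion-style denoiser conditioned on those guides plus audio fills in high-frequency motion. The module names, dimensions, and the simplified fixed-step refinement loop are illustrative assumptions.

```python
# Minimal sketch, assuming illustrative dimensions and a toy denoiser.
import torch
import torch.nn as nn

POSE_DIM, AUDIO_DIM, CODEBOOK_SIZE = 104, 128, 512

class GuidePoseVQ(nn.Module):
    """Samples a sequence of coarse guide poses from a learned codebook."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, POSE_DIM)
        self.prior = nn.GRU(POSE_DIM + AUDIO_DIM, 256, batch_first=True)
        self.logits = nn.Linear(256, CODEBOOK_SIZE)

    @torch.no_grad()
    def sample(self, audio):
        # audio: (B, T, AUDIO_DIM) -> coarse guide poses (B, T, POSE_DIM)
        B, T, _ = audio.shape
        pose = torch.zeros(B, 1, POSE_DIM)
        hidden, guides = None, []
        for t in range(T):
            step = torch.cat([pose, audio[:, t:t + 1]], dim=-1)
            out, hidden = self.prior(step, hidden)
            idx = torch.distributions.Categorical(logits=self.logits(out)).sample()
            pose = self.codebook(idx)  # pick one diverse coarse pose per step
            guides.append(pose)
        return torch.cat(guides, dim=1)

class MotionDenoiser(nn.Module):
    """Toy denoiser: refines noisy motion given guide poses and audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(POSE_DIM * 2 + AUDIO_DIM, 512), nn.SiLU(),
            nn.Linear(512, POSE_DIM))

    def forward(self, noisy_motion, guides, audio):
        return self.net(torch.cat([noisy_motion, guides, audio], dim=-1))

if __name__ == "__main__":
    B, T = 2, 16
    audio = torch.randn(B, T, AUDIO_DIM)
    guides = GuidePoseVQ().sample(audio)        # coarse but diverse structure
    motion = torch.randn(B, T, POSE_DIM)        # start from noise
    denoiser = MotionDenoiser()
    for _ in range(8):                          # crude fixed-step refinement
        motion = denoiser(motion, guides, audio)
    print(motion.shape)                         # torch.Size([2, 16, 104])
```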

To support this area of study, the researchers have introduced a unique dataset, which is the first to offer multi-view conversational footage that enables photorealistic reconstruction. The experimental evaluations underscore the model's effectiveness in generating varied and fitting gestures, outperforming both diffusion-only and VQ-only baselines.

The Technology and Data

At the heart of this technology are two separate models: one for the face, leveraging an audio-conditioned diffusion model, and another for the body and hands, which combines an autoregressive VQ-based method with a diffusion model. The personalized avatars are visualized through a neural renderer trained with multi-view capture data.
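
As a rough illustration of how these pieces might be wired together at inference time, the sketch below treats the face model, body model, and renderer as interchangeable callables; the function signature and names are placeholders assumed for illustration rather than the released interface.

```python
import torch

def generate_avatar_frames(audio_feats: torch.Tensor, face_model, body_model, renderer) -> torch.Tensor:
    """Drive both motion models from the same speech features, then render.

    audio_feats: (T, AUDIO_DIM) features for one speaker's conversation segment.
    face_model / body_model / renderer stand in for the audio-conditioned face
    diffusion model, the VQ-plus-diffusion body/hand model, and the
    person-specific neural renderer described above.
    """
    face_motion = face_model(audio_feats)       # facial expression sequence
    body_motion = body_model(audio_feats)       # body and hand pose sequence
    return renderer(face_motion, body_motion)   # photorealistic video frames
```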

The researchers also compiled a new dataset to enable these advancements. The dataset consists of long-form dyadic interactions that cover a broad spectrum of emotions and conversational topics. Unlike previous datasets limited to skeletal or cartoon-like visualizations, this one supports photorealistic reconstruction of the participants, capturing the subtleties of real human interactions.

Implications and Applications

This technology has major implications for the future of virtual interaction systems. The ability to generate realistic avatars that respond naturally to audio cues can greatly enhance telepresence in technology such as virtual meetings, online education, and social VR. Additionally, the released dataset and code are intended to spur further research into gesture generation with high-fidelity avatars, paving the way for more natural and immersive virtual experiences.

Reflecting on the Current Limitations

While this new method shows promising results in generating lifelike gestures for short audio segments, it is less adept at synthesizing movements that require a deep understanding of long-range conversational content. Additionally, the study currently focuses on a small group of subjects for whom consent has been granted, addressing privacy concerns while limiting the variety of avatars that can be generated. Despite these limitations, the project sets a new precedent in the development of photorealistic interactive avatars and poses essential questions about the future evaluation of such technology.
