- The paper introduces AGRoL, a novel diffusion-based model that accurately synthesizes full-body motion from minimal tracking inputs.
- It employs a multi-layer perceptron backbone with time step embedding to ensure smooth, temporally coherent movements and minimize jitter artifacts.
- Extensive experiments on the AMASS dataset show significant reductions in MPJPE and MPJRE over state-of-the-art baselines, demonstrating AGRoL's potential for real-time VR/AR applications.
Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model
The paper addresses a key challenge in augmented reality (AR) and virtual reality (VR): synthesizing full-body motion for 3D avatars from sparse tracking data. In typical consumer setups, the only tracking signals come from a head-mounted display (HMD) and hand controllers, covering just the user's head and wrists. Since the lower body is not tracked at all, controlling a full-body avatar requires a synthesis method that infers the pose and movement of the entire body. The authors introduce AGRoL, short for "Avatars Grow Legs," a diffusion-based model that predicts smooth full-body motion from these minimal upper-body signals.
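To make the problem setup concrete, here is a minimal sketch of the tensors involved; the window length, feature layout, and joint count below are illustrative assumptions, not the paper's exact encoding.

```python
import torch

# Illustrative dimensions (assumptions, not the paper's exact encoding):
N = 196          # frames in a motion window
D_IN = 3 * 18    # 3 tracked devices (head + two wrists), 18 features each
J = 22           # body joints of an SMPL-style skeleton
D_OUT = J * 6    # per-joint rotation in the 6D representation

sparse_signal = torch.randn(1, N, D_IN)   # what the HMD/controllers provide
full_body = torch.randn(1, N, D_OUT)      # what the model must synthesize

# The generative task is to model p(full_body | sparse_signal), so the
# untracked lower body is inferred rather than directly observed.
```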
A key design decision behind AGRoL is the choice of a diffusion model, a class of generative models that has proved effective in tasks such as image synthesis. After experimenting with several neural network architectures, the authors settled on a multi-layer perceptron (MLP) backbone for its simplicity and efficiency. This MLP-based design is paired with a strategy for injecting the diffusion time-step embedding into the network, which preserves temporal coherence and mitigates the jitter artifacts that commonly plague neural motion prediction.
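The sketch below shows one plausible way such an MLP denoiser with per-block time-step embedding injection could be wired up in PyTorch. The widths, depth, and exact injection scheme are assumptions for illustration, not the paper's precise architecture.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of diffusion time steps; returns shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class MLPBlock(nn.Module):
    """Residual MLP block with the time-step embedding injected per block."""
    def __init__(self, width: int):
        super().__init__()
        self.norm = nn.LayerNorm(width)
        self.fc = nn.Linear(width, width)
        self.time_proj = nn.Linear(width, width)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Adding the projected time embedding inside every block keeps the
        # network aware of the current noise level at each layer.
        h = self.fc(self.norm(x)) + self.time_proj(t_emb).unsqueeze(1)
        return x + self.act(h)

class MLPDenoiser(nn.Module):
    """Predicts clean full-body motion from a noisy sample, the sparse
    tracking condition, and the diffusion time step (all illustrative)."""
    def __init__(self, d_motion: int, d_cond: int, width: int = 512, depth: int = 8):
        super().__init__()
        self.width = width
        self.in_proj = nn.Linear(d_motion + d_cond, width)
        self.blocks = nn.ModuleList([MLPBlock(width) for _ in range(depth)])
        self.out_proj = nn.Linear(width, d_motion)

    def forward(self, x_noisy, cond, t):
        t_emb = timestep_embedding(t, self.width)
        h = self.in_proj(torch.cat([x_noisy, cond], dim=-1))
        for block in self.blocks:
            h = block(h, t_emb)
        return self.out_proj(h)

# Shape check: a batch of 2 windows, 196 frames each.
model = MLPDenoiser(d_motion=22 * 6, d_cond=3 * 18)
out = model(torch.randn(2, 196, 132), torch.randn(2, 196, 54),
            torch.randint(0, 1000, (2,)))
print(out.shape)  # torch.Size([2, 196, 132])
```

Injecting the time embedding at every block, rather than only at the input, lets each layer condition on the current noise level, which is one way such a design can reduce temporal artifacts in the denoised motion.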
Extensive experiments on the AMASS motion capture dataset show that AGRoL synthesizes more realistic and smoother body motion than state-of-the-art methods. Quantitatively, it achieves lower Mean Per Joint Position Error (MPJPE) and Mean Per Joint Rotation Error (MPJRE), as well as lower jitter and velocity errors, while running fast enough for real-time use.
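For reference, these metrics are commonly computed from definitions like the following; the joint set, units, and frame rate used for evaluation follow the paper and prior work, so this is an illustrative sketch rather than the exact evaluation code.

```python
import torch

def mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joint positions. Shapes: (T, J, 3)."""
    return (pred - gt).norm(dim=-1).mean()

def mpjre(pred_rot: torch.Tensor, gt_rot: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Rotation Error: geodesic angle (in degrees) between
    predicted and ground-truth rotation matrices. Shapes: (T, J, 3, 3)."""
    rel = pred_rot.transpose(-1, -2) @ gt_rot
    trace = rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    angle = torch.acos(((trace - 1.0) / 2.0).clamp(-1.0, 1.0))
    return torch.rad2deg(angle).mean()

def jitter(joints: torch.Tensor, fps: float = 60.0) -> torch.Tensor:
    """Mean jerk (third derivative of joint position), a common smoothness
    proxy: lower values mean smoother motion. Shape: (T, J, 3)."""
    vel = (joints[1:] - joints[:-1]) * fps   # finite-difference velocity
    acc = (vel[1:] - vel[:-1]) * fps         # acceleration
    jerk = (acc[1:] - acc[:-1]) * fps        # jerk
    return jerk.norm(dim=-1).mean()
```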
The implications of this research are multifaceted. Practically, it enables a more seamless and realistic user experience in VR, with avatars that move fluidly and track the user's actions accurately. Theoretically, AGRoL's success validates diffusion models for real-time motion synthesis, extending their reach beyond static data generation to dynamic, temporally sensitive tasks. The model also remains robust when tracking signals are partially missing, an important property for real-world deployments where inputs can drop out.
Several directions for future work stand out: applying AGRoL to more diverse and complex motion data such as dance or sports, which involve rapid movements and non-standard poses; integrating physical-interaction constraints or reinforcement learning to make the synthesized motion more responsive to environmental cues; and generalizing the approach to related settings such as real-time game character control.
In conclusion, AGRoL is a notable advance in motion synthesis, combining diffusion-based generation with a focus on efficiency and accuracy. It demonstrates how a well-chosen, simple architecture can substantially improve the quality and realism of synthesized avatar motion, narrowing the gap between human users and their virtual representations.