Emergent Mind

Abstract

We present HAHA, a novel approach for animatable human avatar generation from monocular input videos. The proposed method learns to trade off Gaussian splatting against a textured mesh for efficient, high-fidelity rendering. We demonstrate its efficiency in animating and rendering full-body human avatars controlled via the SMPL-X parametric model. Our model learns to apply Gaussian splatting only in areas of the SMPL-X mesh where it is necessary, such as hair and out-of-mesh clothing. As a result, a minimal number of Gaussians is used to represent the full avatar and rendering artifacts are reduced, which also allows us to handle the animation of small body parts, such as fingers, that are traditionally disregarded. We demonstrate the effectiveness of our approach on two open datasets: SnapshotPeople and X-Humans. Our method achieves reconstruction quality on par with the state of the art on SnapshotPeople while using less than a third of the Gaussians, and it outperforms the previous state of the art on novel poses from X-Humans both quantitatively and qualitatively.

A method that enhances avatar animation by applying Gaussians only where they are needed, improving photometric quality and efficiency.

Overview

  • The paper introduces a novel method for generating high-quality human avatars using Gaussian splatting combined with a textured mesh prior, aimed at reducing memory requirements.

  • It proposes an unsupervised method that significantly lowers the Gaussian count while preserving detailed areas such as fingers and faces, improving the efficiency of the scene representation.

  • The methodology involves initial Gaussian representations for capturing full-body details, then applying texture mapping on the SMPL-X mesh for near-surface regions, and finally merging these representations to optimize Gaussian usage.

  • Experimental results on datasets demonstrate superior avatar reconstruction with fewer Gaussians, highlighting the method's practical applications in real-time gaming, VR, and film production.
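The SMPL-X control mentioned above rests on linear blend skinning (LBS): each primitive is deformed by a weighted blend of posed joint transforms. As a minimal sketch, Gaussian centers can be posed the same way SMPL-X poses its mesh vertices (the function name and array layout here are illustrative, not the paper's implementation):

```python
import numpy as np

def skin_gaussian_means(means, weights, joint_transforms):
    """Deform Gaussian centers with linear blend skinning (LBS),
    the same mechanism SMPL-X uses to pose its mesh vertices.

    means:            (N, 3) Gaussian centers in the canonical pose
    weights:          (N, J) per-Gaussian skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) rigid transform of each joint
    """
    # Homogeneous coordinates: (N, 4)
    homo = np.concatenate([means, np.ones((len(means), 1))], axis=1)
    # Blend the joint transforms per Gaussian: (N, 4, 4)
    blended = np.einsum("nj,jab->nab", weights, joint_transforms)
    # Apply each blended transform to its Gaussian center
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3]
```

With identity joint transforms the centers are unchanged; translating a joint moves exactly the Gaussians skinned to it.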

Highly Articulated Gaussian Human Avatars Enhanced with a Textured Mesh Prior

Introduction

The realm of 3D computer vision has seen significant advancements in the generation of human avatars, especially with the proliferation of virtual and augmented reality applications. Creating photo-realistic, animatable human avatars remains challenging; it has largely been addressed with multi-view data or complex acquisition systems, which, despite their high quality, suffer from the complexity of data collection. An emerging line of work simplifies the process by using monocular videos, balancing the trade-off between input-data complexity and avatar quality by leveraging parametric models. The representation of these avatars has oscillated between explicit and implicit geometries until recently, when Gaussian splatting emerged as a promising alternative. Gaussian splatting improves temporal consistency and conveys out-of-mesh details more accurately than traditional methods. Nevertheless, it demands a high volume of Gaussians, substantially increasing memory requirements, particularly when animating detailed body parts such as fingers.
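The memory pressure is easy to see with a back-of-the-envelope estimate. Assuming the standard 3DGS parameterization (3 position + 3 scale + 4 rotation + 1 opacity + spherical-harmonic color coefficients; this layout is the common 3DGS convention, not a figure taken from this paper):

```python
def gaussian_memory_mb(num_gaussians, sh_degree=3, bytes_per_float=4):
    """Rough memory footprint of a 3DGS scene in MiB, assuming the
    standard parameterization: 3 position + 3 scale + 4 rotation
    (quaternion) + 1 opacity + 3*(sh_degree+1)**2 SH color coefficients
    per Gaussian, stored as 32-bit floats.
    """
    floats_per_gaussian = 3 + 3 + 4 + 1 + 3 * (sh_degree + 1) ** 2
    return num_gaussians * floats_per_gaussian * bytes_per_float / 2**20
```

At the default SH degree 3 this is 59 floats per Gaussian, so every additional hundred thousand Gaussians costs on the order of tens of MiB before any auxiliary buffers.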

Related Work

Traditional approaches often struggle to reconstruct elements like loose clothing and hair accurately. Hybrid methods such as Deferred Neural Rendering (DNR), while capable of representing more detail than an RGB texture, tend to exhibit temporal inconsistency and flickering. In contrast, recent works applying 3D Gaussian Splatting (3DGS) to human avatars mark a paradigm shift toward temporally consistent animated renderings with fewer artifacts, albeit at the cost of increased memory usage. In particular, GaussianAvatar and 3DGS-Avatar make strides in optimizing the representation but still require a substantial number of Gaussians, underscoring the need for a more memory-efficient solution.

Novel Contributions

In addressing these issues, "Highly Articulated Gaussian Human Avatars with Textured Mesh Prior" introduces an innovative approach that minimizes the number of Gaussians used by combining Gaussian splatting with a textured mesh prior, particularly for areas necessitating high detail levels such as fingers and facial expressions. This method significantly diminishes the memory footprint without compromising the quality of the avatar. Specifically, the paper:

  • Proposes merging Gaussian splatting with a textured mesh to enhance rendering efficiency of human avatars, a first in the field.
  • Develops an unsupervised methodology for substantially reducing the Gaussian count by utilizing a textured mesh, improving the efficiency of the scene representation.
  • Demonstrates the efficacy of this combined approach in managing the animation of intricate body parts without resorting to additional engineering, thus pushing the boundaries of what's achievable with current technology.

Methodology

The approach operates in three stages: it first employs Gaussian representations to capture full-body details, then transitions to texture mapping on the SMPL-X mesh to efficiently encode near-surface regions, and finally merges the two representations, optimizing the number of Gaussians via unsupervised learning to reduce memory requirements and rendering cost. This process reduces storage costs by a factor of more than 2.3 while maintaining an effective balance between high-quality detail and computational efficiency.
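The merging stage can be caricatured as follows. This is a minimal sketch assuming an L1 opacity penalty applied only where the textured mesh already explains the surface, followed by thresholded pruning; the names, the penalty form, and the threshold are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def merge_step(opacity_logits, mesh_covered, lr=0.1, reg=0.01):
    """One sketched unsupervised merging step: an L1 penalty on the
    opacity of Gaussians whose footprint the textured mesh already
    explains pushes them toward transparency. (Illustrative only.)

    opacity_logits: (N,) pre-sigmoid opacities
    mesh_covered:   (N,) in {0, 1}; 1 where the textured SMPL-X surface
                    already explains the region the Gaussian occupies
    """
    opacity = 1.0 / (1.0 + np.exp(-opacity_logits))  # sigmoid
    # Gradient of reg * opacity w.r.t. the logits, masked to covered areas
    grad = reg * mesh_covered * opacity * (1.0 - opacity)
    return opacity_logits - lr * grad

def prune(opacity_logits, tau=0.05):
    """Keep mask: drop Gaussians whose opacity fell below tau."""
    opacity = 1.0 / (1.0 + np.exp(-opacity_logits))
    return opacity >= tau
```

Iterating the step drives covered Gaussians transparent while leaving out-of-mesh ones (hair, loose clothing) untouched, after which the transparent ones can be pruned outright.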

Experimental Results and Analysis

Evaluative comparisons on the SnapshotPeople and X-Humans datasets with state-of-the-art models like GART and GaussianAvatar underscore the superior performance of the proposed method. It achieves comparable, if not superior, reconstruction quality with significantly fewer Gaussians, substantiating its practical viability for creating detailed human avatars from monocular video inputs.
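Reconstruction quality in such benchmarks is typically reported with pixel-wise metrics such as PSNR (often alongside SSIM and LPIPS). A minimal, self-contained PSNR implementation, not tied to the paper's exact evaluation code:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with
    values in [0, max_val]; higher is better."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

For example, a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and hence a PSNR of exactly 20 dB.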

Implications and Future Prospects

This research marks a pivotal advancement in human avatar generation, showcasing the feasibility of combining Gaussian splatting with textured meshes for high-fidelity, memory-efficient avatar rendering. The implications extend beyond academic interest, hinting at potential applications in real-time gaming, VR, and film production. Looking ahead, the methodology illuminates pathways for future research, especially in the nuanced animation of complex human movements and expressions, potentially revolutionizing the digital replication of human interactions in virtual spaces.
