
GALA: Generating Animatable Layered Assets from a Single Scan

(arXiv:2401.12979)
Published Jan 23, 2024 in cs.CV

Abstract

We present GALA, a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars in any pose. Existing reconstruction approaches often treat clothed humans as a single layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is challenging because it requires synthesizing plausible geometry and texture for the severely occluded regions. Moreover, even after successful decomposition, the meshes are not normalized in terms of pose and body shape, preventing coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as a geometry and appearance prior for humans and other assets. We first separate the input mesh using a 3D surface segmentation extracted from multi-view 2D segmentations. We then synthesize the missing geometry of the different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. After inpainting the high-fidelity 3D geometry, we apply the same SDS loss to its texture to obtain the complete appearance, including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space, normalized in terms of pose and body shape, thereby supporting effortless composition with novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach for decomposition, canonicalization, and composition tasks compared to existing solutions.

Figure: The GALA method learns object and human layers in canonical space using the DMTet representation.

Overview

  • GALA enables automatic transformation of single-layer 3D human scans into animatable, multi-layered 3D assets, enhancing virtual try-on and avatar customization.

  • The framework decomposes single-layer meshes into complete, separated assets, using a pretrained 2D diffusion model to fill in occluded geometry and texture, and supports seamless recomposition and animation.

  • GALA's method incorporates geometric decomposition, texture generation in occluded areas, and asset refinement to optimize layering and pose versatility.

  • Experiments demonstrate that GALA surpasses existing solutions in creating refined 3D models capable of realistic animation across different identities and poses.

  • The paper signals a step forward in avatar creation, suggesting future research on pose-dependent deformations and reducing reliance on 2D segmentations.

Introduction

Virtual try-on and avatar customization are significant areas of interest in online environments. 3D human models can now be captured with relative ease, but the resulting scans are typically static, single-layer meshes with limited support for animation and customization. Traditionally, creating animatable and layerable 3D assets has been a manual, time-consuming endeavor. To address this, the authors introduce GALA (Generating Animatable Layered Assets from a Single Scan), a framework that automatically transforms single-layer 3D human scans into animatable, multi-layered 3D assets.

Approach

GALA decomposes a single-layer mesh, typically obtained from a clothed human 3D scan, into separate, reusable layered assets. Its key advantage is that it generates complete assets, including the occluded regions, which is critical for seamless recomposition and animation. The framework relies on a pretrained 2D diffusion model, drawing on the vast image corpus it was trained on, to complete missing geometry and texture. GALA first separates the input mesh using a 3D surface segmentation lifted from multi-view 2D segmentations, and then reconstructs high-fidelity geometry and texture in both posed and canonical spaces with a pose-guided Score Distillation Sampling (SDS) loss.
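At the heart of this optimization is the SDS objective, which distills a frozen 2D diffusion model's denoising signal into the 3D representation through a differentiable renderer. The following is a minimal, generic sketch of one such update step, not GALA's actual implementation: `render`, `denoiser`, and the conditioning argument `cond` (which would carry the text prompt and pose guidance) are assumed placeholders, and the noise schedule is a simple stand-in.

```python
# Minimal sketch of an SDS-style update (generic illustration, not GALA's code).
# Assumptions: `render(params)` differentiably renders the current 3D asset to an
# image tensor, and `denoiser(noisy, t, cond)` is a frozen diffusion model that
# predicts the injected noise given text/pose conditioning `cond`.
import torch

def sds_step(params, render, denoiser, cond, optimizer, num_timesteps=1000):
    image = render(params)                        # (B, C, H, W), requires grad
    device = image.device

    # DDPM-style linear noise schedule (stand-in for the prior's own schedule).
    betas = torch.linspace(1e-4, 2e-2, num_timesteps, device=device)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(20, num_timesteps, (1,), device=device)
    a_bar = alpha_bars[t].view(1, 1, 1, 1)
    noise = torch.randn_like(image)
    noisy = a_bar.sqrt() * image + (1.0 - a_bar).sqrt() * noise

    with torch.no_grad():                         # the diffusion prior stays frozen
        eps_pred = denoiser(noisy, t, cond)       # cond carries text + pose guidance

    # SDS gradient: weighted residual between predicted and injected noise,
    # back-propagated through the renderer only (the U-Net Jacobian is skipped).
    w = 1.0 - a_bar
    grad = w * (eps_pred - noise)
    loss = (grad.detach() * image).sum()          # surrogate: d(loss)/d(image) == grad

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

In GALA, this objective is pose-guided: the diffusion prior is conditioned on the body pose so that geometry and texture stay consistent across the posed and canonical spaces; the exact conditioning mechanism is specific to the paper and not reproduced here.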

Method

GALA combines a pretrained 2D diffusion model with Deep Marching Tetrahedra (DMTet), a hybrid geometric representation that supports differentiable mesh extraction. The pipeline begins with geometric decomposition, using the 3D surface segmentation and the pose-guided SDS loss to model each layer in a canonical space that supports reanimation. It then generates textures for the occluded regions, again driven by the pose-guided SDS loss in canonical space. Finally, a composition step refines vertex positions to minimize penetration between layers and penalize misalignments. The result is 3D assets that can be rigged, posed, and layered effectively.
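For the composition refinement step, a common way to realize a penetration penalty is to query a signed distance to the inner (body) layer and push outer-layer vertices back outside a small margin. The sketch below is a hedged illustration under that assumption, not the paper's exact formulation; `body_sdf`, the margin, and the regularizer weight are all hypothetical.

```python
# Hedged sketch of a layer-penetration penalty (illustrative, not GALA's exact loss).
# Assumption: `body_sdf(points)` returns signed distances to the inner (body) layer,
# negative inside the body and positive outside.
import torch

def penetration_loss(outer_vertices, body_sdf, margin=2e-3):
    """Penalize outer-layer vertices that sink into (or graze) the inner layer."""
    sdf = body_sdf(outer_vertices)            # (N,) signed distances
    violation = torch.relu(margin - sdf)      # > 0 only for vertices closer than `margin`
    return violation.pow(2).mean()

def refine_vertices(outer_vertices, body_sdf, steps=200, lr=1e-3):
    """Tiny refinement loop: reduce penetration while staying near the input shape."""
    rest = outer_vertices.detach()
    verts = outer_vertices.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([verts], lr=lr)
    for _ in range(steps):
        loss = penetration_loss(verts, body_sdf) \
             + 1e-2 * (verts - rest).pow(2).mean()   # keep vertices near the input layer
        opt.zero_grad()
        loss.backward()
        opt.step()
    return verts.detach()
```

The margin and the rest-shape weight above are arbitrary illustrative values; in the paper, penetration between the decomposed layers is minimized over the vertex positions during composition.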

Evaluation

GALA is evaluated through a series of experiments and benchmarks against existing solutions, demonstrating superior performance on decomposition, canonicalization, and composition tasks. The method maintains geometry and texture integrity even in heavily occluded regions, and the resulting assets can be combined with various identities and animated across a range of poses while retaining a high level of realism.

Conclusion

GALA marks a significant step toward automated, high-fidelity creation of avatars and digital apparel. The flexibility and quality of its output assets hold promise for a range of applications, and the authors point to future work on pose-dependent deformations and on reducing the reliance on 2D segmentations. Together with the benchmark and the planned release of the codebase, this makes GALA a valuable resource for researchers working on virtual human representations.
