
Make-A-Shape: a Ten-Million-scale 3D Shape Model

(2401.11067)
Published Jan 20, 2024 in cs.CV and cs.GR

Abstract

Significant progress has been made in training large generative models for natural language and images. Yet, the advancement of 3D generative models is hindered by their substantial resource demands for training, along with inefficient, non-compact, and less expressive representations. This paper introduces Make-A-Shape, a new 3D generative model designed for efficient training on a vast scale, capable of utilizing 10 million publicly available shapes. On the technical side, we first innovate a wavelet-tree representation to compactly encode shapes by formulating the subband coefficient filtering scheme to efficiently exploit coefficient relations. We then make the representation generatable by a diffusion model by devising the subband coefficients packing scheme to lay out the representation in a low-resolution grid. Further, we derive the subband adaptive training strategy to train our model to effectively learn to generate coarse and detail wavelet coefficients. Last, we extend our framework to be controlled by additional input conditions to enable it to generate shapes from assorted modalities, e.g., single/multi-view images, point clouds, and low-resolution voxels. In our extensive set of experiments, we demonstrate various applications, such as unconditional generation, shape completion, and conditional generation on a wide range of modalities. Our approach not only surpasses the state of the art in delivering high-quality results but also efficiently generates shapes within a few seconds, often achieving this in just 2 seconds for most conditions.

Overview

  • Make-A-Shape introduces an innovative wavelet-tree representation for large-scale 3D model training, effectively handling over ten million shapes.

  • The framework uses wavelet decomposition to retain both coarse and detail subband coefficients, enabling nearly lossless encoding of 3D shapes.

  • Training efficiency is achieved through a diffusion model paired with a subband adaptive training strategy that captures both coarse structure and fine shape details.

  • Make-A-Shape can perform conditional generation from various inputs, facilitating practical applications with diverse requirements.

  • Extensive experiments show the model's superior performance in generation tasks and its potential for zero-shot shape completion.

Introduction

In the pursuit of more advanced 3D generative models, there remains a gap in representation efficacy and training efficiency on large datasets. The Make-A-Shape framework is introduced to bridge this gap. Offering a comprehensive approach to efficient large-scale 3D model training, it handles over ten million shapes, marking a leap forward in addressing the prevalent issues in 3D generative modeling.

The Wavelet-Tree Representation

Make-A-Shape innovates with the wavelet-tree representation, applying a wavelet decomposition to a high-resolution signed distance function (SDF) grid. This yields a representation that retains both coarse and detail subband coefficients, marrying expressiveness with compactness, a vital advantage for streaming and training on extensive 3D shape datasets. By retaining these coefficients rather than discarding high-frequency details for learning efficiency, the representation encodes 3D shapes nearly losslessly, in contrast to prior models that tend to lose detail for efficiency.
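To make the idea concrete, here is a minimal sketch, using PyWavelets, of decomposing an SDF grid into coarse and detail subbands and reconstructing it almost losslessly. The grid resolution, wavelet family, and decomposition level are illustrative assumptions, not the paper's exact settings.

```python
# A minimal sketch (not the authors' code) of decomposing a signed-distance
# grid into coarse and detail wavelet subbands with PyWavelets.
import numpy as np
import pywt

# Toy SDF grid: signed distance to a sphere on a 64^3 lattice
# (the paper operates on much higher-resolution grids).
res = 64
axis = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

# Multi-level 3D wavelet decomposition: coeffs[0] is the coarse approximation
# subband; coeffs[1:] hold the detail subbands per level, each a dict keyed
# by filter pattern ('aad', 'ada', ..., 'ddd').
coeffs = pywt.wavedecn(sdf, wavelet="bior6.8", level=3, mode="periodization")
coarse = coeffs[0]
print("coarse subband:", coarse.shape)
for lvl, detail in enumerate(coeffs[1:], start=1):
    print(f"level {lvl} detail subbands:", {k: v.shape for k, v in detail.items()})

# Keeping both coarse and detail coefficients lets the grid be reconstructed
# almost exactly (floating-point error only):
recon = pywt.waverecn(coeffs, wavelet="bior6.8", mode="periodization")
print("max reconstruction error:", np.abs(recon - sdf).max())
```

Dropping the detail dictionaries before reconstruction would recover only a blurred shape, which is exactly the detail loss the wavelet-tree representation is designed to avoid.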

Efficient Training with the Diffusion Model

The model overcomes inefficient learning by packing the wavelet-tree coefficients into a diffusible grid layout amenable to a diffusion-based generative model. A subband adaptive training strategy ensures the model captures the full spectrum of shape detail, from coarse structure to fine geometric detail, avoiding the collapse or ineffective learning that a naive mean-squared-error objective could cause.
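The sketch below illustrates, under assumed dimensions and weights, two of the ideas above: folding a finer detail subband into extra channels of the coarse grid so the diffusion network sees a single low-resolution tensor, and a subband-weighted MSE in place of a single naive MSE. It is a hedged approximation, not the paper's exact packing or training scheme.

```python
# Hedged PyTorch sketch: (1) pack a higher-resolution detail subband into
# extra channels of the coarse grid; (2) weight coarse and detail losses
# separately instead of using one naive MSE. Weights and layout are
# illustrative assumptions.
import torch

def pack_subbands(coarse: torch.Tensor, detail: torch.Tensor) -> torch.Tensor:
    """coarse: (B, 1, R, R, R); detail: (B, C, 2R, 2R, 2R) -> (B, 1+8C, R, R, R)."""
    B, C, D, H, W = detail.shape
    # Fold each 2x2x2 block of the detail grid into the channel dimension.
    d = detail.reshape(B, C, D // 2, 2, H // 2, 2, W // 2, 2)
    d = d.permute(0, 1, 3, 5, 7, 2, 4, 6).reshape(B, C * 8, D // 2, H // 2, W // 2)
    return torch.cat([coarse, d], dim=1)

def subband_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                         coarse_weight: float = 1.0,
                         detail_weight: float = 0.25) -> torch.Tensor:
    """Balance the coarse channel against the many packed detail channels,
    so plentiful near-zero detail coefficients do not drown out coarse structure."""
    coarse_loss = (pred[:, :1] - target[:, :1]).pow(2).mean()
    detail_loss = (pred[:, 1:] - target[:, 1:]).pow(2).mean()
    return coarse_weight * coarse_loss + detail_weight * detail_loss

# Example with toy tensors (coarse 16^3 grid, 7 detail subbands at 32^3):
coarse = torch.randn(2, 1, 16, 16, 16)
detail = torch.randn(2, 7, 32, 32, 32)
packed = pack_subbands(coarse, detail)          # (2, 57, 16, 16, 16)
loss = subband_weighted_mse(torch.randn_like(packed), packed)
print(packed.shape, loss.item())
```

The packed tensor keeps the spatial footprint of the coarse subband, which is what makes a standard low-resolution diffusion backbone applicable to the full wavelet-tree representation.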

Conditional Generation Capability

Make-A-Shape also extends its utility to conditional generation, handling a variety of inputs. Different modalities, including single/multi-view images, point clouds, and low-resolution voxels, are accommodated by converting conditions into latent vectors, followed by employing these vectors in the generative network. This modular approach enables the framework to adapt to diverse inputs effortlessly, a characteristic that positions it for practical applications where conditions might differ significantly.
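A hypothetical sketch of such a conditioning pathway is shown below: a toy point-cloud encoder maps its input to a small set of latent vectors, and the generative network attends to them through cross-attention. The encoder design, token count, and dimensions are placeholders, not the paper's actual networks.

```python
# Minimal, hypothetical conditioning pathway: modality -> latent vectors ->
# cross-attention into the generator's feature tokens.
import torch
import torch.nn as nn

LATENT_DIM = 256

class PointCloudEncoder(nn.Module):
    """Toy encoder: per-point MLP, then attention pooling into a few latent tokens."""
    def __init__(self, num_tokens: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM))
        self.tokens = nn.Parameter(torch.randn(num_tokens, LATENT_DIM))
        self.attn = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)

    def forward(self, points: torch.Tensor) -> torch.Tensor:  # (B, N, 3)
        feats = self.mlp(points)                               # (B, N, LATENT_DIM)
        queries = self.tokens.unsqueeze(0).expand(points.size(0), -1, -1)
        latents, _ = self.attn(queries, feats, feats)          # (B, num_tokens, LATENT_DIM)
        return latents

class CrossAttentionBlock(nn.Module):
    """Injects condition latents into the generator's feature tokens."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(LATENT_DIM)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(self.norm(x), cond, cond)
        return x + attended                                    # residual update

# Example: 1024 generator feature tokens conditioned on a 2048-point cloud.
encoder, block = PointCloudEncoder(), CrossAttentionBlock()
cond = encoder(torch.randn(2, 2048, 3))
features = torch.randn(2, 1024, LATENT_DIM)
print(block(features, cond).shape)   # torch.Size([2, 1024, 256])
```

Because every modality is reduced to the same kind of latent vectors, swapping the point-cloud encoder for an image or voxel encoder leaves the generative network untouched, which is what makes the approach modular.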

Experiments and Results

The model's proficiency is evidenced by extensive experimental validation. It generates condition-aware 3D shapes that outperform the state of the art, particularly with image inputs, where it faithfully reproduces the visible parts of objects while presenting credible variations for the unseen parts. The framework also shows adaptability, swiftly adjusting to point cloud density variations and voxel resolutions without sacrificing quality.

Importantly, the framework paves the way for tasks beyond generation, such as zero-shot shape completion, where it can inventively fill gaps in partial inputs. This versatility extends the utility of Make-A-Shape into domains where object restoration or extrapolation is essential.
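One common way to realize such zero-shot completion with a diffusion model is inpainting-style sampling, sketched below under simplified assumptions: at every denoising step, the observed region of the grid is replaced by a re-noised copy of the partial input, so only the missing region is synthesized. The schedule, denoiser, and mask here are stand-ins, and the paper's exact procedure may differ.

```python
# Illustrative inpainting-style completion loop for a diffusion model over a
# (packed wavelet) grid. Schedule, denoiser, and mask are toy stand-ins.
import torch

def complete_shape(denoiser, known: torch.Tensor, mask: torch.Tensor,
                   alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """known: partial grid; mask: 1 where values are observed, 0 where missing."""
    x = torch.randn_like(known)
    for t in range(len(alphas_cumprod) - 1, -1, -1):
        a_bar = alphas_cumprod[t]
        # Re-noise the observed region to the current noise level and paste it in.
        noised_known = a_bar.sqrt() * known + (1 - a_bar).sqrt() * torch.randn_like(known)
        x = mask * noised_known + (1 - mask) * x
        # One (greatly simplified) reverse step: predict noise, take a DDIM-like update.
        eps = denoiser(x, t)
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps
    return mask * known + (1 - mask) * x

# Toy usage with a dummy denoiser and a half-observed 16^3 grid.
dummy_denoiser = lambda x, t: torch.zeros_like(x)
alphas_cumprod = torch.linspace(0.999, 0.01, 50)
known = torch.randn(1, 1, 16, 16, 16)
mask = torch.zeros_like(known); mask[..., :8] = 1.0
print(complete_shape(dummy_denoiser, known, mask, alphas_cumprod).shape)
```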

Conclusions and Future Directions

Make-A-Shape marks a significant step in large-scale 3D shape modeling, providing a route to training generative models that synthesize high-quality outputs rapidly. One limitation, however, is the model's bias toward certain object categories due to training-data imbalance. In addition, the current focus is solely on geometry, without consideration of texture. Future work could mitigate these limitations by exploring category annotations and introducing texture into the generative process. The promise that Make-A-Shape holds for 3D content creation, simulation, and potentially virtual reality and gaming is substantial.
