HexaGen3D: Fast and Diverse Text-to-3D Generation with Pretrained 2D Diffusion Models

Abstract

Despite remarkable recent advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict six orthographic projections and the corresponding latent triplane. We then decode these latents into a textured mesh. HexaGen3D requires no per-sample optimization and can infer high-quality, diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs compared to existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.

Figure: Overview of the HexaGen3D text-to-3D generation pipeline.

Overview

  • HexaGen3D is a method that adapts 2D diffusion models to efficiently create 3D objects from text.

  • It pairs a variational autoencoder, which learns a triplanar latent representation of textured meshes, with a finetuned text-to-image model that predicts those triplane latents, and requires no per-sample optimization.

  • The process includes generating orthographic hexaviews and enhancing the output through UV texture baking (see the sketch after this list).

  • Empirical tests show qualitative improvements and substantially lower latency than comparable methods.

  • The research opens up new avenues for applying 2D generative models in 3D content creation for various industries.
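
As a concrete illustration of the hexaview bullet above, here is a minimal PyTorch sketch of one plausible way to tile six orthographic renders into a single image-like tensor so that a pretrained 2D diffusion UNet can process all views jointly. The 3×2 grid layout and the function names are illustrative assumptions, not the paper's exact convention.

```python
import torch

def pack_hexaview(views: torch.Tensor) -> torch.Tensor:
    """Tile six orthographic renders (6, C, H, W) into one (C, 2H, 3W) canvas."""
    assert views.shape[0] == 6, "expected exactly six orthographic views"
    rows = [torch.cat(list(views[i * 3:(i + 1) * 3]), dim=-1)  # three views per row
            for i in range(2)]
    return torch.cat(rows, dim=-2)

def unpack_hexaview(canvas: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Inverse of pack_hexaview: (C, 2h, 3w) -> (6, C, h, w)."""
    tiles = [canvas[:, r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(2) for c in range(3)]
    return torch.stack(tiles)
```

Packing the views this way would let a finetuned UNet reuse its pretrained 2D convolutions and attention across all six views at once, which is one plausible reading of how hexaview prediction stays close to ordinary image synthesis.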

Introduction to HexaGen3D

In the domain of 3D asset generation, efficiency and quality are paramount. Traditional approaches have been hampered by long generation times and scarce training data. HexaGen3D addresses these challenges by adapting pre-existing 2D diffusion models to the task of creating 3D objects from textual prompts.

Overcoming Data Scarcity and Enhancing Speed

HexaGen3D capitalizes on the capabilities of large, pretrained 2D diffusion models, finetuning them to jointly predict six orthographic projections of an object and the corresponding latent triplane representation. Unlike many current methods that require costly optimization for each individual sample, HexaGen3D infers high-quality, varied 3D objects from textual prompts in roughly 7 seconds, striking a superior quality-to-latency balance compared with existing techniques.
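
To make the feedforward nature of this concrete, the sketch below shows what such an inference path could look like. All module and parameter names (HexaGen3DPipeline, sampler, and so on) are hypothetical, since no official implementation is given here; the structure simply mirrors the text: encode the prompt, jointly denoise hexaview and triplane latents, then decode a mesh.

```python
import torch

class HexaGen3DPipeline(torch.nn.Module):
    """Hypothetical feedforward pipeline: prompt -> latents -> textured mesh."""

    def __init__(self, text_encoder, unet, triplane_decoder, sampler):
        super().__init__()
        self.text_encoder = text_encoder          # frozen text encoder from the 2D model
        self.unet = unet                          # finetuned to denoise hexaview + triplane latents
        self.triplane_decoder = triplane_decoder  # VAE decoder: triplane latents -> textured mesh
        self.sampler = sampler                    # standard diffusion sampler (e.g. DDIM)

    @torch.no_grad()
    def forward(self, prompt: str, steps: int = 50):
        cond = self.text_encoder(prompt)
        # One sampling loop jointly denoises the six orthographic views and
        # the latent triplane -- no per-sample test-time optimization.
        hexaview_latents, triplane_latents = self.sampler(self.unet, cond, steps=steps)
        mesh = self.triplane_decoder(triplane_latents)
        return mesh, hexaview_latents
```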

Technique and Methodology

HexaGen3D's process comprises two pivotal stages: learning a triplanar latent representation of textured meshes with a variational autoencoder (VAE), and finetuning a pretrained text-to-image model to synthesize new triplane latents. A central feature of this approach is "Orthographic Hexaview guidance," an intermediary task in which the model predicts six orthographic projections to bridge the gap between 2D image synthesis and 3D reasoning. At inference time, HexaGen3D eschews per-sample optimization in favor of feedforward generation of a 3D textured mesh, which is subsequently enhanced through a UV texture baking procedure. This post-processing step leverages the detailed hexaview predictions to refine the visual quality of the final output.
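
The baking step can be pictured as reprojecting the six orthographic renders back onto the surface. Below is a simplified sketch under stated assumptions: it colors mesh vertices rather than UV texels, ignores per-view mirroring conventions, and assumes the object lives in the [-1, 1]^3 cube; HexaGen3D's actual UV texture baking procedure is more involved.

```python
import torch

# Outward-facing axis of each orthographic view (front/back/right/left/top/bottom).
VIEW_AXES = torch.tensor([[0., 0., 1.], [0., 0., -1.],
                          [1., 0., 0.], [-1., 0., 0.],
                          [0., 1., 0.], [0., -1., 0.]])
# In-plane (u, v) coordinate indices per view; mirroring/flips omitted for brevity.
UV_INDEX = [(0, 1), (0, 1), (2, 1), (2, 1), (0, 2), (0, 2)]

def bake_vertex_colors(verts: torch.Tensor, normals: torch.Tensor,
                       hexaviews: torch.Tensor) -> torch.Tensor:
    """verts/normals: (N, 3), verts in [-1, 1]^3; hexaviews: (6, 3, H, W) renders."""
    H, W = hexaviews.shape[-2:]
    # Per vertex, pick the view whose outward axis best matches the surface normal.
    view_id = (normals @ VIEW_AXES.T).argmax(dim=1)
    colors = torch.empty(verts.shape[0], 3)
    for k in range(6):
        mask = view_id == k
        if not mask.any():
            continue
        u_idx, v_idx = UV_INDEX[k]
        # Map [-1, 1] plane coordinates to pixel indices in view k.
        u = ((verts[mask, u_idx] + 1) / 2 * (W - 1)).long().clamp(0, W - 1)
        v = ((1 - (verts[mask, v_idx] + 1) / 2) * (H - 1)).long().clamp(0, H - 1)
        colors[mask] = hexaviews[k, :, v, u].T
    return colors
```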

Comparisons and Results

HexaGen3D has been empirically compared with state-of-the-art text-to-3D models, including DreamFusion and MVDream, within a standardized evaluation framework. The results are telling: HexaGen3D not only offers qualitative improvements but is also significantly faster, while exhibiting greater object diversity across prompts. Ablations confirm the contribution of each component: hexaview baking improves visual quality, and multi-view prediction makes generation more robust. HexaGen3D represents a marked shift in 3D generation methodology, delivering substantial gains in speed and efficiency without compromising asset quality.

Looking Ahead

The development of HexaGen3D marks a considerable step forward. Its ability to quickly generate diverse, high-fidelity 3D objects from textual prompts taps the previously underexploited potential of 2D generative models for 3D content creation. Future work could improve mesh quality and explore applications across domains such as gaming, virtual reality, and design. The rapid, diverse text-to-3D asset generation capability of HexaGen3D positions Qualcomm AI Research's latest work as a significant innovation in the landscape of 3D content creation.
