HexaGen3D: Fast and Diverse Text-to-3D Generation with Pretrained 2D Diffusion Models

Abstract

Despite remarkable recent advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict six orthographic projections and the corresponding latent triplane. We then decode these latents into a textured mesh. HexaGen3D requires no per-sample optimization and can infer high-quality, diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs compared to existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.

Figure: Overview of the HexaGen3D text-to-3D generation pipeline.

Overview

  • HexaGen3D is a method that adapts 2D diffusion models to efficiently create 3D objects from text.

  • It pairs a variational autoencoder, which learns a triplanar latent representation of textured meshes, with a finetuned text-to-image model that predicts those triplane latents, and requires no per-sample optimization.

  • The process includes generating orthographic hexaviews and enhancing the output through UV texture baking (see the sketch after this list).

  • Empirical tests show qualitative improvements and substantially lower latency than comparable methods.

  • The research opens up new avenues for applying 2D generative models in 3D content creation for various industries.
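
As a concrete illustration of the hexaview bullet above, here is a minimal PyTorch sketch of one plausible way to tile six orthographic renders into a single image-like tensor so that a pretrained 2D diffusion UNet can process all views jointly. The 3×2 grid layout and the function names are illustrative assumptions, not the paper's exact convention.

```python
import torch

def pack_hexaview(views: torch.Tensor) -> torch.Tensor:
    """Tile six orthographic renders (6, C, H, W) into one (C, 2H, 3W) canvas."""
    assert views.shape[0] == 6, "expected exactly six orthographic views"
    rows = [torch.cat(list(views[i * 3:(i + 1) * 3]), dim=-1)  # three views per row
            for i in range(2)]
    return torch.cat(rows, dim=-2)

def unpack_hexaview(canvas: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Inverse of pack_hexaview: (C, 2h, 3w) -> (6, C, h, w)."""
    tiles = [canvas[:, r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(2) for c in range(3)]
    return torch.stack(tiles)
```

Packing the views this way would let a finetuned UNet reuse its pretrained 2D convolutions and attention across all six views at once, which is one plausible reading of how hexaview prediction stays close to ordinary image synthesis.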

Introduction to HexaGen3D

In the domain of 3D asset generation, efficiency and quality are paramount. Traditional approaches have been hampered by long generation times and scarce training data. HexaGen3D addresses these challenges by adapting pre-existing 2D diffusion models to the task of creating 3D objects from textual prompts.

Overcoming Data Scarcity and Enhancing Speed

HexaGen3D capitalizes on the capabilities of large, pretrained 2D diffusion models, finetuning them to jointly predict six orthographic projections of an object and the corresponding latent triplane representation. Unlike many current methods that require costly optimization for each individual sample, HexaGen3D infers high-quality, varied 3D objects from textual prompts in roughly 7 seconds, striking a superior quality-to-latency balance compared with existing techniques.
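
To make the feedforward nature of this concrete, the sketch below shows what such an inference path could look like. All module and parameter names (HexaGen3DPipeline, sampler, and so on) are hypothetical, since no official implementation is given here; the structure simply mirrors the text: encode the prompt, jointly denoise hexaview and triplane latents, then decode a mesh.

```python
import torch

class HexaGen3DPipeline(torch.nn.Module):
    """Hypothetical feedforward pipeline: prompt -> latents -> textured mesh."""

    def __init__(self, text_encoder, unet, triplane_decoder, sampler):
        super().__init__()
        self.text_encoder = text_encoder          # frozen text encoder from the 2D model
        self.unet = unet                          # finetuned to denoise hexaview + triplane latents
        self.triplane_decoder = triplane_decoder  # VAE decoder: triplane latents -> textured mesh
        self.sampler = sampler                    # standard diffusion sampler (e.g. DDIM)

    @torch.no_grad()
    def forward(self, prompt: str, steps: int = 50):
        cond = self.text_encoder(prompt)
        # One sampling loop jointly denoises the six orthographic views and
        # the latent triplane -- no per-sample test-time optimization.
        hexaview_latents, triplane_latents = self.sampler(self.unet, cond, steps=steps)
        mesh = self.triplane_decoder(triplane_latents)
        return mesh, hexaview_latents
```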

Technique and Methodology

HexaGen3D's process comprises two pivotal stages: learning a triplanar latent representation of textured meshes with a variational autoencoder (VAE), and finetuning a pretrained text-to-image model to synthesize new triplane latents. A central feature of this approach is "Orthographic Hexaview guidance," an intermediary task in which the model predicts six orthographic projections to bridge the gap between 2D image synthesis and 3D reasoning. At inference time, HexaGen3D eschews per-sample optimization in favor of feedforward generation of a 3D textured mesh, which is subsequently enhanced through a UV texture baking procedure. This post-processing step leverages the detailed hexaview predictions to refine the visual quality of the final output.
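
The baking step can be pictured as reprojecting the six orthographic renders back onto the surface. Below is a simplified sketch under stated assumptions: it colors mesh vertices rather than UV texels, ignores per-view mirroring conventions, and assumes the object lives in the [-1, 1]^3 cube; HexaGen3D's actual UV texture baking procedure is more involved.

```python
import torch

# Outward-facing axis of each orthographic view (front/back/right/left/top/bottom).
VIEW_AXES = torch.tensor([[0., 0., 1.], [0., 0., -1.],
                          [1., 0., 0.], [-1., 0., 0.],
                          [0., 1., 0.], [0., -1., 0.]])
# In-plane (u, v) coordinate indices per view; mirroring/flips omitted for brevity.
UV_INDEX = [(0, 1), (0, 1), (2, 1), (2, 1), (0, 2), (0, 2)]

def bake_vertex_colors(verts: torch.Tensor, normals: torch.Tensor,
                       hexaviews: torch.Tensor) -> torch.Tensor:
    """verts/normals: (N, 3), verts in [-1, 1]^3; hexaviews: (6, 3, H, W) renders."""
    H, W = hexaviews.shape[-2:]
    # Per vertex, pick the view whose outward axis best matches the surface normal.
    view_id = (normals @ VIEW_AXES.T).argmax(dim=1)
    colors = torch.empty(verts.shape[0], 3)
    for k in range(6):
        mask = view_id == k
        if not mask.any():
            continue
        u_idx, v_idx = UV_INDEX[k]
        # Map [-1, 1] plane coordinates to pixel indices in view k.
        u = ((verts[mask, u_idx] + 1) / 2 * (W - 1)).long().clamp(0, W - 1)
        v = ((1 - (verts[mask, v_idx] + 1) / 2) * (H - 1)).long().clamp(0, H - 1)
        colors[mask] = hexaviews[k, :, v, u].T
    return colors
```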

Comparisons and Results

HexaGen3D has been empirically compared with state-of-the-art text-to-3D models, including DreamFusion and MVDream, within a standardized evaluation framework. The results are telling: HexaGen3D not only offers qualitative improvements but is also significantly faster, while exhibiting greater object diversity across prompts. Ablations confirm the contribution of each component: hexaview baking improves visual quality, and multi-view prediction makes generation more robust. HexaGen3D represents a marked shift in 3D generation methodology, delivering substantial gains in speed and efficiency without compromising asset quality.

Looking Ahead

The development of HexaGen3D marks a considerable step forward. Its ability to quickly generate diverse, high-fidelity 3D objects from textual prompts taps the previously underexploited potential of 2D generative models for 3D content creation. Future work could improve mesh quality and explore applications across domains such as gaming, virtual reality, and design. The rapid, diverse text-to-3D asset generation capability of HexaGen3D positions Qualcomm AI Research's latest work as a significant innovation in the landscape of 3D content creation.
