Abstract

We present a two-stage text-to-3D generation system, namely 3DTopia, which generates high-quality general 3D assets within 5 minutes using hybrid diffusion priors. The first stage samples from a 3D diffusion prior directly learned from 3D data. Specifically, it is powered by a text-conditioned tri-plane latent diffusion model, which quickly generates coarse 3D samples for fast prototyping. The second stage utilizes 2D diffusion priors to further refine the texture of coarse 3D models from the first stage. The refinement consists of both latent and pixel space optimization for high-quality texture generation. To facilitate the training of the proposed system, we clean and caption the largest open-source 3D dataset, Objaverse, by combining the power of vision language models and LLMs. Experimental results are reported qualitatively and quantitatively to show the performance of the proposed system. Our code and models are available at https://github.com/3DTopia/3DTopia.

Figure: 3DTopia's text-to-3D generation pipeline stages and results, from initial text prompts to refined outputs.

Overview

  • 3DTopia introduces a two-stage text-to-3D generation framework that combines hybrid diffusion priors with training on the Objaverse dataset to improve the creation of 3D models from textual descriptions.

  • The system generates coarse 3D models rapidly using a text-conditioned tri-plane latent diffusion model, followed by a refinement stage for texture enhancement using 2D diffusion priors.

  • A refined subset of the Objaverse dataset with over 360K captions was utilized, improving training efficiency and output quality.

  • Experimental results show 3DTopia outperforms existing methods like Point-E and Shap-E in generating detailed 3D assets quickly and accurately, promising applications in gaming, virtual reality, and beyond.

Unveiling 3DTopia: A Novel Approach for Text-to-3D Generation Using Hybrid Diffusion Priors

Introduction

In the burgeoning field of 3D asset generation, transforming textual descriptions into detailed 3D models remains a challenging yet highly desirable goal. The paper introduces 3DTopia, a two-stage text-to-3D generation framework that significantly improves the quality and efficiency of producing 3D models from textual inputs. Leveraging hybrid diffusion priors and the largest open-source 3D dataset, Objaverse, 3DTopia generates general 3D assets with high fidelity within minutes.

3DTopia: A System Overview

3DTopia combines a feed-forward generative model with optimization-based refinement to achieve both rapid prototyping and high-quality 3D output. The system comprises two stages: the rapid generation of coarse 3D samples through a text-conditioned tri-plane latent diffusion model, followed by refinement of these samples for enhanced texture detail.

Stage 1: Coarse 3D Generation

The first stage uses a tri-plane latent diffusion model learned directly from 3D data to generate coarse 3D samples quickly. The tri-plane representation strikes a balance between compact storage and efficient computation, making it well suited to training on large-scale datasets.
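To make the first stage concrete, the sketch below shows what sampling a tri-plane latent with a text-conditioned diffusion model could look like. It is a minimal PyTorch illustration under assumptions, not the actual 3DTopia implementation: the `TriplaneDenoiser` class, the tensor shapes, and the noise schedule are placeholders, and the resulting latent would still need to be decoded by the tri-plane VAE decoder and rendered.

```python
# Minimal sketch of stage-1 sampling: a text-conditioned diffusion model
# denoises a tri-plane latent starting from pure Gaussian noise (DDIM-style).
# TriplaneDenoiser and all shapes/hyperparameters are illustrative stand-ins,
# not the actual 3DTopia modules.
import torch
import torch.nn as nn


class TriplaneDenoiser(nn.Module):
    """Stand-in for the text-conditioned noise-prediction network."""

    def __init__(self, latent_dim: int = 32, text_dim: int = 512):
        super().__init__()
        self.net = nn.Linear(latent_dim + text_dim + 1, latent_dim)

    def forward(self, z, t, text_emb):
        t_feat = t.float().view(-1, 1) / 1000.0           # crude timestep embedding
        return self.net(torch.cat([z, text_emb, t_feat], dim=-1))


@torch.no_grad()
def sample_coarse_triplane(denoiser, text_emb, latent_dim=32, steps=50):
    """Deterministic DDIM sampling in the tri-plane latent space."""
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    z = torch.randn(1, latent_dim)                         # start from pure noise
    ts = torch.linspace(999, 0, steps).long()
    for i, t in enumerate(ts):
        eps = denoiser(z, t.reshape(1), text_emb)          # predicted noise
        a_t = alphas_cum[t]
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        a_prev = alphas_cum[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        z = a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps # DDIM update
    return z                                               # decode with the tri-plane VAE


text_emb = torch.randn(1, 512)                             # placeholder text embedding
coarse_latent = sample_coarse_triplane(TriplaneDenoiser(), text_emb)
```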

Stage 2: Texture Refinement

Building on the coarse models produced by the first stage, the second stage refines their textures using 2D diffusion priors within an optimization-based framework. The refinement optimizes in both latent and pixel space to produce high-quality textures. This dual-phase design combines the strengths of both worlds: rapid prototyping and meticulous detail enhancement.
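As one concrete example of optimization under a 2D diffusion prior, the sketch below follows a Score Distillation Sampling (SDS) style loop: rendered views of the current model are noised, a frozen 2D diffusion model predicts the noise, and the difference drives gradients back into the texture parameters. This is a generic illustration rather than 3DTopia's exact procedure (which also optimizes in latent space); `render_views` and `diffusion_eps` are hypothetical placeholders for a differentiable renderer and the frozen 2D prior.

```python
# Illustrative SDS-style texture refinement under a frozen 2D diffusion prior.
# render_views and diffusion_eps are hypothetical callables (differentiable
# renderer and noise predictor); they are not part of the 3DTopia codebase.
import torch


def sds_refine(texture_params, render_views, diffusion_eps, text_emb,
               iters=500, lr=1e-2):
    """Optimize texture parameters so rendered views agree with the 2D prior."""
    opt = torch.optim.Adam([texture_params], lr=lr)
    alphas_cum = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
    for _ in range(iters):
        img = render_views(texture_params)                 # differentiable rendering
        t = torch.randint(20, 980, (1,))                   # random diffusion timestep
        a_t = alphas_cum[t]
        noise = torch.randn_like(img)
        noisy = a_t.sqrt() * img + (1 - a_t).sqrt() * noise
        with torch.no_grad():
            eps_pred = diffusion_eps(noisy, t, text_emb)   # frozen 2D prior
        grad = (1.0 - a_t) * (eps_pred - noise)            # weighted SDS gradient
        loss = (grad.detach() * img).sum()                 # surrogate loss: dL/d img = grad
        opt.zero_grad()
        loss.backward()
        opt.step()
    return texture_params


# Toy usage with stand-in callables (a real setup would use tri-plane/NeRF
# rendering and a pretrained text-to-image diffusion model).
texture = torch.randn(3, 64, 64, requires_grad=True)
refined = sds_refine(
    texture,
    render_views=lambda p: p.unsqueeze(0),                 # "render" = identity here
    diffusion_eps=lambda x, t, c: torch.zeros_like(x),     # dummy noise predictor
    text_emb=None,
)
```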

Dataset Preparation and Utilization

A critical component of training such a system is the availability of high-quality training data. 3DTopia addresses this with a 3D data captioning and cleaning pipeline that combines vision-language models and LLMs, producing a refined subset of the Objaverse dataset. The resulting 360K+ captions are detailed and faithful to the underlying 3D objects, improving training efficiency and output quality.
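The sketch below illustrates one way such a caption-and-clean pipeline can be structured: a vision-language model captions several rendered views of each object, and an LLM merges the per-view captions into a single object-level description. The function names and the trivial stand-ins are assumptions for illustration; the paper does not expose this exact interface.

```python
# Schematic caption pipeline: per-view VLM captions merged by an LLM.
# vlm_caption and llm_summarize are hypothetical stand-ins, not released APIs.
from typing import Callable, List


def caption_object(view_images: List[str],
                   vlm_caption: Callable[[str], str],
                   llm_summarize: Callable[[List[str]], str]) -> str:
    """Produce one caption for a 3D object from its rendered views."""
    per_view = [vlm_caption(img) for img in view_images]   # e.g. BLIP-style captions
    return llm_summarize(per_view)                         # e.g. an LLM merges and cleans them


# Toy usage with trivial stand-ins for the VLM and LLM calls.
views = ["front.png", "side.png", "back.png"]
caption = caption_object(
    views,
    vlm_caption=lambda img: f"a rendered view of an object ({img})",
    llm_summarize=lambda caps: "; ".join(caps),
)
print(caption)
```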

Experimental Results

The paper reports extensive qualitative and quantitative analyses validating the efficacy of the proposed system. Notably, 3DTopia demonstrates a superior ability to generate detailed 3D assets rapidly, outperforming existing methodologies such as Point-E and Shap-E in terms of quality and fidelity to textual descriptions.

Implications and Future Directions

3DTopia's introduction marks a significant milestone in the text-to-3D generation domain. By effectively marrying rapid prototyping capabilities with high-quality output generation, the system opens up new avenues for applications across various industries, including gaming, virtual reality, and visual effects.

Looking ahead, the scalability and efficiency of 3DTopia hint at its potential for further advancements. Enhanced training on more diverse datasets, integration with more powerful 2D diffusion models, and exploration into generating more complex 3D scenes from elaborate textual descriptions are potential areas for future research and development.

Conclusion

3DTopia represents a significant leap forward in the text-to-3D generation domain. Its innovative two-stage approach, leveraging hybrid diffusion priors, sets new benchmarks in the quality and efficiency of generating 3D assets from natural language inputs. As the system continues to evolve, it promises to unlock new possibilities in the creation and application of 3D content across various domains.
