Abstract

We present a two-stage text-to-3D generation system, namely 3DTopia, which generates high-quality general 3D assets within 5 minutes using hybrid diffusion priors. The first stage samples from a 3D diffusion prior directly learned from 3D data. Specifically, it is powered by a text-conditioned tri-plane latent diffusion model, which quickly generates coarse 3D samples for fast prototyping. The second stage utilizes 2D diffusion priors to further refine the texture of coarse 3D models from the first stage. The refinement consists of both latent and pixel space optimization for high-quality texture generation. To facilitate the training of the proposed system, we clean and caption the largest open-source 3D dataset, Objaverse, by combining the power of vision language models and LLMs. Experimental results are reported qualitatively and quantitatively to show the performance of the proposed system. Our code and models are available at https://github.com/3DTopia/3DTopia.

Figure: 3DTopia's text-to-3D generation pipeline stages and results, from initial text prompts to refined outputs.

Overview

  • 3DTopia introduces a two-stage text-to-3D generation framework that combines hybrid diffusion priors with training on the Objaverse dataset to improve the creation of 3D models from textual descriptions.

  • The system generates coarse 3D models rapidly using a text-conditioned tri-plane latent diffusion model, followed by a refinement stage for texture enhancement using 2D diffusion priors.

  • A refined subset of the Objaverse dataset with over 360K captions was utilized, improving training efficiency and output quality.

  • Experimental results show 3DTopia outperforms existing methods like Point-E and Shap-E in generating detailed 3D assets quickly and accurately, promising applications in gaming, virtual reality, and beyond.

Unveiling 3DTopia: A Novel Approach for Text-to-3D Generation Using Hybrid Diffusion Priors

Introduction

In the burgeoning field of 3D asset generation, transforming textual descriptions into detailed 3D models remains a challenging yet highly desirable goal. The paper introduces 3DTopia, a two-stage text-to-3D generation framework that significantly improves the quality and efficiency of producing 3D models from textual inputs. Leveraging hybrid diffusion priors and the largest open-source 3D dataset, Objaverse, 3DTopia generates general 3D assets with high fidelity within minutes.

3DTopia: A System Overview

3DTopia combines a feed-forward generative model with optimization-based refinement to achieve both rapid prototyping and high-quality 3D output. The system comprises two stages: the rapid generation of coarse 3D samples through a text-conditioned tri-plane latent diffusion model, followed by refinement of these samples for enhanced texture detail.

Stage 1: Coarse 3D Generation

The first stage uses a tri-plane latent diffusion model learned directly from 3D data to generate coarse 3D samples quickly. The tri-plane representation strikes a balance between compact storage and efficient computation, making it well suited to training on large-scale datasets.
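To make the first stage concrete, the sketch below shows what sampling a tri-plane latent with a text-conditioned diffusion model could look like. It is a minimal PyTorch illustration under assumptions, not the actual 3DTopia implementation: the `TriplaneDenoiser` class, the tensor shapes, and the noise schedule are placeholders, and the resulting latent would still need to be decoded by the tri-plane VAE decoder and rendered.

```python
# Minimal sketch of stage-1 sampling: a text-conditioned diffusion model
# denoises a tri-plane latent starting from pure Gaussian noise (DDIM-style).
# TriplaneDenoiser and all shapes/hyperparameters are illustrative stand-ins,
# not the actual 3DTopia modules.
import torch
import torch.nn as nn


class TriplaneDenoiser(nn.Module):
    """Stand-in for the text-conditioned noise-prediction network."""

    def __init__(self, latent_dim: int = 32, text_dim: int = 512):
        super().__init__()
        self.net = nn.Linear(latent_dim + text_dim + 1, latent_dim)

    def forward(self, z, t, text_emb):
        t_feat = t.float().view(-1, 1) / 1000.0           # crude timestep embedding
        return self.net(torch.cat([z, text_emb, t_feat], dim=-1))


@torch.no_grad()
def sample_coarse_triplane(denoiser, text_emb, latent_dim=32, steps=50):
    """Deterministic DDIM sampling in the tri-plane latent space."""
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    z = torch.randn(1, latent_dim)                         # start from pure noise
    ts = torch.linspace(999, 0, steps).long()
    for i, t in enumerate(ts):
        eps = denoiser(z, t.reshape(1), text_emb)          # predicted noise
        a_t = alphas_cum[t]
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        a_prev = alphas_cum[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        z = a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps # DDIM update
    return z                                               # decode with the tri-plane VAE


text_emb = torch.randn(1, 512)                             # placeholder text embedding
coarse_latent = sample_coarse_triplane(TriplaneDenoiser(), text_emb)
```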

Stage 2: Texture Refinement

Building on the coarse models produced by the first stage, the second stage refines their textures using 2D diffusion priors within an optimization-based framework. The refinement optimizes in both latent and pixel space to produce high-quality textures. This dual-phase design combines the strengths of both worlds: rapid prototyping and meticulous detail enhancement.
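As one concrete example of optimization under a 2D diffusion prior, the sketch below follows a Score Distillation Sampling (SDS) style loop: rendered views of the current model are noised, a frozen 2D diffusion model predicts the noise, and the difference drives gradients back into the texture parameters. This is a generic illustration rather than 3DTopia's exact procedure (which also optimizes in latent space); `render_views` and `diffusion_eps` are hypothetical placeholders for a differentiable renderer and the frozen 2D prior.

```python
# Illustrative SDS-style texture refinement under a frozen 2D diffusion prior.
# render_views and diffusion_eps are hypothetical callables (differentiable
# renderer and noise predictor); they are not part of the 3DTopia codebase.
import torch


def sds_refine(texture_params, render_views, diffusion_eps, text_emb,
               iters=500, lr=1e-2):
    """Optimize texture parameters so rendered views agree with the 2D prior."""
    opt = torch.optim.Adam([texture_params], lr=lr)
    alphas_cum = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
    for _ in range(iters):
        img = render_views(texture_params)                 # differentiable rendering
        t = torch.randint(20, 980, (1,))                   # random diffusion timestep
        a_t = alphas_cum[t]
        noise = torch.randn_like(img)
        noisy = a_t.sqrt() * img + (1 - a_t).sqrt() * noise
        with torch.no_grad():
            eps_pred = diffusion_eps(noisy, t, text_emb)   # frozen 2D prior
        grad = (1.0 - a_t) * (eps_pred - noise)            # weighted SDS gradient
        loss = (grad.detach() * img).sum()                 # surrogate loss: dL/d img = grad
        opt.zero_grad()
        loss.backward()
        opt.step()
    return texture_params


# Toy usage with stand-in callables (a real setup would use tri-plane/NeRF
# rendering and a pretrained text-to-image diffusion model).
texture = torch.randn(3, 64, 64, requires_grad=True)
refined = sds_refine(
    texture,
    render_views=lambda p: p.unsqueeze(0),                 # "render" = identity here
    diffusion_eps=lambda x, t, c: torch.zeros_like(x),     # dummy noise predictor
    text_emb=None,
)
```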

Dataset Preparation and Utilization

A critical component of training such a system is the availability of high-quality training data. 3DTopia addresses this with a 3D data captioning and cleaning pipeline that combines vision-language models and LLMs, producing a refined subset of the Objaverse dataset. The resulting 360K+ captions are detailed and faithful to the underlying 3D objects, improving training efficiency and output quality.
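The sketch below illustrates one way such a caption-and-clean pipeline can be structured: a vision-language model captions several rendered views of each object, and an LLM merges the per-view captions into a single object-level description. The function names and the trivial stand-ins are assumptions for illustration; the paper does not expose this exact interface.

```python
# Schematic caption pipeline: per-view VLM captions merged by an LLM.
# vlm_caption and llm_summarize are hypothetical stand-ins, not released APIs.
from typing import Callable, List


def caption_object(view_images: List[str],
                   vlm_caption: Callable[[str], str],
                   llm_summarize: Callable[[List[str]], str]) -> str:
    """Produce one caption for a 3D object from its rendered views."""
    per_view = [vlm_caption(img) for img in view_images]   # e.g. BLIP-style captions
    return llm_summarize(per_view)                         # e.g. an LLM merges and cleans them


# Toy usage with trivial stand-ins for the VLM and LLM calls.
views = ["front.png", "side.png", "back.png"]
caption = caption_object(
    views,
    vlm_caption=lambda img: f"a rendered view of an object ({img})",
    llm_summarize=lambda caps: "; ".join(caps),
)
print(caption)
```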

Experimental Results

The paper reports extensive qualitative and quantitative analyses validating the efficacy of the proposed system. Notably, 3DTopia demonstrates a superior ability to generate detailed 3D assets rapidly, outperforming existing methodologies such as Point-E and Shap-E in terms of quality and fidelity to textual descriptions.

Implications and Future Directions

3DTopia's introduction marks a significant milestone in the text-to-3D generation domain. By effectively marrying rapid prototyping capabilities with high-quality output generation, the system opens up new avenues for applications across various industries, including gaming, virtual reality, and visual effects.

Looking ahead, the scalability and efficiency of 3DTopia hint at its potential for further advancements. Enhanced training on more diverse datasets, integration with more powerful 2D diffusion models, and exploration into generating more complex 3D scenes from elaborate textual descriptions are potential areas for future research and development.

Conclusion

3DTopia represents a significant leap forward in the text-to-3D generation domain. Its innovative two-stage approach, leveraging hybrid diffusion priors, sets new benchmarks in the quality and efficiency of generating 3D assets from natural language inputs. As the system continues to evolve, it promises to unlock new possibilities in the creation and application of 3D content across various domains.
