AToM: Amortized Text-to-Mesh using 2D Diffusion (2402.00867v1)

Published 1 Feb 2024 in cs.CV

Abstract: We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh framework optimized across multiple text prompts simultaneously. In contrast to existing text-to-3D methods that often entail time-consuming per-prompt optimization and commonly output representations other than polygonal meshes, AToM directly generates high-quality textured meshes in less than 1 second with around 10 times reduction in the training cost, and generalizes to unseen prompts. Our key idea is a novel triplane-based text-to-mesh architecture with a two-stage amortized optimization strategy that ensures stable training and enables scalability. Through extensive experiments on various prompt benchmarks, AToM significantly outperforms state-of-the-art amortized approaches with over 4 times higher accuracy (on the DF415 dataset) and produces more distinguishable and higher-quality 3D outputs. AToM demonstrates strong generalizability, offering fine-grained 3D assets for unseen interpolated prompts without further optimization during inference, unlike per-prompt solutions.

Citations (11)

Summary

The paper introduces an amortized optimization strategy that reduces training time while improving the quality of text-to-mesh generation.
It employs a novel triplane-based architecture and a two-stage process combining volumetric rendering with high-resolution mesh refinement.
Empirical results show over four times higher accuracy than prior amortized approaches on the DF415 dataset, generalization to unseen prompts, and mesh generation in under one second at inference.

An Overview of AToM: Amortized Text-to-Mesh using 2D Diffusion

The paper "AToM: Amortized Text-to-Mesh using 2D Diffusion" introduces a feed-forward framework for text-to-3D generation that converts text prompts directly into high-quality textured polygonal meshes. Its main contribution is a reduction in training cost, reported as roughly tenfold, in a field where existing methods often rely on time-consuming optimization repeated for every prompt.

Methodology and Architecture

AToM differs from conventional text-to-3D methods in its use of amortized optimization. Rather than running a separate optimization for each text prompt, it trains a single network across many prompts simultaneously. This is made possible by a triplane-based architecture, which replaces the more commonly used HyperNetworks that condition on positional encodings. The architectural choice improves numerical stability during training and yields better render quality and sharper definition in the generated 3D structures.
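To make the triplane idea concrete, the following is a minimal sketch of how a feature for a 3D point can be read from three axis-aligned feature planes. It is an illustration under stated assumptions, not the paper's implementation: the names `bilinear` and `sample_triplane`, the `[0, 1]` coordinate convention, and the choice to sum the three plane features (rather than concatenate them) are all hypothetical. In a text-to-mesh model the planes would be predicted from the text prompt; here they are plain nested lists.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly sample a grid of feature vectors.

    plane[row][col] is a list of channel values; u indexes columns and
    v indexes rows, both as continuous coordinates clamped to [0, 1].
    """
    h, w = len(plane), len(plane[0])
    x = min(max(u, 0.0), 1.0) * (w - 1)
    y = min(max(v, 0.0), 1.0) * (h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - fx) * (1 - fy) * plane[y0][x0][c]
        + fx * (1 - fy) * plane[y0][x1][c]
        + (1 - fx) * fy * plane[y1][x0][c]
        + fx * fy * plane[y1][x1][c]
        for c in range(channels)
    ]


def sample_triplane(planes, point):
    """Project a point in [0, 1]^3 onto the XY, XZ and YZ planes.

    Assumption: the three sampled features are summed. Concatenation is
    another common choice; the source does not say which AToM uses.
    """
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Usage: three tiny 2x2 planes with one channel each (placeholder values).
ramp = [[[0.0], [1.0]], [[0.0], [1.0]]]
feature = sample_triplane({"xy": ramp, "xz": ramp, "yz": ramp}, (0.5, 0.5, 0.5))
```

The appeal of this layout is that storage grows with the square of the plane resolution rather than the cube, while every 3D point still receives its own interpolated feature.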

The framework is structured around a two-stage optimization process. In the first stage, volumetric rendering in the style of NeRF (Neural Radiance Fields) is used to produce a coarse 3D model. In the second stage, that model is refined through high-resolution mesh optimization. This two-stage approach significantly reduces training time without compromising the quality or distinguishability of the final 3D output.
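The two-stage structure can be pictured as a training schedule that switches renderer and resolution at a fixed step. The sketch below is only that, a schedule: `StageConfig`, `stage_for_step`, the renderer labels, and the resolutions 64 and 512 are illustrative assumptions rather than values taken from this summary, and no actual rendering or optimization is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    renderer: str           # label only; no renderer is invoked here
    render_resolution: int  # hypothetical image side length in pixels


# Illustrative placeholder settings, not figures confirmed by the text above.
STAGE_1 = StageConfig("coarse", "volumetric", 64)
STAGE_2 = StageConfig("refine", "mesh_rasterization", 512)


def stage_for_step(step, stage1_steps):
    """Return the stage configuration active at a given training step."""
    if step < 0 or stage1_steps < 0:
        raise ValueError("step and stage1_steps must be non-negative")
    return STAGE_1 if step < stage1_steps else STAGE_2


# Usage: with 100 coarse steps, step 100 is the first refinement step.
active = stage_for_step(100, stage1_steps=100)
```

Starting with a cheap low-resolution volumetric pass and switching to mesh refinement afterwards is what lets the expensive high-resolution stage begin from a reasonable shape instead of from scratch.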

Numerical Results and Performance

The paper's experiments compare AToM against existing state-of-the-art amortized approaches such as ATT3D. On the DF415 dataset, AToM achieves over four times higher accuracy than these baselines and produces more distinguishable 3D outputs. Quantitative measures such as CLIP R-probability are reported across several prompt benchmarks. AToM also generalizes to unseen text prompts and generates meshes in under one second at inference.
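As a rough illustration of what a CLIP R-probability style score measures, the sketch below assumes it is the softmax probability that a rendered output assigns to its own prompt among all candidate prompts, averaged over outputs. That reading is an assumption: the paper's exact definition (number of views, temperature, CLIP variant) may differ. The function name `clip_r_probability` is hypothetical, and the similarity values in the usage lines are made-up placeholders, not real CLIP scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clip_r_probability(similarity):
    """Average probability of the correct prompt.

    similarity[i][j] is an image-text similarity between the render
    generated for prompt i and the text of candidate prompt j, so the
    correct prompt for row i sits on the diagonal (assumed layout).
    """
    n = len(similarity)
    if n == 0 or any(len(row) != n for row in similarity):
        raise ValueError("similarity must be a non-empty square matrix")
    return sum(softmax(row)[i] for i, row in enumerate(similarity)) / n


# Usage with placeholder similarities for three prompts.
score = clip_r_probability([
    [5.0, 1.0, 0.5],
    [0.8, 4.0, 1.2],
    [0.3, 0.9, 6.0],
])
```

Under this reading, a score near 1 means each output is clearly identifiable as belonging to its own prompt, while a score near 1/n means the outputs are indistinguishable from one another.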

Practical and Theoretical Implications

The implications of AToM are both practical and theoretical. Practically, it could benefit industries that depend on fast, high-quality 3D content generation, such as gaming, digital content creation, and virtual reality, by reducing the computational cost of both training and inference. Theoretically, the framework opens new avenues for research into more efficient and more general training techniques for 3D models.

Future Directions

Future work could improve output fidelity by integrating higher-resolution diffusion priors. Further research might also refine the mesh representation to better handle surfaces with nonzero genus, or explore ways to reduce occurrences of the Janus problem, in which a generated object shows duplicated front-facing features such as multiple faces.

In summary, AToM represents a significant step forward in text-to-mesh generation, offering both reduced computational demands and stronger generalization. These results hold substantial promise for making generative AI more efficient and more widely applicable to creating complex 3D models from text prompts.
