Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion (2405.09874v1)

Published 16 May 2024 in cs.CV

Abstract: We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from texts in only 1 minute. The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to circumvent the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose the dual-mode toggling inference strategy to use only 1/10 denoising steps with 3D mode, successfully generating a 3D asset in just 10 seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at https://dual3d.github.io


Summary

  • The paper introduces a dual-mode multi-view latent diffusion method that integrates efficient 2D and 3D denoising to rapidly generate consistent 3D models.
  • The methodology leverages pretrained 2D latent diffusion models and toggles between 2D and 3D modes, reducing inference time to approximately 50 seconds.
  • The approach achieves high semantic accuracy and aesthetic quality, with strong evaluation metrics and significant potential for applications in gaming, VR/AR, and robotics.

Understanding Dual3D: Efficient Text-to-3D Generation with Dual-Mode Latent Diffusion

Overview

This article breaks down the paper titled "Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion." The work presents an efficient method for converting textual descriptions into high-quality 3D models using a dual-mode multi-view latent diffusion model. Below, we dive into the details and implications of this work.

Key Components

Dual-Mode Multi-view Latent Diffusion Model

The core innovation in this paper is the dual-mode multi-view latent diffusion model. Here's how it works:

  1. Pretrained 2D Latent Diffusion Models (LDMs): The model starts with a pretrained 2D LDM, which is then fine-tuned for 3D purposes. This significantly reduces the training cost and leverages the strengths of already effective 2D models.
  2. Dual Modes: The model operates in two modes:
    • 2D Mode: Efficiently denoises noisy multi-view latents with a single latent denoising network.
    • 3D Mode: Generates a tri-plane neural surface for consistent rendering-based denoising. Tri-planes are three axis-aligned 2D feature planes that together encode a 3D volume; the features of any 3D point are obtained by projecting it onto each plane and combining the sampled values.
  3. Inference Strategy: The paper proposes a dual-mode toggling inference strategy, which switches between the 2D and 3D modes during inference. This achieves high efficiency without compromising quality: only about 1/10 of the denoising steps use the 3D mode, cutting inference time down to just 10 seconds (see the sketch after this list).
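
To make the toggling strategy concrete, here is a minimal Python sketch of what such a denoising loop could look like. The functions `denoise_2d`, `denoise_3d_render`, and `scheduler` are hypothetical placeholders standing in for the paper's modules, not the authors' actual implementation.

```python
def dual_mode_toggling_inference(latents, scheduler, denoise_2d, denoise_3d_render,
                                 toggle_every=10):
    """Minimal sketch of a dual-mode toggling denoising loop (assumed interfaces).

    latents: noisy multi-view latents, e.g. shape (num_views, C, H, W).
    denoise_2d: cheap 2D denoiser applied to all views with one network.
    denoise_3d_render: fits a tri-plane neural surface to the current latents
        and denoises by re-rendering each view from it (returns the surface too).
    """
    surface = None
    last = len(scheduler.timesteps) - 1
    for step, t in enumerate(scheduler.timesteps):
        if step % toggle_every == 0 or step == last:
            # Expensive but 3D-consistent: rendering-based denoising (~1/10 of steps).
            pred, surface = denoise_3d_render(latents, t)
        else:
            # Cheap: plain 2D denoising of the multi-view latents.
            pred = denoise_2d(latents, t)
        # Standard diffusion update toward the next, less noisy timestep.
        latents = scheduler.step(pred, t, latents).prev_sample
    # The tri-plane neural surface from the final 3D-mode call serves as the 3D asset.
    return surface, latents
```

The key design point is that most steps use the cheap 2D pass, while the occasional 3D-mode steps keep the views geometrically consistent.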

Texture Refinement

To enhance the quality of textures in the generated 3D models, a texture refinement process is introduced. This involves (a rough code sketch follows the list):

  • Extracting a mesh from the neural surface.
  • Converting the texture into a learnable texture map.
  • Optimizing this texture map using differentiable rendering and the pretrained 2D LDM.
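
The refinement stage can be pictured as a small optimization loop over a learnable texture map. The sketch below uses PyTorch with hypothetical helpers `render_mesh_with_texture` (a differentiable rasterizer wrapper), `ldm_guidance_loss` (a loss derived from the pretrained 2D LDM, e.g. a score-distillation-style objective), and `sample_camera`; these names are assumptions for illustration, not the paper's API.

```python
import torch

def refine_texture(mesh, init_texture, render_mesh_with_texture,
                   ldm_guidance_loss, sample_camera, num_iters=500, lr=1e-2):
    """Sketch of texture refinement via differentiable rendering (assumed helpers)."""
    # The texture map is the only optimized parameter; the extracted mesh stays fixed.
    texture = torch.nn.Parameter(init_texture.clone())
    optimizer = torch.optim.Adam([texture], lr=lr)

    for _ in range(num_iters):
        camera = sample_camera()  # random viewpoint per iteration
        # Differentiable rendering ties the texture map to image-space pixels.
        image = render_mesh_with_texture(mesh, texture, camera)
        # The pretrained 2D LDM scores the rendering; gradients flow back
        # into the texture map through the differentiable renderer.
        loss = ldm_guidance_loss(image)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return texture.detach()
```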

Numerical Results

The paper reports strong results in generating high-quality 3D assets:

  1. CLIP Similarity & R-Precision: These metrics measure the alignment between the generated 3D assets and their textual descriptions. The method performs strongly on both, indicating that the generated assets are semantically faithful to the prompts (a CLIP-based evaluation sketch follows this list).
  2. Aesthetic Score: The generated 3D models are also evaluated for their aesthetic appeal using the LAION Aesthetic Predictor, where the models receive high scores.
  3. Generation Time: Remarkably, despite the high quality, the method generates models in approximately 50 seconds, a stark contrast to the 3-45 minutes required by other methods like DreamGaussian and MVDream.
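
As an illustration of how the CLIP-based metrics can be computed, the snippet below scores rendered views against the input prompt with Hugging Face's `transformers` CLIP. The checkpoint choice and the exact evaluation protocol used in the paper are assumptions; R-Precision can then be estimated by checking whether the ground-truth prompt is the top CLIP match for the renderings among a pool of candidate prompts.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(rendered_images, prompt):
    """Mean cosine similarity between rendered views (PIL images) and a text prompt."""
    inputs = processor(text=[prompt], images=rendered_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```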

User Study

In a user study involving 24 participants, the method was evaluated across various criteria, confirming the subjective quality of the generated 3D assets. The proposed method consistently scored the highest, aligning well with user preferences.

Implications

Practical Implications

This research can significantly impact industries like gaming, robotics, virtual reality (VR), and augmented reality (AR). For example:

  • Gaming: Generates diverse and high-quality 3D assets quickly, reducing development time and costs.
  • VR/AR: Enhances the realism and detail of virtual objects, improving user experiences.
  • Robotics: Provides accurate and detailed 3D models for simulation and interaction in various environments.

Theoretical Implications

On a theoretical level, this research advances the understanding and application of diffusion models in 3D space. By effectively combining multi-view image data and pre-trained 2D LDMs, it opens new avenues for efficient cross-domain model adaptation and multi-modal learning.

Future Directions

While the paper showcases promising results, there are areas for further exploration:

  1. Handling Complex Text Prompts: The current method struggles with text prompts involving fine-grained or complex concepts. Future research could focus on enhancing the model’s ability to understand and generate intricate multi-object scenes.
  2. Improving Fine Details: Despite the robust texture refinement process, generating extremely detailed or thin shapes remains challenging. Incorporating more advanced 3D representations, like 3D Gaussian Splatting, could further enhance the quality and realism of the generated assets.
  3. Real-world Multi-view Data: Integrating real-world multi-view data could improve the models' ability to generate more realistic and contextually rich 3D objects.

Conclusion

Dual3D introduces an innovative and efficient approach to text-to-3D generation, leveraging the strengths of pretrained 2D LDMs and a dual-mode inference strategy. This method sets a new standard in the field, providing high-quality, semantically accurate 3D models while significantly reducing generation time. As the research progresses, it promises to transform various industries by enabling swift and cost-effective creation of realistic 3D assets.
