Compress3D: a Compressed Latent Space for 3D Generation from a Single Image

(arXiv 2403.13524)
Published Mar 20, 2024 in cs.CV and cs.AI

Abstract

3D generation has witnessed significant advancements, yet efficiently producing high-quality 3D assets from a single image remains challenging. In this paper, we present a triplane autoencoder, which encodes 3D models into a compact triplane latent space to effectively compress both the 3D geometry and texture information. Within the autoencoder framework, we introduce a 3D-aware cross-attention mechanism, which utilizes low-resolution latent representations to query features from a high-resolution 3D feature volume, thereby enhancing the representation capacity of the latent space. Subsequently, we train a diffusion model on this refined latent space. In contrast to solely relying on image embedding for 3D generation, our proposed method advocates for the simultaneous utilization of both image embedding and shape embedding as conditions. Specifically, the shape embedding is estimated via a diffusion prior model conditioned on the image embedding. Through comprehensive experiments, we demonstrate that our method outperforms state-of-the-art algorithms, achieving superior performance while requiring less training data and time. Our approach enables the generation of high-quality 3D assets in merely 7 seconds on a single A100 GPU.

(Figure) Compress3D's three components: the Triplane AutoEncoder, the Triplane Diffusion Model, and the Diffusion Prior Model.

Overview

  • Compress3D presents an efficient method for generating high-quality 3D models from single images, using a novel triplane autoencoder architecture.

  • The system enhances generation fidelity by employing a dual-conditioning strategy with image and shape embeddings.

  • Experimental results show Compress3D outperforms existing methods in speed, quality, and efficiency on standard benchmarks.

  • Compress3D's approach promises broader accessibility to advanced 3D modeling and opens new avenues for research in 3D content generation.

Efficient 3D Model Generation from Single Images with Compress3D

Introduction to Compress3D

Compress3D introduces an approach for generating high-quality 3D models from single images. Its central component is a triplane autoencoder that compresses 3D models into a compact latent space, enabling rapid and accurate generation of detailed assets. On top of this latent space, the method runs a two-stage diffusion pipeline that uses both image and shape embeddings as generation conditions. This dual conditioning notably improves the fidelity of the generated models compared with existing state-of-the-art methods.

Technical Overview

Triplane Autoencoder Architecture

The core of Compress3D's efficiency is its triplane autoencoder, which encodes 3D models into a compressed latent space. The process involves:

  • Encoding: The triplane encoder compresses colored point clouds into a low-dimensional latent space, condensing both the geometry and the texture of a 3D model. It does so by projecting 3D point-wise features onto three 2D planes, with added learnable parameters to preserve information during compression.
  • 3D-aware Cross-Attention Mechanism: To enhance the latent space's representation capacity, a 3D-aware cross-attention mechanism queries features from a high-resolution 3D feature volume using the low-resolution latent representations, increasing the expressiveness of the latent space at minimal computational overhead (see the sketch after this list).
  • Decoding: The decoder reconstructs high-quality colored 3D models from the compressed triplane latent, using a series of ResNet blocks and upsampling layers to recover geometry and texture.
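The mechanics of the encoder are easiest to see in code. Below is a minimal PyTorch sketch of two of the ideas above: scattering point-wise features onto three axis-aligned planes, and a cross-attention step in which low-resolution triplane latents query a high-resolution 3D feature volume. All names, shapes, and the nearest-cell scatter are illustrative assumptions, and the global attention over all voxels is a simplification of the paper's 3D-aware variant, which restricts each query to spatially relevant voxels.

```python
import torch
import torch.nn as nn

def scatter_to_plane(points_2d, feats, res):
    """Mean-pool point features into a (C, res, res) grid by nearest cell.
    points_2d: (N, 2) coords in [-1, 1]; feats: (N, C). Illustrative only."""
    idx = ((points_2d + 1) * 0.5 * (res - 1)).long().clamp(0, res - 1)
    flat = idx[:, 1] * res + idx[:, 0]                   # (N,) linear cell index
    C = feats.shape[1]
    plane = torch.zeros(res * res, C)
    count = torch.zeros(res * res, 1)
    plane.index_add_(0, flat, feats)                     # sum features per cell
    count.index_add_(0, flat, torch.ones(len(flat), 1))
    return (plane / count.clamp(min=1)).t().reshape(C, res, res)

def points_to_triplane(xyz, feats, res=32):
    """Project (N, 3) points in [-1, 1] with (N, C) features onto XY/XZ/YZ planes."""
    return torch.stack([
        scatter_to_plane(xyz[:, [0, 1]], feats, res),    # XY plane
        scatter_to_plane(xyz[:, [0, 2]], feats, res),    # XZ plane
        scatter_to_plane(xyz[:, [1, 2]], feats, res),    # YZ plane
    ])

class TriplaneCrossAttention(nn.Module):
    """Low-res triplane latents (queries) attend to a high-res feature volume
    (keys/values). Global attention here; the paper's 3D-aware version limits
    each latent cell to the voxels it corresponds to."""
    def __init__(self, latent_dim, volume_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads, kdim=volume_dim,
                                          vdim=volume_dim, batch_first=True)

    def forward(self, planes, volume):
        # planes: (B, 3, C, h, w) latents; volume: (B, Cv, D, H, W) features
        B, P, C, h, w = planes.shape
        q = planes.permute(0, 1, 3, 4, 2).reshape(B, P * h * w, C)
        kv = volume.flatten(2).transpose(1, 2)           # (B, D*H*W, Cv) tokens
        out, _ = self.attn(q, kv, kv)                    # latents gather volume info
        out = (q + out).reshape(B, P, h, w, C)           # residual refinement
        return out.permute(0, 1, 4, 2, 3)
```

For instance, `points_to_triplane(torch.rand(2048, 3) * 2 - 1, torch.rand(2048, 8))` yields a `(3, 8, 32, 32)` triplane, and the attention module then lets those coarse planes pull detail from the volumetric features before decoding.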

Diffusion Model and Conditioning Strategy

Rather than depending solely on an image embedding for 3D generation, Compress3D feeds both image and shape embeddings as conditional inputs to a triplane latent diffusion model. The shape embedding, which carries richer 3D information, is estimated by a diffusion prior model conditioned on the image embedding. This dual conditioning enriches the information available to the generation process, improving the accuracy and fidelity of the produced 3D models.
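The flow from a single image to a 3D asset can be summarized as pseudocode. The sketch below is a hypothetical outline, not the paper's implementation: `denoise_step`, `prior_model.dim`, `latent_shape`, and the step counts are placeholder names and values; only the conditioning chain (image embedding, then prior-estimated shape embedding, then a triplane diffusion conditioned on both) follows the method as described.

```python
import torch

@torch.no_grad()
def generate_3d(image, clip_encoder, prior_model, triplane_diffusion, decoder,
                prior_steps=25, diffusion_steps=50):
    """image: (B, 3, H, W). All model objects are hypothetical stand-ins."""
    # 1. Embed the input image (a CLIP-style image encoder).
    img_emb = clip_encoder(image)                                  # (B, D_img)

    # 2. Diffusion prior: denoise a random vector into a shape embedding,
    #    conditioned on the image embedding, to recover richer 3D cues.
    shape_emb = torch.randn(image.shape[0], prior_model.dim)
    for t in reversed(range(prior_steps)):
        shape_emb = prior_model.denoise_step(shape_emb, t, cond=img_emb)

    # 3. Triplane latent diffusion, conditioned on BOTH embeddings.
    latent = torch.randn(image.shape[0], *triplane_diffusion.latent_shape)
    for t in reversed(range(diffusion_steps)):
        latent = triplane_diffusion.denoise_step(latent, t,
                                                 cond=(img_emb, shape_emb))

    # 4. Decode the refined triplane latent into a colored 3D asset.
    return decoder(latent)
```

The key design choice is stage 2: rather than asking the triplane diffusion to infer 3D structure from a 2D embedding alone, the prior first lifts the image embedding into a shape embedding, so the main generator starts from a condition that already encodes geometry.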

Experimental Validation and Results

Comprehensive experiments demonstrate that Compress3D outperforms current state-of-the-art algorithms. Key findings include:

  • High-quality Generation: The approach yields high-quality 3D assets from single images in a mere 7 seconds on a single A100 GPU, outperforming existing methods in terms of both speed and quality.
  • Efficient Training: Remarkably, Compress3D requires less training data and time compared to the current state-of-the-art, showcasing its efficiency and potential for scalability.
  • Quantitative Metrics: The system achieves superior performance on benchmark metrics, including FID and CLIP similarity scores, confirming that the generated 3D models stay faithful to their source images.

Implications and Future Prospects

The introduction of Compress3D presents several practical and theoretical implications for the field of AI and 3D content generation:

  • Efficiency and Accessibility: The method's efficiency in generating high-quality 3D models from limited data and computational resources makes advanced 3D modeling more accessible to a broader range of applications and users.
  • Enhanced 3D Representation: By efficiently leveraging both image and shape embeddings, Compress3D enhances the representation and understanding of three-dimensional geometry and texture from two-dimensional images.
  • Future Research Directions: The compressed latent space and dual-conditioning strategy open avenues for future research in 3D content generation, particularly in exploring further optimizations and applications in virtual reality, gaming, and cinematic productions.

Conclusion

Compress3D marks a substantial advance in generating 3D models from single images, characterized by its efficiency, reduced training data and time requirements, and superior generation quality. The work not only sets a new benchmark in the field but also paves the way for more efficient and accessible 3D content creation.
