
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

arXiv:2404.19702
Published Apr 30, 2024 in cs.CV

Abstract

We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on a single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: https://sai-bi.github.io/project/gs-lrm/ .

Figure: Transformer-based model predicting 3D Gaussian parameters from sparse posed images, visualized as point clouds.

Overview

  • GS-LRM is a 3D reconstruction model that uses a transformer architecture and Gaussian primitives to create high-quality 3D models from 2-4 sparse posed images, reconstructing in roughly 0.23 seconds on a single A100 GPU.

  • The model's transformer blocks handle the multi-view relational reasoning required for reconstruction, and it predicts one Gaussian per input pixel, preserving the fine detail and texture of the original scenes or objects.

  • GS-LRM outperforms state-of-the-art baselines on both object and scene reconstruction benchmarks and shows promising applications in fields like virtual reality, digital heritage preservation, and real estate.

Understanding GS-LRM: 3D Reconstruction from Sparse Images Enhanced by Transformer and Gaussian Splatting Techniques

Introduction to the Model

The paper presents GS-LRM, a new framework for reconstructing high-quality 3D models from a sparse set of images (2-4 views), using a transformer architecture that predicts 3D Gaussian primitives for rendering. The method handles both object and scene reconstruction across a wide range of scales and complexities, producing a reconstruction in about 0.23 seconds on a single A100 GPU.

Key Features and Approach

Transformer-Based Architecture:

  • The model leverages a transformer-based architecture, breaking away from traditional NeRF-based systems, which often struggle with speed and scalability, particularly on detailed, large-scale scenes.
  • Input images are patchified into tokens, much as sentences are split into words, and the concatenated multi-view tokens are fed through a series of transformer blocks that perform the cross-view reasoning needed to predict the 3D structure.
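The patchify step can be sketched as follows. The patch size of 8 and the raw-pixel flattening below are illustrative assumptions; the paper uses its own patch size and a learned linear embedding:

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 8) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and
    flatten each patch into one token vector."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    ph, pw = h // patch_size, w // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (ph, pw, p, p, c)
    return patches.reshape(ph * pw, patch_size * patch_size * c)

# Four 64x64 RGB views -> one concatenated multi-view token sequence,
# which would then be fed through the transformer blocks.
views = [np.random.rand(64, 64, 3) for _ in range(4)]
tokens = np.concatenate([patchify(v) for v in views], axis=0)
print(tokens.shape)  # (256, 192): 4 views x 64 tokens, 8*8*3 features each
```

Concatenating tokens from all views into a single sequence is what lets self-attention reason jointly across viewpoints.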

Efficient Gaussian Parameter Prediction:

  • Instead of generating a 3D volume or a set of planes, the model predicts Gaussian primitives that describe the 3D geometry directly. Each pixel in the input images corresponds to one 3D Gaussian, a direct mapping that retains high-quality details and textures.
  • Each Gaussian encapsulates position (via per-pixel depth), color, scale, rotation, and opacity, offering a rich, articulate representation of the original scenes or objects.
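A minimal sketch of how a per-pixel feature map might be decoded into Gaussian attributes. The channel layout and activation functions below are assumptions for illustration, not the paper's exact parameterization:

```python
import numpy as np

def decode_pixel_gaussians(features: np.ndarray):
    """Split an (H, W, 12) per-pixel feature map into Gaussian attributes.
    Assumed channel layout: rgb (3), scale (3), rotation quaternion (4),
    opacity (1), depth along the pixel ray (1)."""
    rgb     = 1 / (1 + np.exp(-features[..., 0:3]))    # sigmoid -> [0, 1]
    scale   = np.exp(features[..., 3:6])               # positive scales
    quat    = features[..., 6:10]
    quat    = quat / np.linalg.norm(quat, axis=-1, keepdims=True)  # unit quat
    opacity = 1 / (1 + np.exp(-features[..., 10:11]))  # sigmoid -> (0, 1)
    depth   = np.exp(features[..., 11:12])             # positive distance
    return rgb, scale, quat, opacity, depth

feats = np.random.randn(64, 64, 12)  # one decoded 64x64 view
rgb, scale, quat, opacity, depth = decode_pixel_gaussians(feats)
print(rgb.shape, opacity.shape)  # (64, 64, 3) (64, 64, 1)
```

Combining the predicted depth with each pixel's known camera ray yields the Gaussian's 3D position, which is what makes the per-pixel mapping a full 3D representation.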

Performance Metrics

GS-LRM has demonstrated outstanding results across two main experimental setups, object reconstruction and scene reconstruction:

  • For object reconstruction, the model achieves roughly a 4 dB PSNR improvement over existing state-of-the-art methods on certain datasets.
  • For scene reconstruction, it outperforms competitors by up to 2.2 dB in PSNR.
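For context, PSNR is 10·log10(MAX²/MSE), so a +4 dB gain corresponds to roughly a 2.5× reduction in mean squared error. A quick reference implementation:

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between a rendering and ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Uniform error of 0.1 on [0, 1] images gives MSE = 0.01, i.e. PSNR = 20 dB.
print(psnr(np.full((8, 8), 0.1), np.zeros((8, 8))))  # 20.0

# A +4 dB improvement shrinks MSE by a factor of 10**(4/10).
print(round(10 ** 0.4, 2))  # 2.51
```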

These strong performance indicators suggest that the approach isn’t just theoretically sound but also practically superior.

Practical and Theoretical Implications

In practical scenarios, GS-LRM can be employed in fields like virtual reality, where rapid, high-fidelity 3D model creation from limited images enhances user experience and system efficiency. In digital heritage preservation or real-estate display, the ability to quickly generate 3D representations from a few photographs could significantly reduce the cost and time required for detailed 3D modeling.

Theoretically, the work extends the understanding of how transformers, typically used in NLP, can be effectively adapted for visual and spatial data, dealing efficiently with the complexities inherent in multi-view 3D reconstruction. It also showcases the scalability of Gaussian splatting as a successful alternative to volume rendering for real-time applications.
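The speed advantage of Gaussian splatting comes largely from replacing per-ray volume sampling with front-to-back alpha compositing of depth-sorted splats. A single-pixel sketch (simplified: real splatting also evaluates each Gaussian's projected 2D footprint):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted contributions
    for one pixel: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    out = np.zeros(3)
    transmittance = 1.0  # fraction of light still unblocked
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= (1.0 - a)
    return out

# Two splats covering a pixel: a near red one (alpha 0.5) in front of
# an opaque blue one. Half the blue shows through the red.
print(composite([(1, 0, 0), (0, 0, 1)], [0.5, 1.0]))  # [0.5 0.  0.5]
```

Because each Gaussian contributes analytically, no dense sampling along the ray is needed, which is what enables the real-time rendering the section describes.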

Future Horizons

Looking ahead, potential areas of development might involve:

  • Resolution Enhancements: Pushing the boundaries to handle higher resolutions such as 1K or 2K could open up further applications in high-end simulation systems.
  • Autonomous Camera Parameter Estimation: Integrating systems that can deduce camera parameters from images could make the model more robust and user-friendly, particularly for consumer-grade applications.
  • Handling Unseen Regions: Improvements in algorithms that can speculate or interpolate parts of the scene not captured in the input images could provide a more comprehensive solution.

Conclusion

The GS-LRM model sets a new benchmark in the field of 3D reconstruction by leveraging advanced AI techniques to process sparse images rapidly and accurately. Its versatility in handling different scales and complexities makes it a promising tool for both present applications and future exploration in computer vision and AI.
