Abstract

Simultaneous localization and mapping is essential for position tracking and scene understanding. 3D Gaussian-based map representations enable photorealistic reconstruction and real-time rendering of scenes using multiple posed cameras. We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM. Our method, MM3DGS, addresses the limitations of prior neural radiance field-based representations by enabling faster rendering, scale awareness, and improved trajectory tracking. Our framework enables keyframe-based mapping and tracking utilizing loss functions that incorporate relative pose transformations from pre-integrated inertial measurements, depth estimates, and measures of photometric rendering quality. We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit. Experimental evaluation on several scenes from the dataset shows that MM3DGS achieves 3x improvement in tracking and 5% improvement in photometric rendering quality compared to the current 3DGS SLAM state-of-the-art, while allowing real-time rendering of a high-resolution dense 3D map. Project Webpage: https://vita-group.github.io/MM3DGS-SLAM

Overview

  • Introduces a novel SLAM framework, MM3DGS, that utilizes vision, depth, and inertial inputs to enhance trajectory tracking and map rendering.

  • MM3DGS employs 3D Gaussian splatting for real-time rendering and accurate map representation, improving upon previous sparse point cloud and neural radiance field methods.

  • The system combines photometric loss functions with depth estimates to achieve precise localization and detailed mapping.

  • Tested on the UT-MM dataset, MM3DGS demonstrates superior tracking accuracy and rendering quality, indicating potential across various applications.

Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements

Introduction

Simultaneous Localization and Mapping (SLAM) is a critical component of applications ranging from autonomous vehicle navigation to augmented reality, and the choice of sensor input and map representation strongly influences a SLAM system's performance. Traditional approaches often rely on sparse visual inputs or on depth data from high-cost sensors such as LiDAR, which limits their deployment in consumer-oriented applications. The paper introduces Multi-modal 3D Gaussian Splatting (MM3DGS), a SLAM framework that leverages vision, depth, and inertial measurements. By integrating inertial data and depth estimates with a 3D Gaussian map representation, MM3DGS improves both trajectory tracking and map rendering.

SLAM Map Representations

Existing SLAM techniques primarily utilize sparse point clouds or neural radiance fields for environmental mapping. While the former excels in tracking precision, the latter provides detailed, photorealistic reconstructions at the cost of computational efficiency. MM3DGS bridges this gap by employing 3D Gaussian splatting for real-time rendering and accurate map representation, overcoming the limitations associated with prior methods. This approach allows for scale-aware mapping, improved trajectory alignment, and efficient rendering without extensive scene-specific training.
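To make the map representation concrete, the sketch below shows the standard 3D Gaussian splatting parameterization of a single primitive (mean, rotation, per-axis scale, opacity, color) and its anisotropic covariance Σ = R S Sᵀ Rᵀ. The class and field names are illustrative only and are not taken from the MM3DGS code.

```python
# Minimal sketch of a 3D Gaussian map primitive in the style of 3D Gaussian
# splatting; GaussianPrimitive and its fields are illustrative names, not the
# paper's implementation.
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianPrimitive:
    mean: np.ndarray      # (3,) center position in the world frame
    rotation: np.ndarray  # (3, 3) rotation matrix R
    scale: np.ndarray     # (3,) per-axis standard deviations
    opacity: float        # alpha in [0, 1]
    color: np.ndarray     # (3,) RGB

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, the anisotropic covariance of the splat."""
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

# Example: an axis-aligned splat stretched along x.
g = GaussianPrimitive(
    mean=np.zeros(3),
    rotation=np.eye(3),
    scale=np.array([0.10, 0.02, 0.02]),
    opacity=0.8,
    color=np.array([0.9, 0.4, 0.1]),
)
print(g.covariance())
```

Because each splat stores an explicit position and covariance, the map can be rendered by projecting and alpha-compositing the Gaussians directly, without per-scene network training.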

Efficient 3D Representation and Multi-modal SLAM Frameworks

Within MM3DGS, 3D Gaussian splatting represents the scene volumetrically with explicit Gaussians, enabling faster convergence and detailed scene reconstruction. By incorporating inertial measurements alongside visual and depth data, the framework addresses common sensor limitations and improves robustness and tracking accuracy, including in dynamic environments.
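As a rough illustration of how inertial data can supply a relative-pose prior between keyframes, the following simplified pre-integration routine accumulates gyroscope and accelerometer samples into rotation, velocity, and position increments. It omits the bias estimation, gravity handling, and noise propagation that a complete SLAM front end would model, and is not the paper's implementation.

```python
# Simplified IMU pre-integration between two keyframes (bias, gravity, and
# noise modeling omitted); a hedged sketch, not MM3DGS's actual routine.
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix of a vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues formula: rotation matrix from an axis-angle vector."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3) + skew(w)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, accel, dt):
    """Accumulate body-frame rotation, velocity, and position increments
    from per-sample angular velocity and acceleration measurements."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        dp = dp + dv * dt + 0.5 * (dR @ a) * dt ** 2
        dv = dv + (dR @ a) * dt
        dR = dR @ so3_exp(np.asarray(w) * dt)
    return dR, dv, dp

# Example: 20 samples at 200 Hz of pure yaw rotation and no acceleration.
gyro = [np.array([0.0, 0.0, 0.5])] * 20
accel = [np.zeros(3)] * 20
dR, dv, dp = preintegrate(gyro, accel, dt=0.005)
```

Increments of this kind can serve as a relative pose transformation between consecutive keyframes, which is how the abstract describes the inertial term entering the tracking loss.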

Methodology

MM3DGS integrates pose optimization, keyframe selection, Gaussian initialization, and mapping into a cohesive framework that handles inputs from readily available, low-cost sensors. By combining photometric loss functions with depth estimates, the system achieves precise localization and detailed environmental mapping. Notably, the method integrates depth supervision by using depth priors for Gaussian initialization and by optimizing map fidelity with a depth-correlation loss.
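The snippet below sketches one plausible way to combine a photometric term with a depth-correlation term of the kind described above. The L1-only photometric term, the weighting parameter `lambda_depth`, and the function names are assumptions for illustration; the exact formulation (including any SSIM component) used by MM3DGS is not reproduced here.

```python
# Illustrative mapping loss: photometric L1 plus a scale-invariant
# depth-correlation term; a sketch under stated assumptions, not the
# paper's exact objective.
import numpy as np

def photometric_l1(rendered_rgb, target_rgb):
    """Mean absolute error between the rendered and captured images."""
    return np.mean(np.abs(rendered_rgb - target_rgb))

def depth_correlation_loss(rendered_depth, depth_prior, eps=1e-8):
    """1 - Pearson correlation, insensitive to the unknown scale and shift
    of a monocular depth prior."""
    r = rendered_depth.ravel() - rendered_depth.mean()
    d = depth_prior.ravel() - depth_prior.mean()
    corr = (r @ d) / (np.linalg.norm(r) * np.linalg.norm(d) + eps)
    return 1.0 - corr

def mapping_loss(rendered_rgb, target_rgb, rendered_depth, depth_prior,
                 lambda_depth=0.1):
    """Weighted sum of the photometric and depth terms (weight hypothetical)."""
    return (photometric_l1(rendered_rgb, target_rgb)
            + lambda_depth * depth_correlation_loss(rendered_depth, depth_prior))
```

A correlation-based depth term is one natural choice when the depth prior comes from a monocular estimator, since such estimates are only defined up to scale and shift.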

Experimental Setup and Results

Evaluated on the custom UT Multi-modal (UT-MM) dataset, MM3DGS demonstrates a 3x improvement in tracking accuracy and a 5% improvement in rendering quality over current state-of-the-art methods. These results rest on the system's ability to process multi-modal inputs efficiently while rendering high-resolution 3D maps in real time. The release of the UT-MM dataset, covering a variety of indoor scenarios, provides a useful resource for further research and benchmarking in the field.

Conclusion and Future Directions

MM3DGS represents a significant stride towards achieving robust, efficient, and scalable SLAM using multi-modal sensor data, supported by a 3D Gaussian-based map representation. The framework's superior performance in both qualitative and quantitative evaluations underscores its potential applicability across diverse domains requiring real-time localization and mapping. Future work may explore tighter integration of inertial measurements, loop closure mechanisms, and extension to outdoor environments to further enhance the system's accuracy and applicability.
