PLGSLAM

Abstract

Neural implicit scene representations have recently shown encouraging results in dense visual SLAM. However, existing methods produce low-quality scene reconstructions and inaccurate localization when scaled up to large indoor scenes and long sequences. These limitations stem mainly from their single, global radiance field with finite capacity, which does not adapt to large scenarios, and from end-to-end pose networks that are not robust to the accumulation of errors in large scenes. To this end, we introduce PLGSLAM, a neural visual SLAM system capable of high-fidelity surface reconstruction and robust camera tracking in real time. To handle large-scale indoor scenes, PLGSLAM proposes a progressive scene representation method that dynamically allocates new local scene representations, each trained with the frames within a local sliding window. This allows the system to scale up to larger indoor scenes and improves robustness, even under pose drift. Within each local scene representation, PLGSLAM combines tri-planes, which capture local high-frequency features, with multi-layer perceptron (MLP) networks for low-frequency features, achieving smoothness and scene completion in unobserved areas. Moreover, we propose a local-to-global bundle adjustment method with a global keyframe database to address the increased pose drift on long sequences. Experimental results demonstrate that PLGSLAM achieves state-of-the-art scene reconstruction and tracking performance across various datasets and scenarios, both in small and large-scale indoor environments.

Overview

  • Neural implicit scene representations show promise for dense visual SLAM but struggle with large scenes and long sequences.

  • PLGSLAM is introduced, a neural SLAM system for large-scale indoor scene reconstruction and camera tracking.

  • The method combines a progressive scene representation with joint tri-plane and MLP networks for detailed reconstructions.

  • It incorporates a local-to-global bundle adjustment for improved pose estimation accuracy and robustness.

  • It outperforms existing methods in experiments, indicating suitability for applications such as robotics, autonomous driving, and augmented reality.

Background

Neural implicit scene representations have emerged as a promising approach for dense visual SLAM (Simultaneous Localization and Mapping), demonstrating encouraging results in reconstructing dense 3D environments from visual data. However, scaling to larger indoor scenes and longer sequences has remained a challenge, often degrading both scene reconstruction and localization. This stems partly from the use of a single, global radiance field with finite capacity, and from end-to-end pose networks that are not robust enough to handle large environments.

Methodology

The paper introduces PLGSLAM, a neural visual SLAM system designed to deliver high-fidelity surface reconstruction and robust, real-time camera tracking in large-scale indoor scenes. The system addresses scalability and robustness through several innovations:

  • Progressive Scene Representation: The system dynamically allocates a new local scene representation as the camera explores the environment, each trained on the frames within a local sliding window. This divides the scene into manageable parts and improves scalability and robustness; the allocation logic is sketched at the end of this section.
  • Joint Tri-Planes and MLP Networks: Each local representation combines tri-planes, which encode local high-frequency features, with MLP networks for low-frequency structure, yielding detailed, smooth, and complete reconstructions even in previously unobserved areas (see the sketch directly after this list).
  • Local-to-Global Bundle Adjustment: A local-to-global bundle adjustment method mitigates pose drift over long sequences by optimizing recent poses jointly against a global keyframe database; keyframe selection is covered in the sketch at the end of this section.
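To make the second component concrete, here is a minimal PyTorch sketch of what a joint tri-plane and MLP field can look like. The class name, feature dimensions, the additive fusion of the two branches, and the decoder outputs are all illustrative assumptions rather than details of the PLGSLAM implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneMLPField(nn.Module):
    """Hypothetical joint tri-plane + MLP local scene representation."""

    def __init__(self, res=128, feat_dim=32, hidden=64):
        super().__init__()
        # Three axis-aligned learnable feature planes (XY, XZ, YZ) hold
        # high-frequency local detail.
        self.planes = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, feat_dim, res, res))
             for _ in range(3)]
        )
        # A small coordinate MLP supplies the smooth, low-frequency component
        # that fills in unobserved regions.
        self.low_freq = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim)
        )
        # Shared decoder maps the fused feature to (geometry value, RGB).
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 4)
        )

    @staticmethod
    def _sample(plane, uv):
        # uv: (N, 2) in [-1, 1]; grid_sample expects a (1, 1, N, 2) grid.
        feats = F.grid_sample(plane, uv.view(1, 1, -1, 2), align_corners=True)
        return feats.view(plane.shape[1], -1).t()  # (1, C, 1, N) -> (N, C)

    def forward(self, xyz):
        # xyz: (N, 3) points normalized to the local bounding box [-1, 1]^3.
        high = (self._sample(self.planes[0], xyz[:, [0, 1]])    # XY
                + self._sample(self.planes[1], xyz[:, [0, 2]])  # XZ
                + self._sample(self.planes[2], xyz[:, [1, 2]])) # YZ
        fused = high + self.low_freq(xyz)  # high- plus low-frequency features
        out = self.decoder(fused)
        return out[:, :1], torch.sigmoid(out[:, 1:])  # geometry, color
```

The intuition behind the split: bilinear interpolation on the 2D feature planes recovers fine detail cheaply, while the coordinate MLP varies slowly everywhere, which is what gives smooth completion in regions the cameras never observed.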

Taken together, these components allow PLGSLAM to scale to large indoor scenes while keeping accumulated pose drift in check.
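Below is a similarly minimal sketch of the allocation and keyframe-selection logic from the first and third components, reusing the hypothetical TriPlaneMLPField above. The bound size, window length, and keyframe-subsampling stride are assumptions for illustration, not the paper's actual heuristics:

```python
import numpy as np

class ProgressiveSceneManager:
    """Hypothetical progressive allocation plus local-to-global keyframe
    selection; all names and thresholds are illustrative."""

    def __init__(self, bound_size=4.0, window=20):
        self.bound_size = bound_size  # side length of each local bound (meters)
        self.window = window          # keyframes kept in the local sliding window
        self.local_fields = []        # list of (center, field) pairs
        self.keyframe_db = []         # global keyframe database: (pose, frame_id)

    def active_field(self, cam_center):
        # Reuse the local field whose bounding box contains the camera;
        # allocate a new one once the camera has left every existing bound
        # (the "progressive scene representation" step).
        for center, field in self.local_fields:
            if np.all(np.abs(cam_center - center) < self.bound_size / 2):
                return field
        field = TriPlaneMLPField()  # new local representation (sketch above)
        self.local_fields.append((cam_center.copy(), field))
        return field

    def add_keyframe(self, pose, frame_id):
        self.keyframe_db.append((pose, frame_id))

    def ba_keyframes(self):
        # Local-to-global bundle adjustment: jointly optimize the recent
        # sliding-window keyframes and a sparse subset of older keyframes
        # from the global database, which bounds accumulated drift.
        recent = self.keyframe_db[-self.window:]
        older = self.keyframe_db[:-self.window]
        stride = max(1, len(older) // self.window)
        return recent + older[::stride]
```

In a real system the returned keyframes would feed a pose-graph or reprojection-error optimizer; the point of the sketch is only the local-to-global selection pattern.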

Experimental Results

PLGSLAM has been tested on multiple datasets featuring different indoor environments, from small single rooms to large multi-room apartments. The system demonstrated state-of-the-art 3D reconstruction and pose estimation accuracy compared to existing methods, and the experiments highlighted its ability to handle long video sequences and large-scale indoor scenes effectively.

Conclusion and Future Work

The paper presents PLGSLAM as a system that addresses the key challenges of scaling neural implicit scene representations for dense visual SLAM in large indoor settings. With its progressive scene representation and local-to-global bundle adjustment, the system delivers significant improvements in both reconstruction and localization, paving the way for more robust and accurate SLAM systems in autonomous driving, robotics, and augmented reality. The authors state that the code will be open-sourced upon paper acceptance, allowing wider use and further development.
