Abstract

Current monocular 3D scene reconstruction (3DR) works are either fully supervised, not generalizable, or implicit in their 3D representation. We propose a novel framework, MonoSelfRecon, that for the first time achieves explicit 3D mesh reconstruction of generalizable indoor scenes from monocular RGB views through pure self-supervision on voxel-SDF (signed distance function). MonoSelfRecon follows an autoencoder-based architecture and decodes voxel-SDF alongside a generalizable Neural Radiance Field (NeRF), which is used to guide the voxel-SDF in self-supervision. We propose novel self-supervised losses that not only support pure self-supervision but can also be combined with supervised signals to further boost supervised training. Our experiments show that MonoSelfRecon trained with pure self-supervision outperforms the best current self-supervised indoor depth estimation models and is comparable to 3DR models trained with full supervision on depth annotations. MonoSelfRecon is not restricted to a specific model design and can be applied, in a purely self-supervised manner, to any model that estimates voxel-SDF.

Overview

  • MonoSelfRecon introduces a novel framework for self-supervised 3D scene reconstruction from monocular RGB sequences, achieving explicit 3D mesh representation without needing large-scale annotations.

  • The framework leverages an autoencoder architecture and novel self-supervised loss functions to enhance performance and generalizability to different indoor scenes.

  • Experimental results show that MonoSelfRecon outperforms existing self-supervised depth estimation models and performs comparably to fully-supervised approaches, demonstrating its potential for practical applications in robotics, virtual reality, and autonomous navigation.

MonoSelfRecon: Purely Self-Supervised Explicit Generalizable 3D Reconstruction of Indoor Scenes from Monocular RGB Views

Overview

The paper "MonoSelfRecon: Purely Self-Supervised Explicit Generalizable 3D Reconstruction of Indoor Scenes from Monocular RGB Views" introduces a novel framework, MonoSelfRecon, for self-supervised 3D scene reconstruction. The proposed methodology is significant as it aligns three crucial standards in 3D reconstruction: explicit mesh representation, generalizability to different indoor scenes, and the elimination of the need for large-scale annotations for training. The framework is based on an autoencoder architecture that decodes voxel-SDF (signed distance function) and employs a generalizable Neural Radiance Field (NeRF) to facilitate self-supervised learning.

Key Contributions

  1. Framework Design: MonoSelfRecon is the first to achieve explicit 3D mesh reconstruction of indoor scenes from monocular RGB sequences purely through self-supervised training on voxel-SDF. Its distinctiveness lies in generating explicit 3D meshes without relying on supervised depth or SDF annotations (see the mesh-extraction sketch after this list).
  2. Self-Supervised Losses: The paper proposes novel self-supervised losses that are designed to enhance the framework's performance. These losses are not only effective for pure self-supervision but can also be integrated with supervised losses to improve fully-supervised training outcomes.
  3. Generality and Flexibility: MonoSelfRecon is not confined to specific model designs, allowing it to be extended to any model with voxel-SDF estimation. This flexibility maintains the advantages of the original model, such as inference speed and memory efficiency.
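
The paper extracts explicit meshes from the predicted voxel-SDF. The zero level set of an SDF grid is conventionally meshed with marching cubes; below is a minimal sketch under that assumption using scikit-image and trimesh, where `sdf_volume` and `voxel_size` are illustrative names, not identifiers from the paper.

```python
import numpy as np
from skimage import measure
import trimesh


def sdf_to_mesh(sdf_volume: np.ndarray, voxel_size: float = 0.04) -> trimesh.Trimesh:
    """Extract an explicit triangle mesh from a dense voxel-SDF grid.

    The surface is the zero level set of the SDF, so marching cubes is
    run with level=0. `spacing` scales vertices to metric units.
    """
    verts, faces, normals, _ = measure.marching_cubes(
        sdf_volume, level=0.0, spacing=(voxel_size,) * 3
    )
    return trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)


# Usage (hypothetical): mesh = sdf_to_mesh(predicted_sdf); mesh.export("scene.ply")
```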

Experimental Results

The paper provides a comprehensive experimental evaluation comparing MonoSelfRecon with state-of-the-art (SOTA) self-supervised methods and supervised techniques. Key findings are as follows:

  • Depth Estimation: When trained purely in self-supervision, MonoSelfRecon outperforms existing self-supervised depth estimation models (e.g., P2Net, StructDepth) and is comparable to some fully-supervised approaches.
  • 3D Mesh Reconstruction: The 3D mesh metrics reveal that MonoSelfRecon achieves superior results in generalizability while maintaining high precision and recall. It also proves to be effective in scenes where supervised depth estimation might introduce inconsistencies, such as depth layering or sparsity.

Methodology

Autoencoder-based Architecture

MonoSelfRecon employs an autoencoder architecture where the encoder processes the input monocular RGB sequences to produce a latent space representation. This representation is then decoded into voxel-SDF and a generalizable Neural Radiance Field (NeRF).
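
The paper describes this pipeline only at a high level. A minimal PyTorch sketch of the decode path might look like the following; every module name, tensor shape, and the feature-aggregation step here is an illustrative placeholder, not the authors' published design.

```python
import torch
import torch.nn as nn


class MonoSelfReconSketch(nn.Module):
    """Illustrative decode path: RGB frames -> shared 2D features ->
    (a) a voxel-SDF decoder and (b) a NeRF-style color/density head.
    All module internals are stand-ins for the real architecture."""

    def __init__(self, feat_dim: int = 32, n_voxels: int = 64 ** 3):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)  # stand-in 2D backbone
        self.sdf_decoder = nn.Linear(feat_dim, 1)            # SDF value per voxel
        self.nerf_head = nn.Linear(feat_dim, 4)              # RGB + density per ray sample
        self.n_voxels = n_voxels

    def forward(self, frames: torch.Tensor):
        # frames: (T, 3, H, W) monocular RGB sequence.
        feats = self.encoder(frames)                         # (T, C, H, W)
        # Placeholder for back-projecting 2D features into the voxel grid
        # (the real model aggregates features across views and scales).
        pooled = feats.mean(dim=(0, 2, 3))                   # (C,)
        sdf = self.sdf_decoder(pooled.expand(self.n_voxels, -1))  # (N_voxels, 1)
        # Placeholder for sampling features along camera rays for NeRF rendering.
        rgb_sigma = self.nerf_head(pooled.expand(1024, -1))  # (N_rays, 4)
        return sdf, rgb_sigma
```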

Loss Functions

Several self-supervised losses are introduced:

  • SDF Photometric Loss: Ensures that voxel-SDF estimations are photometrically consistent across different views.
  • SDF Co-Planar Loss: Leverages geometric regularities in indoor scenes to enforce planar consistency in SDF estimations.
  • Depth Consistency Loss: Improves SDF estimation by enforcing consistency between pseudo-SDF depth and NeRF-rendered depth.

These loss functions collectively guide the network to accurately reconstruct the 3D scene by ensuring multi-view and geometric consistency.
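
To make the first loss concrete: photometric consistency is typically enforced by rendering a depth map from the predicted SDF, warping a source frame into the reference view with known intrinsics and relative pose, and penalizing the difference between the warped and observed images. The sketch below follows that standard recipe; the paper's exact formulation (e.g., the SSIM weighting common in self-supervised depth work) may differ.

```python
import torch
import torch.nn.functional as F


def photometric_loss(ref_img, src_img, depth, K, T):
    """Self-supervised photometric consistency (batch size 1 for clarity).

    ref_img, src_img: (1, 3, H, W) RGB frames.
    depth:            (H, W) depth rendered from the predicted SDF.
    K:                (3, 3) camera intrinsics.
    T:                (4, 4) relative pose, reference -> source camera.
    """
    _, _, h, w = ref_img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).view(3, -1)   # (3, H*W)
    # Back-project reference pixels to 3D and move them into the source frame.
    cam = torch.linalg.inv(K) @ pix * depth.view(1, -1)
    cam = torch.cat([cam, torch.ones(1, h * w)])                   # homogeneous
    src = K @ (T @ cam)[:3]
    src = src[:2] / src[2:].clamp(min=1e-6)                        # (2, H*W)
    # Normalize projected coordinates to [-1, 1] and sample the source image.
    grid = torch.stack(
        [src[0] / (w - 1) * 2 - 1, src[1] / (h - 1) * 2 - 1], dim=-1
    ).view(1, h, w, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)
    # L1 photometric error; an SSIM term is typically mixed in as well.
    return (warped - ref_img).abs().mean()
```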

Implications and Future Directions

MonoSelfRecon demonstrates the feasibility of combining self-supervision with generalizable explicit 3D representation, paving the way for practical applications in areas like robotics, virtual reality, and autonomous navigation. Its ability to operate without extensive supervised data makes it particularly valuable for scenarios where obtaining such data is impractical or costly.

Future developments in this research direction could focus on extending the framework to accommodate outdoor environments, where depth variations are more significant. Moreover, further refining the architecture to handle continuous 3D space without interpolation could enhance its applicability and accuracy.

Conclusion

The paper presents a significant advancement in the field of 3D reconstruction by introducing a framework capable of self-supervised, generalizable, and explicit 3D mesh reconstruction from monocular RGB sequences. By addressing the limitations of existing methods, MonoSelfRecon provides a robust solution for indoor scene reconstruction, demonstrating superior performance across various benchmarks. The implications of this research extend to improving AI systems' efficiency and adaptability in diverse environments, representing a meaningful step forward in 3D computer vision.
