Abstract

Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior works of self-supervision backpropagate losses defined within immediate neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme by performing local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, however, without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show self-supervision within $5$ frames already benefits SoTA supervised depth and correspondence models.

Self-supervised depth estimation using Bundle-RANSAC-Adjustment and sparse point clouds with geometric verification.

Overview

  • The paper presents a novel approach that combines self-supervised depth estimation with classic Structure-from-Motion (SfM) techniques to improve depth estimation accuracy from RGB videos.

  • The method introduces 'Bundle-RANSAC-Adjustment', a pose-optimization algorithm that jointly refines camera poses and per-depthmap depth adjustments, yielding measurable gains on benchmark datasets.

  • The approach supports downstream applications such as augmented reality, virtual reality, and autonomous driving, where temporally consistent depth and robust pose estimation are essential.

Self-supervised Depth Estimation Revisited: Integrating Structure-from-Motion Techniques

The paper, "Revisit Self-supervised Depth Estimation with Local Structure-from-Motion," presents a novel approach that bridges the gap between self-supervised depth estimation and classic Structure-from-Motion (SfM) techniques for extracting scene depth from RGB videos. Despite their shared goals, these paradigms have traditionally remained disconnected. The proposed method, referred to as local SfM, leverages elements from both approaches to enhance depth estimation accuracy, particularly in scenarios involving limited input frames.

Methodology

The technique diverges from conventional self-supervised depth estimation methodologies that depend on photometric loss computed over pairs of adjacent frames. Instead, it employs a more integrated local SfM pipeline, encapsulating the following steps:

  1. Depth and Correspondence Estimation: Initially, depth maps and pairwise correspondence maps are inferred from calibrated RGB or RGB-D images through a depth and correspondence estimator.
  2. Bundle-RANSAC-Adjustment: A novel optimization algorithm jointly refines camera poses and a single depth adjustment per depthmap. It integrates multi-view constraints while maintaining robustness and accuracy over short sequences of frames.
  3. Depth Adjustment and Triangulation: With camera poses fixed, a NeRF-style volumetric representation, implemented without a neural network, performs dense triangulation and geometric verification. This final step yields geometrically verified sparse depths.

The algorithm's output comprises poses, depth adjustments, and triangulated sparse depths across multiple frames.
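The per-depthmap adjustment at the heart of step 2 can be illustrated with a toy example: given matched depth samples from two frames, a RANSAC loop robustly estimates a single multiplicative adjustment despite outlier matches. This is a minimal sketch under assumed simplifications, not the paper's algorithm: the function name is hypothetical, and a scale-only adjustment stands in for whatever parameterization the paper uses.

```python
import numpy as np

def ransac_depth_scale(d_ref, d_src, iters=100, thresh=0.05, seed=0):
    """Toy stand-in for a per-depthmap adjustment: find one scale s
    such that s * d_src aligns with d_ref, robust to outlier matches."""
    rng = np.random.default_rng(seed)
    best_scale, best_inliers = 1.0, -1
    for _ in range(iters):
        i = rng.integers(len(d_ref))        # minimal sample: one match
        s = d_ref[i] / d_src[i]             # hypothesized scale
        inliers = np.sum(np.abs(s * d_src - d_ref) < thresh * d_ref)
        if inliers > best_inliers:
            best_scale, best_inliers = s, inliers
    # least-squares refinement on the inlier set of the best hypothesis
    mask = np.abs(best_scale * d_src - d_ref) < thresh * d_ref
    return np.sum(d_ref[mask] * d_src[mask]) / np.sum(d_src[mask] ** 2)
```

In the full method this robust fitting is embedded inside a joint optimization over camera poses; the sketch only shows why a single adjustment per depthmap can be recovered reliably from noisy correspondences.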

Experimental Results

The experiments demonstrate that self-supervision over only five frames already improves the performance of state-of-the-art supervised depth and correspondence models. Moreover, Bundle-RANSAC-Adjustment achieves certifiable global optimality in pose estimation, surpassing contemporary methods, both optimization-based and neural network-based, in pose estimation quality.

Depth Estimation Performance

The paper evaluates the method on benchmark datasets, reporting numerical improvements in depth estimation over leading models. The metrics, including the thresholded accuracies $\delta < 1.25^{0.5}$ (written $\delta_{0.5}$) and $\delta < 1.25$, as well as RMSE, show consistent improvement:

  • ScanNet: ZoeDepth's $\delta_{0.5}$ of $0.877$ improves to $0.902$ with the proposed method.
  • KITTI360: ZeroDepth's $\delta_{0.5}$ improves from $0.584$ to $0.654$ through local SfM.
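The metrics above follow the standard monocular-depth evaluation protocol (assumed here): $\delta_k$ is the fraction of pixels whose prediction/ground-truth ratio falls within $1.25^{k}$, and RMSE is the root-mean-square error. A minimal implementation:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-evaluation metrics: thresholded accuracies
    delta < 1.25**k and root-mean-square error."""
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta_0.5": np.mean(ratio < 1.25 ** 0.5),
        "delta_1":   np.mean(ratio < 1.25),
        "rmse":      np.sqrt(np.mean((pred - gt) ** 2)),
    }
```

For example, against a constant ground truth of 2.0 m, predictions of 2.0, 2.2, 2.4, and 4.0 m give ratios 1.0, 1.1, 1.2, and 2.0, so $\delta_{0.5} = 0.5$ (threshold $\approx 1.118$) and $\delta_1 = 0.75$.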

Certifiable Global Optimality

The proposed algorithm's pose optimization has demonstrated certified global optimality in experiments, often resulting in the accurate reconstruction of camera poses and scene depths. This is particularly evident in videos containing rapid camera movements or complex scenes with sparse feature points.

Diverse Applications

Beyond enhancing depth estimation, the proposed approach facilitates multiple downstream applications:

  • Consistent Depth Mapping: The derived depth adjustments ensure temporally consistent depth maps, which are crucial for augmented reality (AR) and virtual reality (VR) applications.
  • Improved Correspondence Estimation: When combined with RGB-D inputs, the method enables more accurate projective correspondence estimation, evidenced by gains in metrics such as PCK-1 and AEPE on benchmarks including SUN3D and NYUv2.
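PCK-1 and AEPE have standard definitions in correspondence evaluation (assumed here): AEPE is the mean Euclidean distance between predicted and ground-truth match displacements, and PCK-1 is the fraction of matches within 1 pixel. A minimal implementation:

```python
import numpy as np

def correspondence_metrics(pred_flow, gt_flow):
    """AEPE (average end-point error) and PCK-1 (fraction of
    correspondences within a 1-pixel error), computed from
    per-match 2D displacement vectors."""
    epe = np.linalg.norm(pred_flow - gt_flow, axis=-1)
    return {"aepe": float(epe.mean()), "pck1": float(np.mean(epe < 1.0))}
```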

Implications and Future Directions

Practical Implications

The method holds particular promise for real-time depth estimation in dynamic environments, enhancing the robustness of computer vision tasks such as autonomous driving and 3D reconstruction. The integration of local SfM techniques with self-supervised learning paradigms offers a scalable solution that can leverage vast amounts of unlabeled video data.

Theoretical Contributions

The approach challenges the conventional boundary between SfM and self-supervised learning, providing a theoretical framework for integrating these methodologies. This integrated perspective could inform future research on optimizing depth and pose estimation algorithms.

Future Developments

Potential future directions include:

  • Scalability Enhancements: The current implementation leverages NeRF-like triangulation, which can be computationally intensive. Future research may focus on optimizing these components for large-scale applications.
  • Expanded Applications: Extending the methodology to additional tasks such as simultaneous localization and mapping (SLAM) and real-time video processing could further validate its versatility and practical utility.

Conclusion

By revisiting self-supervision through the lens of local SfM, this paper offers substantial improvements in both depth and correspondence estimation. It opens new avenues for integrating self-supervised and supervised techniques in computer vision, suggesting a promising paradigm shift that harnesses the strengths of both approaches. The robust numerical results and practical implications underscore the method's potential to advance the state of the art in visual scene understanding.
