Joint Unsupervised Learning of Optical Flow and Depth by Watching Stereo Videos

Published 8 Oct 2018 in cs.CV | (1810.03654v1)

Abstract: Learning depth and optical flow via deep neural networks by watching videos has made significant progress recently. In this paper, we jointly solve the two tasks by exploiting the underlying geometric rules within stereo videos. Specifically, given two consecutive stereo image pairs from a video, we first estimate depth, camera ego-motion and optical flow from three neural networks. Then the whole scene is decomposed into moving foreground and static background by comparing the estimated optical flow and rigid flow derived from the depth and ego-motion. We propose a novel consistency loss to let the optical flow learn from the more accurate rigid flow in static regions. We also design a rigid alignment module which helps refine ego-motion estimation by using the estimated depth and optical flow. Experiments on the KITTI dataset show that our results significantly outperform other state-of-the-art algorithms. Source codes can be found at https://github.com/baidu-research/UnDepthflow




Summary

The paper presents an unsupervised method that uses stereo videos to jointly estimate optical flow and depth, halving the optical flow error of earlier unsupervised methods on KITTI 2012.
It decomposes each scene into static background and moving foreground by comparing the estimated optical flow with the rigid flow derived from depth and ego-motion, and enforces geometric consistency in the static regions.
The method adapts PWC-net to stereo disparity estimation and has potential applications in autonomous navigation and object detection.

Unsupervised Learning of Optical Flow and Depth in Stereo Videos

This paper presents a method for jointly learning optical flow and depth from stereo video without supervision. The researchers use stereo input to address limitations of monocular unsupervised training. The method exploits geometric consistency across consecutive stereo image pairs, which improves scene understanding without ground-truth labels.

The framework combines several interconnected modules. Given two consecutive stereo image pairs, three neural networks estimate depth, camera ego-motion, and optical flow. A novel aspect of the approach is the decomposition of each scene into a static background and a moving foreground. This segmentation is obtained by comparing the estimated optical flow with the rigid flow, which is the flow implied by the estimated depth and camera motion alone. In static regions, a consistency loss encourages the optical flow to learn from the more accurate rigid flow. A rigid alignment module then refines the ego-motion estimate using both the depth and the optical flow estimates.
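The geometry described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the pinhole intrinsics matrix `K`, the L1 form of the loss, and the 3-pixel `threshold` are all assumptions chosen for illustration. The sketch back-projects each pixel with its depth, applies the camera motion, re-projects, and treats pixels where optical flow and rigid flow agree as static.

```python
import numpy as np


def rigid_flow(depth, K, R, t):
    """Flow induced by camera motion alone, assuming a fully static scene.

    depth: (H, W) depth map of frame 1.
    K:     (3, 3) pinhole intrinsics (assumed model).
    R, t:  rotation (3, 3) and translation (3,) that map frame-1 camera
           coordinates to frame-2 camera coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # back-project to unit-depth rays
    pts1 = rays * depth[..., None]                     # 3D points in frame 1
    pts2 = pts1 @ R.T + t                              # same points in frame 2
    proj = pts2 @ K.T                                  # project into image 2
    uv2 = proj[..., :2] / proj[..., 2:3]
    return uv2 - pix[..., :2]                          # (H, W, 2) displacement


def static_mask(flow, flow_rigid, threshold=3.0):
    """True where optical flow and rigid flow agree (hypothetical threshold, in pixels)."""
    return np.linalg.norm(flow - flow_rigid, axis=-1) < threshold


def consistency_loss(flow, flow_rigid, mask):
    """Mean L1 gap between optical and rigid flow over static pixels only."""
    diff = np.abs(flow - flow_rigid).sum(axis=-1)
    return float((diff * mask).sum() / max(mask.sum(), 1))
```

In a training setup the rigid flow would typically be treated as a fixed target inside this loss, so that the optical flow network is pulled toward it rather than the other way round; how the paper handles gradients is not stated in this summary.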

Evaluations on the KITTI dataset show notable improvements over existing techniques. On optical flow, the model significantly outperforms prior unsupervised methods on both KITTI 2012 and KITTI 2015 and achieves results comparable to supervised approaches. For example, it halves the error of the previous state-of-the-art unsupervised methods on KITTI 2012. These gains are largely credited to the joint handling of depth and optical flow and to the use of stereo video, which provides richer geometric information than monocular data alone.
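This summary does not say which error measure the comparison uses. As an assumption, the sketch below shows the two measures commonly reported for KITTI optical flow: the average end-point error (EPE) and the Fl outlier ratio, where a pixel counts as an outlier if its end-point error is at least 3 pixels and at least 5% of the ground-truth flow magnitude. The function name and the `valid` argument are illustrative.

```python
import numpy as np


def flow_errors(flow_pred, flow_gt, valid=None):
    """Return (EPE, Fl) for predicted vs. ground-truth flow of shape (H, W, 2).

    valid: optional boolean (H, W) mask of pixels that have ground truth.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel end-point error
    mag = np.linalg.norm(flow_gt, axis=-1)              # ground-truth flow magnitude
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    epe = err[valid].mean()
    outlier = (err >= 3.0) & (err >= 0.05 * mag)        # both conditions must hold
    fl = outlier[valid].mean()
    return float(epe), float(fl)
```

Under either measure, "halving the error" means the reported number drops to about half that of the compared method on the same split.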

Specific architectural choices also support the learning process. For instance, PWC-net, a network originally designed for optical flow, is modified to estimate stereo disparity, which shows how an existing model can be adapted to the problem domain. Such adaptations illustrate the care needed when transferring a framework built for one matching task to stereo applications.
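The summary does not describe how PWC-net is modified, so the following is only a sketch of why a flow network is a natural fit for stereo. In a rectified stereo pair, a point at column x in the left image appears at column x - d in the right image, so the left-to-right "flow" is purely horizontal and equal to (-d, 0). Depth then follows from the standard relation depth = focal length × baseline / disparity. The function names and the calibration values in the usage note are illustrative assumptions.

```python
import numpy as np


def disparity_from_horizontal_flow(flow_left_to_right):
    """Read disparity off the horizontal channel of a left-to-right flow field.

    Assumes a rectified pair, so the vertical channel is ignored and the
    horizontal displacement is non-positive; negative disparities are clipped.
    """
    return np.maximum(-flow_left_to_right[..., 0], 0.0)


def depth_from_disparity(disparity, focal_px, baseline_m, eps=1e-6):
    """Standard stereo relation: depth = focal length * baseline / disparity."""
    return focal_px * baseline_m / np.maximum(disparity, eps)
```

For example, with an assumed focal length of 720 pixels and a baseline of 0.5 m, a horizontal flow of -36 pixels corresponds to a disparity of 36 pixels and a depth of 10 m.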

The implications of this research extend to several applications in autonomous systems, where depth perception and motion understanding are crucial. The proposed unsupervised method presents a pathway to more robust and flexible learning systems that do not rely on extensive labeled datasets. By improving scene flow understanding, the approach could enhance tasks such as object detection and autonomous navigation in complex environments.

While this paper makes considerable progress on unsupervised learning in stereo vision, it also acknowledges areas for future exploration. Motion segmentation, although improved, remains a limitation, and better segmentation could make the propagation of rigid flow more accurate. Future research could explore more sophisticated segmentation strategies and extend the method to highly dynamic environments, where static scene assumptions are less viable.

In summary, the proposed joint learning approach offers a cohesive framework that leverages stereo video data to advance the unsupervised learning of optical flow and depth. It represents a significant step forward in utilizing geometric consistency to address the limitations of previous unsupervised methodologies and sets the stage for further research into deeply integrated perception systems.
