Instance-wise Depth and Motion Learning from Monocular Videos

Published 19 Dec 2019 in cs.CV, cs.LG, and cs.RO | (1912.09351v2)

Abstract: We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we propose a differentiable forward rigid projection module that plays a key role in our instance-wise depth and motion learning. Second, we design an instance-wise photometric and geometric consistency loss that effectively decomposes background and moving object regions. Lastly, we introduce a new auto-annotation scheme to produce video instance segmentation maps that will be utilized as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI dataset, our framework is shown to outperform the state-of-the-art depth and motion estimation methods. Our code and dataset will be available at https://github.com/SeokjuLee/Insta-DM.

Summary

The paper introduces a differentiable forward rigid projection module for handling the motion of dynamic objects in monocular videos.
It adds instance-wise photometric and geometric consistency losses that separate background regions from independently moving objects.
The framework is validated on the KITTI dataset, where it is reported to outperform prior depth and motion estimation methods, including in dynamic scenes.

Instance-wise Depth and Motion Learning from Monocular Videos

The paper "Instance-wise Depth and Motion Learning from Monocular Videos" presents a framework for jointly estimating depth, camera ego-motion, and the 6-DoF motion of individual dynamic objects from monocular video, a task highly relevant to autonomous navigation. Unsupervised depth and motion estimation from monocular sequences is difficult for two reasons: no ground-truth supervision is available, and in dynamic scenes objects move independently of the camera, so their apparent motion cannot be explained by ego-motion alone.

Key Contributions

The authors introduce a framework that addresses these challenges through three primary contributions:

Differentiable Forward Rigid Projection Module: The authors propose a forward image warping technique implemented as a differentiable module, which the framework relies on to handle dynamic object motion in monocular sequences. Unlike conventional inverse warping, which struggles with moving objects and can introduce spatial inconsistencies, forward projection keeps the warped result geometrically plausible, and the module addresses the warping artifacts and hole filling that forward warping entails.
Instance-wise Photometric and Geometric Consistency Losses: By incorporating instance-wise loss functions, the authors separate background motion from object-specific motion. Video frames are decomposed into static and dynamic regions using instance segmentation masks, and the view synthesis appropriate to each region is applied. This separation permits accurate inference even in dynamic scenes.
Auto-annotation Scheme for Video Instance Segmentation: The paper introduces an auto-annotation scheme that produces video instance segmentation maps, which serve as input to the training pipeline. These maps make possible the decomposition needed to optimize depth and motion predictions.
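The geometry behind forward rigid projection can be illustrated with a small NumPy sketch. It is not the authors' module: the function name `forward_rigid_project`, the symbols `K` (camera intrinsics), `R` and `t` (a rigid motion from the source camera to the target camera), and the nearest-pixel splatting with a z-buffer are all assumptions made for illustration. Rounding to the nearest pixel is also not differentiable with respect to depth or pose, so this shows only where pixels land, following p' ~ K (R D(p) K^{-1} p + t) for a pixel p with depth D(p).

```python
import numpy as np

def forward_rigid_project(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) from a source view into a target view.

    depth: (H, W) source-view depth; K: (3, 3) intrinsics (assumed pinhole);
    R (3, 3), t (3,): rigid motion taking source-camera points to the target camera.
    Returns the warped image and a boolean mask of target pixels that were hit.
    """
    H, W = depth.shape
    # Homogeneous pixel grid, shape (3, H*W), in row-major order.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)

    # Back-project to 3-D points, then apply the rigid motion.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts = R @ pts + t.reshape(3, 1)

    # Keep points in front of the target camera and project them.
    front = pts[2] > 1e-6
    proj = K @ pts[:, front]
    x = np.rint(proj[0] / proj[2]).astype(int)
    y = np.rint(proj[1] / proj[2]).astype(int)
    z = pts[2, front]
    colors = image.reshape(H * W, -1)[front]

    # Discard points that fall outside the target image.
    inside = (x >= 0) & (x < W) & (y >= 0) & (y < H)
    x, y, z, colors = x[inside], y[inside], z[inside], colors[inside]

    # Z-buffer: when several source pixels land on one target pixel,
    # keep the one closest to the camera.
    zbuf = np.full((H, W), np.inf)
    np.minimum.at(zbuf, (y, x), z)
    winner = z == zbuf[y, x]

    warped = np.zeros_like(image)
    warped[y[winner], x[winner]] = colors[winner]
    filled = np.isfinite(zbuf)  # False marks holes left by forward warping
    return warped, filled
```

Target pixels where `filled` is `False` received no source pixel. These are the holes that any forward warping scheme has to handle, and an instance-wise variant would apply a separate `R`, `t` to the pixels of each object mask.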

Performance Evaluation

The authors validate their approach on the KITTI dataset and report better performance than contemporary unsupervised depth and motion estimation methods, in both static and dynamic regions of the test videos. In particular, the paper reports lower Absolute Relative (AbsRel) error, most notably within object regions, which indicates that the model estimates depth more accurately in the presence of moving objects.
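For readers unfamiliar with the metric, AbsRel is the mean of |d_pred - d_gt| / d_gt over pixels with valid ground truth. The sketch below shows how it could be computed over the whole image or restricted to object regions. The function name `abs_rel`, the optional `mask` argument, and the convention that a ground-truth depth of zero means "no measurement" are assumptions for illustration, not details taken from the paper's evaluation code.

```python
import numpy as np

def abs_rel(pred, gt, mask=None):
    """Absolute Relative error: mean(|pred - gt| / gt) over valid pixels.

    pred, gt: depth maps of the same shape.
    mask: optional boolean array selecting a region (e.g. object pixels).
    Pixels with gt <= 0 are treated as missing ground truth (assumed convention).
    """
    valid = gt > 0
    if mask is not None:
        valid = valid & mask.astype(bool)
    if not valid.any():
        return float("nan")  # nothing to evaluate in this region
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))
```

Calling `abs_rel(pred, gt, object_mask)` and `abs_rel(pred, gt, ~object_mask)` separately gives the kind of object-region versus background breakdown described above.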

Implications and Future Directions

This work is relevant to self-supervised systems that need detailed motion and depth estimates from monocular video. It addresses a key problem in autonomous driving, where distinguishing the static environment from dynamic objects is crucial for navigation and decision-making.

Future work could extend the method to more complex scenarios, such as deformable objects, which a rigid 6-DoF motion model does not capture, or urban environments that the KITTI dataset does not cover well. Integrating the instance-aware mechanism with other perception models could also benefit scene understanding tasks such as real-time segmentation and object detection.

Overall, this paper provides a solid foundation for future research in self-supervised depth and motion learning from monocular inputs, paving the way for refined models capable of real-world application in autonomous systems.
