Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation

Published 19 Dec 2020 in cs.CV | (2012.10782v2)

Abstract: Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process. To address this issue, we present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences. In particular, we propose three key contributions: (1) We transfer knowledge from features learned during self-supervised depth estimation to semantic segmentation, (2) we implement a strong data augmentation by blending images and labels using the geometry of the scene, and (3) we utilize the depth feature diversity as well as the level of difficulty of learning depth in a student-teacher framework to select the most useful samples to be annotated for semantic segmentation. We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains, and we achieve state-of-the-art results for semi-supervised semantic segmentation. The implementation is available at https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.




Summary

The paper’s main contribution is using self-supervised depth estimation to improve semantic segmentation in three ways: knowledge transfer from depth features, geometry-aware data augmentation, and automatic selection of the samples to annotate.
It uses self-supervised monocular depth features to enhance performance, achieving up to 92% of the full annotation baseline with only 1/30 of the labels.
The study demonstrates practical benefits for data-scarce fields, reducing annotation costs and facilitating efficient model training in applications like autonomous driving.

Enhancements in Semantic Segmentation through Self-Supervised Depth Estimation

The paper investigates how self-supervised depth estimation (SDE) can be combined with semantic segmentation to reduce the need for labeled data, a significant bottleneck in deploying deep semantic segmentation networks. The authors propose a framework that improves semi-supervised semantic segmentation by exploiting depth information learned through self-supervised monocular depth estimation on unlabeled image sequences. The framework makes three contributions: knowledge transfer, geometry-aware data augmentation, and automatic data selection for annotation.

The methodology begins with leveraging SDE as an auxiliary task. By utilizing the features learned during depth estimation, improvements in semantic segmentation are observed, especially when the labeled data is scarce. This transfer learning approach not only saves resources but also enriches the semantic segmentation process by integrating geometric insights inherent in depth estimation.
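One simple form of this kind of transfer is to initialize the segmentation network's encoder from depth-pretrained weights. The sketch below illustrates only that idea and is not the authors' implementation: the function `init_from_depth`, the parameter names (`encoder.conv1`, `decoder.disp`, `head.classifier`), and the use of plain NumPy dictionaries in place of a real model are all assumptions made for illustration.

```python
import numpy as np


def init_from_depth(seg_params, depth_params, prefix="encoder."):
    """Copy depth-pretrained weights into a segmentation parameter dict.

    Only entries whose name starts with `prefix` and whose shape matches
    are copied; everything else (e.g. the segmentation head) is kept.
    Returns the new dict and the list of transferred names.
    """
    merged = dict(seg_params)
    transferred = []
    for name, weight in depth_params.items():
        if not name.startswith(prefix):
            continue  # skip depth-specific layers such as the depth decoder
        if name in merged and merged[name].shape == weight.shape:
            merged[name] = weight.copy()
            transferred.append(name)
    return merged, transferred


# Hypothetical parameter names and shapes, for illustration only.
rng = np.random.default_rng(0)
depth_params = {
    "encoder.conv1": rng.normal(size=(8, 3, 3, 3)),
    "decoder.disp": rng.normal(size=(1, 8, 3, 3)),
}
seg_params = {
    "encoder.conv1": np.zeros((8, 3, 3, 3)),
    "head.classifier": np.zeros((19, 8, 1, 1)),
}
seg_params, transferred = init_from_depth(seg_params, depth_params)
print(transferred)  # ['encoder.conv1']
```

Only the shared encoder is overwritten here; the segmentation head keeps its own initialization and would still be trained on the available labels.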

The second contribution is DepthMix, a data augmentation strategy. Conventional augmentation methods often fail to preserve the structure of a scene. DepthMix instead uses depth estimates from SDE to blend two images and their labels in a way that respects scene geometry. This reduces blending artifacts and keeps the augmented images realistic, which matters for effective model training.
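A depth-aware blend of this kind can be sketched as follows. The rule used here, keeping at each pixel whichever of the two samples is closer to the camera, is an assumption about how such mixing might work and not a statement of the paper's exact formulation. The function name `depth_mix` and the tolerance `eps` are likewise hypothetical.

```python
import numpy as np


def depth_mix(img_a, lab_a, depth_a, img_b, lab_b, depth_b, eps=0.0):
    """Blend two samples so the nearer surface wins at every pixel.

    img_*:   (H, W, 3) images
    lab_*:   (H, W) integer label maps
    depth_*: (H, W) estimated depth, smaller means closer to the camera
    """
    # Mask is True where sample A is closer than sample B (plus a tolerance).
    mask = depth_a < depth_b + eps
    mixed_img = np.where(mask[..., None], img_a, img_b)
    mixed_lab = np.where(mask, lab_a, lab_b)
    return mixed_img, mixed_lab, mask


# Toy 2x2 example: A is near in the left column and far in the right column.
img_a = np.zeros((2, 2, 3), dtype=np.uint8)
img_b = np.full((2, 2, 3), 255, dtype=np.uint8)
lab_a = np.full((2, 2), 1)
lab_b = np.full((2, 2), 2)
depth_a = np.array([[1.0, 5.0], [1.0, 5.0]])
depth_b = np.full((2, 2), 3.0)

mixed_img, mixed_lab, mask = depth_mix(img_a, lab_a, depth_a, img_b, lab_b, depth_b)
print(mixed_lab.tolist())  # [[1, 2], [1, 2]]
```

Because the same mask is applied to the image and to the label map, the mixed label stays pixel-aligned with the mixed image.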

Lastly, the paper introduces a technique for automatically selecting which samples to annotate. It uses two signals within a student-teacher framework: the diversity of depth features and the level of difficulty of learning depth for a given sample. Together these identify the most useful samples to label for semantic segmentation. Because depth estimation serves as the proxy for sample selection, no human is needed in the selection loop, and annotation effort is spent only on the selected samples.
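The interplay of the two selection signals can be illustrated with a small greedy procedure. This is a minimal sketch under stated assumptions and not the paper's algorithm: it assumes each unlabeled image already has a depth-feature vector and a scalar difficulty score (for example, a student depth network's error against its teacher), and it combines the two with a hypothetical `weight` parameter. The function name `select_for_annotation` is also made up for this example.

```python
import math


def select_for_annotation(features, difficulty, budget, weight=1.0):
    """Greedily pick `budget` sample indices that are diverse and hard.

    features:   one depth-feature vector per unlabeled image
    difficulty: one scalar per image (higher means harder to learn depth)
    """
    if budget <= 0 or not features:
        return []

    remaining = set(range(len(features)))
    # Seed with the hardest sample, since no distances exist yet.
    first = max(remaining, key=lambda i: difficulty[i])
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < budget:
        def score(i):
            # Diversity: distance to the closest already-selected sample.
            nearest = min(math.dist(features[i], features[j]) for j in selected)
            return nearest + weight * difficulty[i]

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: four images with 2-D features and made-up difficulty scores.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 0.0]]
difficulty = [0.1, 0.9, 0.2, 0.3]
print(select_for_annotation(features, difficulty, budget=3))  # [1, 3, 2]
```

In the toy run, image 0 is never chosen: it sits almost on top of image 1 in feature space and is easy, so it adds little on either criterion.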

The framework is validated empirically on the Cityscapes dataset, where all three components yield performance gains and the combined method achieves state-of-the-art results for semi-supervised semantic segmentation. Specifically, it reaches up to 92% of the full annotation baseline using only 1/30 of the labels, and it surpasses that baseline when the labeled share is increased to 1/8.
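To make the label fractions concrete, the short calculation below converts them into image counts. It assumes the standard Cityscapes fine-annotation training split of 2,975 images, a figure that is not stated in the text above, and it assumes fractional counts are rounded up.

```python
import math

# Assumption: standard Cityscapes training split with fine annotations.
CITYSCAPES_TRAIN_IMAGES = 2975


def labeled_count(denominator, total=CITYSCAPES_TRAIN_IMAGES):
    """Number of labeled images for a 1/denominator label budget (rounded up)."""
    return math.ceil(total / denominator)


print(labeled_count(30))  # 100
print(labeled_count(8))   # 372
```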

The work has both practical and theoretical implications. Practically, self-supervised auxiliary tasks such as SDE could benefit fields that depend on pixel-precise segmentation by reducing their reliance on annotated datasets. Theoretically, the work underscores the potential of multitask learning to improve performance across related tasks. Further work on combining self-supervised learning with other structured prediction tasks could refine these techniques.

In conclusion, the paper presents a compelling case for self-supervised depth estimation as a transformative tool in semantic segmentation. As the field progresses, embracing unsupervised and semi-supervised methods, particularly in data-hungry domains like autonomous driving and robotic vision, will be critical. Future research can explore the extension of these techniques to different visual domains and datasets, potentially catalyzing advances in robust, deployment-ready segmentation models.
