Deep Watershed Transform for Instance Segmentation (1611.08303v2)

Published 24 Nov 2016 in cs.CV

Abstract: Most contemporary approaches to instance segmentation use complex pipelines involving conditional random fields, recurrent neural networks, object proposals, or template matching schemes. In our paper, we present a simple yet powerful end-to-end convolutional neural network to tackle this task. Our approach combines intuitions from the classical watershed transform and modern deep learning to produce an energy map of the image where object instances are unambiguously represented as basins in the energy map. We then perform a cut at a single energy level to directly yield connected components corresponding to object instances. Our model more than doubles the performance of the state-of-the-art on the challenging Cityscapes Instance Level Segmentation task.

Summary

The paper presents a CNN that learns a watershed-style energy map in which each object instance appears as a distinct basin.
It combines direction prediction and energy estimation in a single end-to-end network to sharpen boundary localization and avoid over-segmentation.
On the Cityscapes instance-level segmentation task, the model more than doubles the AP of the previous state of the art; the intended applications include robotics and autonomous driving.

The paper "Deep Watershed Transform for Instance Segmentation" by Min Bai and Raquel Urtasun introduces an instance segmentation method built on a convolutional neural network (CNN) and inspired by the classical watershed transform. Instead of the multi-stage pipelines typically used for this task, a single end-to-end network predicts an energy map in which each object instance forms its own basin, so that instances can be read off directly from the map.

Overview

Instance segmentation requires classifying each pixel's semantic category and also assigning each pixel to a specific object instance. The task is fundamental for applications in robotics and autonomous driving, where street scenes contain objects at many scales and frequent occlusions. Earlier methods rely on intricate pipelines involving conditional random fields (CRFs), recurrent neural networks (RNNs), object proposals, or template matching. In contrast, the proposed method combines the topographical idea behind the watershed transform with deep learning in one compact model.

Methodology

The core idea is to train a CNN to predict an energy landscape in which each object instance forms a distinct energy basin. The classical watershed transform is usually applied to the image gradient, where noise produces many small basins and therefore over-segmentation; learning the energy map instead is meant to avoid this. The process involves:

Direction Network (DN): An intermediate task is introduced to predict the direction of descent of the watershed energy at each pixel, enabling the network to focus on boundary localization. This part of the network is pre-trained to output a unit vector signifying the direction away from the nearest object boundary.
Watershed Transform Network (WTN): This network takes the predicted direction field and outputs a discretized energy map that plays the role of the watershed transform. Accuracy matters most at the lowest energy levels near object boundaries, because errors there merge or split instances.
End-to-End Training: The two networks are then fine-tuned jointly to optimize both direction prediction and energy estimation. The resulting energy map is thresholded at a single level, and the connected components that remain are taken directly as object instances.
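
The three steps above can be illustrated without any neural network by building the kind of targets the networks would regress and then applying the single-level cut. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a breadth-first city-block distance as a stand-in for the distance to the instance boundary, and the function names (`boundary_distance`, `direction_field`, `energy_levels`, `cut`) and the bin edges `(2, 4)` are hypothetical choices made for illustration.

```python
import math
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connected neighbours


def boundary_distance(ids):
    """BFS (city-block) distance from each object pixel to its instance boundary."""
    h, w = len(ids), len(ids[0])
    dist = [[0] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if ids[y][x] == 0:  # 0 marks background in this sketch
                continue
            for dy, dx in STEPS:
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < h and 0 <= nx < w
                if not inside or ids[ny][nx] != ids[y][x]:
                    dist[y][x] = 1  # touches another label or the image edge
                    queue.append((y, x))
                    break
    while queue:  # grow inward, never crossing into another instance
        y, x = queue.popleft()
        for dy, dx in STEPS:
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and dist[ny][nx] == 0
                    and ids[ny][nx] == ids[y][x]):
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def direction_field(ids, dist):
    """Unit vectors (dx, dy) along the distance gradient, i.e. away from the boundary."""
    h, w = len(ids), len(ids[0])

    def d(y, x, own):  # distance seen from instance `own`; 0 outside it
        inside = 0 <= y < h and 0 <= x < w
        return dist[y][x] if inside and ids[y][x] == own else 0

    field = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            own = ids[y][x]
            if own == 0:
                continue
            gx = d(y, x + 1, own) - d(y, x - 1, own)
            gy = d(y + 1, x, own) - d(y - 1, x, own)
            norm = math.hypot(gx, gy)
            if norm > 0:
                field[y][x] = (gx / norm, gy / norm)
    return field


def energy_levels(dist, bin_edges=(2, 4)):
    """Discretize distance into integer levels; level 0 is background plus a thin boundary band."""
    return [[sum(d >= edge for edge in bin_edges) for d in row] for row in dist]


def cut(energy, level=1):
    """Label the 4-connected components of pixels whose energy is >= level."""
    h, w = len(energy), len(energy[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if energy[sy][sx] < level or labels[sy][sx]:
                continue
            count += 1
            labels[sy][sx] = count
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in STEPS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny][nx]
                            and energy[ny][nx] >= level):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Two 8x8 instances that touch along a vertical edge.
ids = [[0] * 18 for _ in range(10)]
for y in range(1, 9):
    for x in range(1, 17):
        ids[y][x] = 1 if x <= 8 else 2

dist = boundary_distance(ids)
field = direction_field(ids, dist)
energy = energy_levels(dist)
labels, count = cut(energy, level=1)
foreground = [[int(v > 0) for v in row] for row in ids]
_, merged = cut(foreground, level=1)
print(count, merged)  # 2 1
```

In this toy scene the two squares share an edge, so connected components on the plain foreground mask return one blob, while the cut at energy level 1 removes the thin low-energy band along every boundary and returns two components. The cut also shrinks each instance by that band, so a practical pipeline would need some step to recover those boundary pixels.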

Results

The proposed model was evaluated on the Cityscapes benchmark for instance-level segmentation, where it more than doubled the average precision (AP) of the previous state-of-the-art methods. The architecture is compact, which favors both speed and accuracy, and improvements are noted across all object classes.

Discussion and Implications

The simplicity and effectiveness of this approach mark a clear departure from the heavily engineered pipelines traditionally used in instance segmentation. Practically, such a model could be integrated into perception systems for on-road vehicles and robots. Theoretically, it opens the way to hybrid models that combine bottom-up and top-down strategies, as well as joint models for semantic and instance segmentation.

Future Directions

The paper points out that future efforts should focus on improving segmentation in cases of occlusion and exploring semantic-instance joint segmentation models. Further refinement might involve developing strategies for better semantic class robustness or iterative refinement schemes to handle more complex scene compositions.

The Deep Watershed Transform is a compelling step forward in computer vision: it shows that a classical method such as the watershed transform can be revived and made effective by learning its energy function with a deep network.
