- The paper introduces AffinityNet to predict pixel-level semantic affinities and transform coarse CAMs into complete segmentation masks.
- A random walk guided by the predicted affinities propagates CAM activations across whole objects, boosting the mean IoU from 46.8 to 57.0 on the PASCAL VOC 2012 dataset.
- Results demonstrate a promising shift towards scalable weakly supervised segmentation, reducing reliance on full pixel annotations.
Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation
The paper "Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation" tackles the challenge of semantic segmentation when only image-level class labels are available. It introduces an innovative framework that synthesizes pixel-level segmentation labels to be used as training annotations, thus bridging the gap between weakly supervised learning and fully annotated datasets. The work focuses on addressing the inefficacy of current methods that predominantly segment only highly discriminative parts of objects by proposing the novel AffinityNet model to predict semantic affinities.
Problem Setting and Motivation
Semantic segmentation typically requires extensive pixel-level annotations, which are labor-intensive and costly to generate. To mitigate this, weakly supervised approaches use less detailed annotations such as image-level labels, making training datasets cheaper to compile but yielding coarser segmentations. The main problem with such weakly supervised techniques is their tendency to localize only the most discriminative parts of objects, neglecting less prominent yet relevant regions.
Methodology
The key contribution of this paper is AffinityNet, a deep neural network specifically designed to predict semantic affinities between pairs of adjacent coordinates in an image. The semantic affinities are then utilized in a random walk process to diffuse initial class activation map (CAM) responses across the image, effectively covering entire object areas.
Computing CAMs
CAMs serve as initial rough segmentation maps that identify the most discriminative regions of objects. They are generated by a typical classification network with global average pooling followed by a fully connected layer: the classifier weights for each class re-weight the final convolutional feature maps. The resulting maps are normalized by their maximum activation and processed to identify confident regions for both objects and background.
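For concreteness, the following is a minimal NumPy sketch of this computation, not the authors' code: the classifier weights for a class re-weight the final convolutional feature maps, and the result is clipped at zero and normalized by its maximum response. The names `features` and `fc_weights` are illustrative.

```python
import numpy as np

def compute_cam(features, fc_weights, class_idx):
    """Compute a class activation map (CAM) for one class.

    features:   (C, H, W) feature maps from the last convolutional layer.
    fc_weights: (num_classes, C) weights of the fully connected layer
                that follows global average pooling.
    class_idx:  index of the target class.
    """
    # Weighted sum of feature channels using the class's classifier weights.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    # Keep positive evidence only and normalize to [0, 1] by the max response.
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-5)
```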
AffinityNet Training
AffinityNet is trained on pairs of coordinates derived from the CAMs. By focusing on adjacent pairs within a small radius, the network leverages localized context to predict semantic affinities. Training data is generated by identifying confident object and background regions in the CAMs and assigning binary affinity labels to coordinate pairs: a pair is positive if both coordinates fall in the same confident class region (object or background), and negative if they belong to different classes; pairs that touch unreliable, neutral regions are ignored. This binary labeling enables focused and effective learning of semantic affinities.
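A sketch of this pair-labeling rule is shown below, assuming `cam_labels` is an (H, W) map of confident per-pixel class estimates derived from the CAMs, with a sentinel value marking neutral (unreliable) pixels; the names and sampling scheme are illustrative rather than the authors' implementation.

```python
import numpy as np

NEUTRAL = 255  # sentinel for unreliable pixels, skipped during training

def affinity_labels(cam_labels, radius=5):
    """Generate binary affinity labels for pixel pairs within `radius`.

    Returns ((y1, x1), (y2, x2), label) triples, where label is 1 for
    pairs sharing the same confident class and 0 for differing classes.
    Pairs that include a neutral pixel are skipped.
    """
    H, W = cam_labels.shape
    pairs = []
    for y in range(H):
        for x in range(W):
            a = cam_labels[y, x]
            if a == NEUTRAL:
                continue
            # Scan only "forward" neighbors to avoid duplicating symmetric pairs.
            for dy in range(0, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx <= 0:
                        continue
                    # Restrict to a circular neighborhood of the given radius.
                    if dy * dy + dx * dx > radius * radius:
                        continue
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < H and 0 <= nx < W):
                        continue
                    b = cam_labels[ny, nx]
                    if b == NEUTRAL:
                        continue
                    pairs.append(((y, x), (ny, nx), int(a == b)))
    return pairs
```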
Random Walk Propagation
Once trained, AffinityNet's predicted affinities define a transition probability matrix for a random walk on the image's pixel grid. The walk diffuses initial CAM activations into spatially adjacent regions with similar semantics, refining the coarse CAMs into more complete segmentation masks.
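A minimal sketch of this step is given below, assuming for simplicity a dense affinity matrix over all H*W pixels (the actual method restricts affinities to local neighborhoods, keeping the matrix sparse). Raising the affinities to an element-wise power before row-normalization suppresses weak connections; `beta` and `n_iters` are illustrative hyperparameters.

```python
import numpy as np

def random_walk_refine(cam, aff, beta=8, n_iters=16):
    """Refine a CAM via a random walk on pixel-level affinities.

    cam:     (H, W) initial class activation map.
    aff:     (H*W, H*W) symmetric pixel-affinity matrix from AffinityNet.
    beta:    element-wise power that suppresses weak affinities.
    n_iters: number of diffusion (matrix multiplication) steps.
    """
    h, w = cam.shape
    A = aff ** beta                                # penalize small affinities
    T = A / (A.sum(axis=1, keepdims=True) + 1e-5)  # row-normalize: transition matrix
    vec = cam.reshape(-1)
    for _ in range(n_iters):
        vec = T @ vec                              # one random-walk step
    return vec.reshape(h, w)
```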
Numerical Results and Comparisons
The method is evaluated on the PASCAL VOC 2012 dataset, where it significantly outperforms previous state-of-the-art weakly supervised approaches that rely on image-level labels:
- Initial CAMs achieve a mean Intersection-over-Union (IoU) score of 46.8.
- Incorporating AffinityNet’s random walk refinement increases the mean IoU to 57.0.
- With additional refinement using dense CRFs, the segmentation performance reaches 58.7 mean IoU, which convincingly surpasses other weakly supervised methods.
Moreover, segmentation networks trained on the synthesized labels achieve strong performance. For instance, the Ours-ResNet38 model achieves 61.7 mean IoU on the validation set, substantially narrowing the gap with fully supervised methods such as DeepLab and setting a new benchmark for weakly supervised segmentation models.
Implications and Future Work
This research has practical implications for reducing dependence on fully annotated datasets by effectively exploiting weak supervision. The proposed framework can support broader adoption of segmentation models in real-world applications where annotating every pixel is impractical. Future work could explore transfer learning, where AffinityNet leverages knowledge from auxiliary datasets to improve segmentation in target domains. Another promising direction is extending the approach to weakly supervised semantic boundary detection, given AffinityNet's demonstrated ability to discern semantic edges.
In conclusion, this paper offers a robust solution to a significant hurdle in semantic segmentation, making strides towards more practical and scalable training paradigms through innovative use of neural networks and classic random walk techniques.