- The paper introduces AffinityNet to predict pixel-level semantic affinities and transform coarse CAMs into complete segmentation masks.
- A random walk guided by the predicted affinities propagates CAM activations across whole objects, boosting the mean IoU from 46.8 to 57.0 on the PASCAL VOC 2012 dataset.
- Results demonstrate a promising shift towards scalable weakly supervised segmentation, reducing reliance on full pixel annotations.
Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation
The paper "Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation" tackles the challenge of semantic segmentation when only image-level class labels are available. It introduces an innovative framework that synthesizes pixel-level segmentation labels to be used as training annotations, thus bridging the gap between weakly supervised learning and fully annotated datasets. The work focuses on addressing the inefficacy of current methods that predominantly segment only highly discriminative parts of objects by proposing the novel AffinityNet model to predict semantic affinities.
Problem Setting and Motivation
Semantic segmentation typically requires extensive pixel-level annotations, which are labor-intensive and costly to generate. To mitigate this, weakly supervised approaches use less detailed annotations such as image-level labels, making training datasets cheaper to compile but yielding coarser segmentations. The main problem with such weakly supervised techniques is their tendency to localize only the most discriminative parts of objects, neglecting less prominent yet relevant regions.
Methodology
The key contribution of this paper is AffinityNet, a deep neural network specifically designed to predict semantic affinities between pairs of adjacent coordinates in an image. The semantic affinities are then utilized in a random walk process to diffuse initial class activation map (CAM) responses across the image, effectively covering entire object areas.
Computing CAMs
CAMs serve as initial rough segmentation maps that identify the most discriminative regions of objects. They are generated by a typical classification network with global average pooling followed by a fully connected layer: the classifier weights for each class re-weight the final convolutional feature maps. The resulting maps are normalized by their maximum activation and processed to identify confident regions for both objects and background.
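For concreteness, the following is a minimal NumPy sketch of this computation, not the authors' code: the classifier weights for a class re-weight the final convolutional feature maps, and the result is clipped at zero and normalized by its maximum response. The names `features` and `fc_weights` are illustrative.

```python
import numpy as np

def compute_cam(features, fc_weights, class_idx):
    """Compute a class activation map (CAM) for one class.

    features:   (C, H, W) feature maps from the last convolutional layer.
    fc_weights: (num_classes, C) weights of the fully connected layer
                that follows global average pooling.
    class_idx:  index of the target class.
    """
    # Weighted sum of feature channels using the class's classifier weights.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    # Keep positive evidence only and normalize to [0, 1] by the max response.
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-5)
```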
AffinityNet Training
AffinityNet is trained on pairs of coordinates derived from the CAMs. By focusing on adjacent pairs within a small radius, the network leverages localized context to predict semantic affinities. Training data is generated by identifying confident object and background regions in the CAMs and assigning binary affinity labels to coordinate pairs: a pair is positive if both coordinates fall in the same confident class region (object or background), and negative if they belong to different classes; pairs that touch unreliable, neutral regions are ignored. This binary labeling enables focused and effective learning of semantic affinities.
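A sketch of this pair-labeling rule is shown below, assuming `cam_labels` is an (H, W) map of confident per-pixel class estimates derived from the CAMs, with a sentinel value marking neutral (unreliable) pixels; the names and sampling scheme are illustrative rather than the authors' implementation.

```python
import numpy as np

NEUTRAL = 255  # sentinel for unreliable pixels, skipped during training

def affinity_labels(cam_labels, radius=5):
    """Generate binary affinity labels for pixel pairs within `radius`.

    Returns ((y1, x1), (y2, x2), label) triples, where label is 1 for
    pairs sharing the same confident class and 0 for differing classes.
    Pairs that include a neutral pixel are skipped.
    """
    H, W = cam_labels.shape
    pairs = []
    for y in range(H):
        for x in range(W):
            a = cam_labels[y, x]
            if a == NEUTRAL:
                continue
            # Scan only "forward" neighbors to avoid duplicating symmetric pairs.
            for dy in range(0, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx <= 0:
                        continue
                    # Restrict to a circular neighborhood of the given radius.
                    if dy * dy + dx * dx > radius * radius:
                        continue
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < H and 0 <= nx < W):
                        continue
                    b = cam_labels[ny, nx]
                    if b == NEUTRAL:
                        continue
                    pairs.append(((y, x), (ny, nx), int(a == b)))
    return pairs
```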
Random Walk Propagation
Once trained, AffinityNet's predicted affinities define a transition probability matrix for a random walk on the image's pixel grid. The walk diffuses initial CAM activations into spatially adjacent regions with similar semantics, refining the coarse CAMs into more complete segmentation masks.
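A minimal sketch of this step is given below, assuming for simplicity a dense affinity matrix over all H*W pixels (the actual method restricts affinities to local neighborhoods, keeping the matrix sparse). Raising the affinities to an element-wise power before row-normalization suppresses weak connections; `beta` and `n_iters` are illustrative hyperparameters.

```python
import numpy as np

def random_walk_refine(cam, aff, beta=8, n_iters=16):
    """Refine a CAM via a random walk on pixel-level affinities.

    cam:     (H, W) initial class activation map.
    aff:     (H*W, H*W) symmetric pixel-affinity matrix from AffinityNet.
    beta:    element-wise power that suppresses weak affinities.
    n_iters: number of diffusion (matrix multiplication) steps.
    """
    h, w = cam.shape
    A = aff ** beta                                # penalize small affinities
    T = A / (A.sum(axis=1, keepdims=True) + 1e-5)  # row-normalize: transition matrix
    vec = cam.reshape(-1)
    for _ in range(n_iters):
        vec = T @ vec                              # one random-walk step
    return vec.reshape(h, w)
```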
Numerical Results and Comparisons
The method is evaluated on the PASCAL VOC 2012 dataset, where it significantly outperforms previous state-of-the-art weakly supervised approaches that rely on image-level labels:
- Initial CAMs achieve a mean Intersection-over-Union (IoU) score of 46.8.
- Incorporating AffinityNet’s random walk refinement increases the mean IoU to 57.0.
- With additional refinement using dense CRFs, the segmentation performance reaches 58.7 mean IoU, which convincingly surpasses other weakly supervised methods.
Moreover, segmentation networks trained on the synthesized labels achieve strong performance. For instance, the Ours-ResNet38 model achieves 61.7 mean IoU on the validation set, substantially narrowing the gap with fully supervised methods such as DeepLab and setting a new benchmark for weakly supervised segmentation models.
Implications and Future Work
This research has practical implications for reducing dependence on fully annotated datasets by effectively exploiting weak supervision. The proposed framework can support broader adoption of segmentation models in real-world applications where annotating every pixel is impractical. Future work could explore transfer learning, where AffinityNet leverages knowledge from auxiliary datasets to improve segmentation in target domains. Another promising direction is extending the approach to weakly supervised semantic boundary detection, given AffinityNet's demonstrated ability to discern semantic edges.
In conclusion, this paper offers a robust solution to a significant hurdle in semantic segmentation, making strides towards more practical and scalable training paradigms through innovative use of neural networks and classic random walk techniques.