Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations (1904.05044v3)

Published 10 Apr 2019 in cs.CV and cs.LG

Abstract: This paper presents a novel approach for learning instance segmentation with image-level class labels as supervision. Our approach generates pseudo instance segmentation labels of training images, which are used to train a fully supervised model. For generating the pseudo labels, we first identify confident seed areas of object classes from attention maps of an image classification model, and propagate them to discover the entire instance areas with accurate boundaries. To this end, we propose IRNet, which estimates rough areas of individual instances and detects boundaries between different object classes. It thus enables assigning instance labels to the seeds and propagating them within the boundaries so that the entire areas of instances can be estimated accurately. Furthermore, IRNet is trained with inter-pixel relations on the attention maps, thus no extra supervision is required. Our method with IRNet achieves outstanding performance on the PASCAL VOC 2012 dataset, surpassing not only the previous state of the art trained with the same level of supervision, but also some previous models relying on stronger supervision.

Citations (506)


Summary

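The paper introduces IRNet, which is combined with CAMs to generate pseudo instance segmentation labels; these labels then train a fully supervised instance segmentation model.
IRNet predicts a displacement field that separates individual instances and a class boundary map that delimits their extents, both learned from inter-pixel relations on CAMs without extra supervision; the resulting method outperforms prior state-of-the-art methods on PASCAL VOC 2012.
Because only image-level class labels are needed, the approach reduces annotation cost and scales more readily to new datasets, making it a practical option for weakly supervised instance segmentation.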


The paper under review advances weakly supervised instance segmentation, using image-level class labels as the only form of supervision. It generates pseudo instance segmentation labels for the training images, which are then used to train a fully supervised model. The cornerstone of the method is the Inter-pixel Relation Network (IRNet), which estimates the rough areas of individual instances and detects boundaries between different object classes.

Contributions and Approach

The method first uses class activation maps (CAMs) of an image classifier to identify confident seed areas of object classes. These seeds are then propagated, guided by IRNet's predictions, to cover entire instance areas with accurate boundaries. Unlike many existing methods, the approach requires neither additional supervision nor off-the-shelf segmentation proposals, because IRNet is trained from inter-pixel relations derived from the CAMs themselves.
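To make the seed step concrete, here is a minimal NumPy sketch. It assumes the standard CAM construction, in which a classifier's last convolutional features are globally pooled and fed to a linear layer, so each class's CAM is the classifier-weighted sum of feature channels. The function names and the thresholds `fg_thresh` and `bg_thresh` are hypothetical illustrations, not values taken from the paper. The seed map uses 0 for confident background, c + 1 for class c, and 255 for uncertain pixels that are ignored.

```python
import numpy as np

def class_activation_maps(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Compute max-normalized CAMs (standard CAM construction, assumed here).

    features: (C_feat, H, W) last-layer feature map of a classifier.
    weights:  (num_classes, C_feat) weights of the final linear classifier.
    Returns:  (num_classes, H, W) maps, ReLU-ed and scaled so each peak is 1.
    """
    cams = np.einsum("kc,chw->khw", weights, features)
    cams = np.maximum(cams, 0.0)
    peak = cams.reshape(cams.shape[0], -1).max(axis=1)
    return cams / np.maximum(peak, 1e-5)[:, None, None]

def confident_seeds(cams, present_classes, fg_thresh=0.3, bg_thresh=0.05, ignore=255):
    """Hypothetical seed labelling: 0 = background, c + 1 = class c, `ignore` = uncertain."""
    cams = cams[present_classes]          # only classes in the image-level label
    top = cams.max(axis=0)                # strongest class score per pixel
    arg = cams.argmax(axis=0)             # index into present_classes
    seeds = np.full(top.shape, ignore, dtype=np.int64)
    fg = top >= fg_thresh
    seeds[fg] = np.asarray(present_classes)[arg[fg]] + 1
    seeds[top < bg_thresh] = 0
    return seeds
```

Pixels between the two thresholds are left unlabeled on purpose: only confident regions should supply training signal, and the uncertain band is what the later propagation step is meant to fill in.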

Key highlights of the methodology include:

Pseudo Label Generation: The method synthesizes pseudo instance segmentation labels by combining CAMs with a class-agnostic instance map derived from IRNet, then propagating the resulting instance-wise scores within the predicted class boundaries. This identifies and localizes instances without relying on off-the-shelf segmentation proposals.
Inter-pixel Relation Network (IRNet): IRNet has two branches, one estimating a displacement vector field and one estimating a class boundary map. Each displacement vector points from a pixel toward the centroid of the instance it belongs to, so pixels whose vectors point to a common centroid are grouped into the same instance; this is what distinguishes instances of the same class.
Training with Inter-pixel Relations: IRNet is trained without ground-truth segmentation masks. Its supervision comes from relations between pairs of nearby pixels whose labels are confident in the CAMs: whether the two pixels belong to the same class (class equivalence), which trains the boundary branch, and, for same-class pairs, their relative displacement, which trains the displacement branch.
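The propagation used in pseudo label generation can be sketched as a random walk whose transition probabilities come from the boundary map. As we understand the paper, the affinity between two nearby pixels is high unless a strong boundary lies on the segment between them, the affinities are raised to a power beta and row-normalized, and the scores (with boundary pixels suppressed) are multiplied by the transition matrix repeatedly. The sketch below follows that general form for a single score map; `radius`, `beta`, `n_samples`, `steps`, and both function names are hypothetical choices, and the dense O((HW)^2) affinity matrix is only meant for tiny illustrative inputs.

```python
import numpy as np

def boundary_affinity(boundary, radius=2, beta=10.0, n_samples=5):
    """Dense (HW, HW) affinity: a_ij = (1 - max boundary on segment i->j) ** beta.

    Only pixel pairs within `radius` of each other get a nonzero affinity.
    """
    h, w = boundary.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    ts = np.linspace(0.0, 1.0, n_samples)
    aff = np.zeros((h * w, h * w))
    for i in range(h * w):
        for j in range(h * w):
            if np.linalg.norm(coords[i] - coords[j]) > radius:
                continue
            # Sample points on the segment (endpoints included) and round them to pixels.
            pts = np.rint(coords[i] + ts[:, None] * (coords[j] - coords[i])).astype(int)
            aff[i, j] = (1.0 - boundary[pts[:, 0], pts[:, 1]].max()) ** beta
    return aff

def random_walk(score, boundary, steps=16, **kw):
    """Propagate a score map within regions enclosed by the boundary map."""
    aff = boundary_affinity(boundary, **kw)
    # Row-normalize into a transition matrix; rows that are all zero stay zero.
    trans = aff / np.maximum(aff.sum(axis=1, keepdims=True), 1e-8)
    vec = (score * (1.0 - boundary)).ravel()  # suppress scores on boundary pixels
    for _ in range(steps):
        vec = trans @ vec                     # each pixel averages its connected neighbours
    return vec.reshape(score.shape)
```

On a 1×5 strip with a boundary at the middle pixel and a score only at the left end, the score spreads to its left neighbour but never crosses the boundary, which is the behaviour the propagation step relies on.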
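The idea of grouping pixels whose displacement vectors point to a common centroid can be illustrated with a deliberately simplified stand-in. Each foreground pixel casts a vote at its displaced position x + D(x); 8-connected clusters of voted cells are treated as centroids, and every pixel inherits the id of the cluster its vote landed in. The paper's own centroid detection differs in its details (for example, it refines the displacement field before locating centroids), so `group_by_displacement` is a hypothetical helper, not the authors' procedure.

```python
from collections import deque

import numpy as np

def group_by_displacement(disp: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """Return an instance map (0 = background, 1..K = instances).

    disp:    (2, H, W) displacement (dy, dx) from each pixel toward its instance centroid.
    fg_mask: (H, W) boolean mask of foreground pixels to group.
    """
    h, w = fg_mask.shape
    ys, xs = np.nonzero(fg_mask)
    # Rounded, clipped landing positions x + D(x) of every foreground pixel.
    ty = np.clip(np.rint(ys + disp[0, ys, xs]), 0, h - 1).astype(int)
    tx = np.clip(np.rint(xs + disp[1, ys, xs]), 0, w - 1).astype(int)
    votes = np.zeros((h, w), dtype=bool)
    votes[ty, tx] = True

    # Label 8-connected clusters of voted cells with a breadth-first search.
    comp = np.zeros((h, w), dtype=int)
    n_comp = 0
    for sy, sx in zip(*np.nonzero(votes)):
        if comp[sy, sx]:
            continue
        n_comp += 1
        comp[sy, sx] = n_comp
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and votes[ny, nx] and not comp[ny, nx]:
                        comp[ny, nx] = n_comp
                        queue.append((ny, nx))

    # Each pixel takes the id of the centroid cluster its vote landed in.
    instances = np.zeros((h, w), dtype=int)
    instances[ys, xs] = comp[ty, tx]
    return instances
```

For example, on a 1×6 strip where pixels 0 to 2 point at column 1 and pixels 3 to 5 point at column 4, the two vote clusters are separate and the strip splits into two instances, even though every pixel has the same class.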
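The displacement supervision follows from a short argument. If pixels i and j belong to the same instance with centroid c, then x_i + D(x_i) = c = x_j + D(x_j), so D(x_i) - D(x_j) = x_j - x_i: the difference of two predicted vectors must equal the known offset between the pixels, even though c itself is never observed. The sketch below enumerates nearby pairs of confidently labeled pixels, derives class-equivalence labels and displacement targets, and scores a displacement field with an L1 loss on same-class foreground pairs. It assumes, as the pair-based training does, that nearby same-class pixels usually lie in the same instance; the seed-label convention (0 = background, 255 = ignored) and all function names are hypothetical.

```python
import numpy as np

def sample_pairs(seeds: np.ndarray, radius: int, ignore: int = 255):
    """All ordered pixel pairs within `radius` whose seed labels are both confident."""
    h, w = seeds.shape
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if (dy, dx) != (0, 0) and dy * dy + dx * dx <= radius * radius]
    pairs = []
    for y in range(h):
        for x in range(w):
            if seeds[y, x] == ignore:
                continue
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and seeds[ny, nx] != ignore:
                    pairs.append(((y, x), (ny, nx)))
    return pairs

def relation_targets(seeds, pairs):
    """Class-equivalence labels for every pair; displacement targets x_j - x_i."""
    same = np.array([seeds[i] == seeds[j] for i, j in pairs])
    fg_same = np.array([seeds[i] == seeds[j] and seeds[i] != 0 for i, j in pairs])
    delta = np.array([np.subtract(j, i) for i, j in pairs], dtype=float)
    return same, fg_same, delta

def displacement_loss(disp, pairs, fg_same, delta):
    """Mean L1 error between D(x_i) - D(x_j) and x_j - x_i over same-class foreground pairs."""
    if not fg_same.any():
        return 0.0
    pred = np.array([disp[:, i[0], i[1]] - disp[:, j[0], j[1]] for i, j in pairs])
    return float(np.abs(pred - delta)[fg_same].mean())
```

The class-equivalence labels in `same` are the signal that would train the boundary branch (pairs of different classes should be separated by a boundary); that branch is not modelled here.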

Performance and Implications

The proposed framework achieved superior performance on the PASCAL VOC 2012 dataset, surpassing previous state-of-the-art methods relying on image-level supervision and even some methods trained with stronger supervision, such as bounding boxes. In particular, it reported clear gains in average precision for instance segmentation over prior methods using the same level of supervision.

The implications of this research are manifold:

Scalability: The ability to perform instance segmentation with only image-level labels greatly reduces the annotation burden, making it scalable for diverse datasets.
Enhanced Semantic Understanding: By improving segmentation quality through refined boundary estimation and class-agnostic instance maps, this work supports scene understanding at the level of individual object instances rather than only object classes.
Potential for Further Research: The insights gained from using inter-pixel relationships may inspire further exploration into more efficient weakly supervised learning techniques and their applications across different domains.

Future Directions

Looking forward, the principles outlined in this paper point toward vision tasks that can be learned with less explicit supervision. For instance, further refinement of how inter-pixel relations are estimated could benefit object detection models, even in cluttered or ambiguous visual environments. Moreover, adapting these techniques to video could let temporal coherence between frames complement the spatial inter-pixel relations used here.

In conclusion, this paper presents a substantive contribution toward weakly supervised instance segmentation, providing practical solutions to reduce annotation needs while maintaining high performance. The introduction of IRNet offers a promising direction for future advances and applications in the broader field of computer vision.
