L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation (2204.03206v1)

Published 7 Apr 2022 in cs.CV

Abstract: Mining precise class-aware attention maps, a.k.a, class activation maps, is essential for weakly supervised semantic segmentation. In this paper, we present L2G, a simple online local-to-global knowledge transfer framework for high-quality object attention mining. We observe that classification models can discover object regions with more details when replacing the input image with its local patches. Taking this into account, we first leverage a local classification network to extract attentions from multiple local patches randomly cropped from the input image. Then, we utilize a global network to learn complementary attention knowledge across multiple local attention maps online. Our framework conducts the global network to learn the captured rich object detail knowledge from a global view and thereby produces high-quality attention maps that can be directly used as pseudo annotations for semantic segmentation networks. Experiments show that our method attains 72.1% and 44.2% mIoU scores on the validation set of PASCAL VOC 2012 and MS COCO 2014, respectively, setting new state-of-the-art records. Code is available at https://github.com/PengtaoJiang/L2G.


Summary

The paper introduces a local-to-global transfer process that refines CAMs from localized patches to improve segmentation accuracy.
It employs pseudo-annotations and saliency-enforced shape transfer, enabling effective segmentation without dense labels.
Experiments on PASCAL VOC and MS COCO demonstrate significant mIoU gains, validating the framework’s enhanced object region detection.

Overview of "L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation"

The paper introduces L2G, a local-to-global knowledge transfer framework for weakly supervised semantic segmentation (WSSS). It extracts class activation maps (CAMs) from local patches of an image and uses them to improve the attention maps that later supervise segmentation. This targets the central difficulty of WSSS: locating and labeling object regions precisely when only image-level labels, rather than pixel-level annotations, are available.

The authors observe that classification models discover object regions in more detail when they are given local patches of an image rather than the whole image. This observation motivates the L2G framework, in which a local classification network extracts attention maps from multiple patches randomly cropped from the input. A global network then learns online from these local attention maps, combining their complementary detail into a single attention map for the full image.
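The transfer step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`crop`, `random_patch_boxes`, `toy_local_attention`, `transfer_loss`) are hypothetical, the "local network" is replaced by a toy function that rescales each patch by its own peak value, and a mean-squared-error objective is assumed for matching the global map to the local maps.

```python
import random


def crop(grid, top, left, size):
    """Return the size x size window of a 2D list starting at (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]


def random_patch_boxes(height, width, size, count, rng):
    """Sample `count` top-left corners of square patches that fit in the image."""
    return [(rng.randint(0, height - size), rng.randint(0, width - size))
            for _ in range(count)]


def toy_local_attention(patch):
    """Hypothetical stand-in for the local network: rescale a patch by its peak.

    Per-patch rescaling mimics how a patch-level map can highlight regions
    that look weak when the whole image is normalized at once.
    """
    peak = max(max(row) for row in patch) or 1.0
    return [[value / peak for value in row] for row in patch]


def transfer_loss(global_attention, local_attentions, boxes, size):
    """Assumed objective: mean squared error between each local map and the
    crop of the global map taken at the same location."""
    total, count = 0.0, 0
    for (top, left), local in zip(boxes, local_attentions):
        region = crop(global_attention, top, left, size)
        for g_row, l_row in zip(region, local):
            for g, l in zip(g_row, l_row):
                total += (g - l) ** 2
                count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 8x8 single-class "response map" standing in for an image.
    image = [[(x + y) / 14.0 for x in range(8)] for y in range(8)]
    boxes = random_patch_boxes(8, 8, size=4, count=3, rng=rng)
    local_maps = [toy_local_attention(crop(image, t, l, 4)) for t, l in boxes]
    print("transfer loss:", transfer_loss(image, local_maps, boxes, 4))
```

In a real training loop the loss would be back-propagated through the global network only, so that it gradually absorbs the detail found in the local maps.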

Key Methodological Insights

Local-to-Global Attention Transfer: The framework improves upon traditional CAM by using a local network to generate detailed attention maps from random local patches of an image. These maps inherently include more comprehensive object details. A global network then aggregates these local insights, refining the focus on object boundaries across the full image.
Pseudo-Annotations for Segmentation: The L2G framework produces attention maps that can be directly applied as pseudo-labels for training segmentation networks, a process that sidesteps the need for densely annotated labels.
Saliency-Enforced Shape Transfer: A supplementary component incorporates saliency-based shape information, ensuring that the generated attention maps discern object boundaries more accurately. Saliency maps from an off-the-shelf model serve to align attention regions more closely with actual object contours.
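To make the pseudo-annotation idea concrete, the sketch below turns per-class attention maps and a saliency map into a label map. It is an illustration under stated assumptions, not the procedure from the paper: the function name `pseudo_labels`, the two thresholds, and the rule of marking pixels as "ignore" when attention and saliency disagree are all hypothetical choices made for this example.

```python
def pseudo_labels(attention, saliency, present_classes,
                  sal_thresh=0.5, att_thresh=0.3,
                  background=0, ignore=255):
    """Build a pseudo label map from attention and saliency (illustrative rule).

    attention: dict mapping class id -> 2D list of scores in [0, 1]
    saliency:  2D list of scores in [0, 1] from an off-the-shelf model
    present_classes: class ids given by the image-level label
    """
    height, width = len(saliency), len(saliency[0])
    labels = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Pick the image-level class with the strongest attention here.
            best = max(present_classes, key=lambda c: attention[c][y][x])
            score = attention[best][y][x]
            salient = saliency[y][x] >= sal_thresh
            if salient and score >= att_thresh:
                labels[y][x] = best      # both cues agree on foreground
            elif salient or score >= att_thresh:
                labels[y][x] = ignore    # cues disagree: assumed to be skipped
            # otherwise the pixel stays background
    return labels


if __name__ == "__main__":
    attention = {1: [[0.9, 0.1], [0.2, 0.8]],
                 2: [[0.1, 0.7], [0.1, 0.1]]}
    saliency = [[1.0, 1.0], [0.0, 0.0]]
    for row in pseudo_labels(attention, saliency, [1, 2]):
        print(row)
```

Restricting the argmax to `present_classes` is what makes the image-level label useful here: classes absent from the image can never appear in the pseudo label map.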

Experimental Performance

The method was evaluated on the PASCAL VOC 2012 and MS COCO 2014 datasets. L2G reaches mean Intersection over Union (mIoU) scores of 72.1% on the PASCAL VOC 2012 validation set and 44.2% on the MS COCO 2014 validation set, which the authors report as new state-of-the-art results at the time of publication. These results suggest that the framework captures detailed object regions even when guided only by image-level labels.
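For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU is computed from a predicted and a ground-truth label map. The function name `mean_iou` and the ignore value of 255 are assumptions for this example; benchmark evaluations accumulate intersections and unions over the whole validation set before dividing, which this function also does if the rows of all images are passed in together.

```python
def mean_iou(pred, target, num_classes, ignore=255):
    """Mean Intersection over Union across classes that occur in pred or target.

    pred, target: 2D lists of integer class ids with identical shape.
    Pixels whose target equals `ignore` are excluded from the score.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if t == ignore:
                continue
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                union[t] += 1                 # missed ground-truth pixel
                if 0 <= p < num_classes:
                    union[p] += 1             # false positive for class p
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [[0, 1], [1, 1]]
    target = [[0, 1], [0, 1]]
    print(mean_iou(pred, target, num_classes=2))
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.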

Practical and Theoretical Implications

Practically, this framework offers a means of reducing the reliance on fully annotated datasets, enabling more scalable and cost-efficient training of semantic segmentation models. From a theoretical perspective, the insights into local-to-global knowledge transfer provide a fertile ground for exploration in adjacent fields, such as image classification and object detection under weak supervision.

The use of saliency maps during knowledge transfer also shows how auxiliary information can be folded into attention learning, suggesting further ways to refine semantic segmentation with such cues.

Speculation and Future Directions

Looking ahead, the L2G approach could plausibly be combined with architectures such as Vision Transformers, which capture long-range dependencies more naturally. Better saliency models could also improve boundary quality, a recurring limitation noted in the failure cases.

In conclusion, the L2G framework advances weakly supervised semantic segmentation by combining localized attention extraction with global knowledge consolidation, improving segmentation quality without pixel-level annotations. Future research could explore more advanced network architectures and additional sources of auxiliary information.


GitHub

GitHub - PengtaoJiang/L2G: The PyTorch Code for our CVPR 2022 paper "L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation" (57 stars)