Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration (2207.10447v2)

Published 21 Jul 2022 in cs.CV

Abstract: Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Recent studies leverage the advantage of self-attention in visual Transformer for long-range dependency to re-activate semantic regions, aiming to avoid partial activation in traditional class activation mapping (CAM). However, the long-range modeling in Transformer neglects the inherent spatial coherence of the object, and it usually diffuses the semantic-aware regions far from the object boundary, making localization results significantly larger or far smaller than the object. To address this issue, we introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, incorporating semantic similarities of patch tokens and their spatial relationships into a unified diffusion model. Specifically, we introduce a learnable parameter to dynamically adjust the semantic correlations and spatial context intensities for effective information propagation. In practice, SCM is designed as an external module of Transformer, and can be removed during inference to reduce the computation cost. The object-sensitive localization ability is implicitly embedded into the Transformer encoder through optimization in the training phase. It enables the generated attention maps to capture sharper object boundaries and filter out object-irrelevant background areas. Extensive experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms its counterpart TS-CAM on both CUB-200 and ImageNet-1K benchmarks. The code is available at https://github.com/164140757/SCM.

Summary

The paper augments a vision transformer with a Spatial Calibration Module (SCM) that refines object localization learned from image-level labels alone.
The module's Activation Diffusion Block propagates activations across spatially and semantically related patches, improving spatial coherence and boundary precision; because the module is removed after training, it adds no inference cost.
Experiments on CUB-200 and ImageNet-1K show gains over the TS-CAM baseline, including an 8.9% improvement in GT-known localization on CUB-200.

Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

The paper presents an approach to Weakly Supervised Object Localization (WSOL) that augments a transformer architecture with a Spatial Calibration Module (SCM). It addresses a central difficulty in WSOL: localizing objects accurately when only image-level labels are available. Traditional methods rely on class activation mapping (CAM), which tends to activate only the most discriminative parts of an object and therefore yields incomplete localization. Transformers can mitigate this through self-attention, which models long-range dependencies, but that same long-range modeling neglects the spatial coherence of the object, so the activated region often spreads past the object boundary or shrinks well inside it.

Core Proposal

The authors propose an SCM that improves localization precision by embedding spatial coherence information into the transformer's attention maps. The module uses a diffusion model to propagate object activation across spatially and semantically related areas of the image. Notably, the SCM is an external module that is used only during training: the calibration it provides is absorbed into the transformer encoder through optimization, and the module is then removed, so it introduces no computational overhead at inference.

Methodology

Spatial Calibration Module (SCM): The SCM refines both the semantic maps and the attention maps generated by the transformer. It uses a diffusion process that combines spatial coherence with semantic similarity across image patches to estimate object boundaries more accurately.
Activation Diffusion Block (ADB): The core of the SCM is the ADB, which applies the Newton-Schulz iteration to approximate the inverse of a Laplacian matrix. This matrix encodes the spatial and semantic connections between patches, so the resulting propagation of activation accounts for both spatial relations and semantic coherence.
Dynamic Filtering: Within each ADB, dynamic filtering preserves significant object-related activations while suppressing noise, using a learnable parameter that adjusts the threshold during the diffusion operations.
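
To make the role of the Newton-Schulz iteration concrete, the sketch below approximates a matrix inverse with the update X <- X(2I - MX) and uses it to diffuse a seed activation over a tiny 2x2 patch grid. It is an illustrative assumption throughout, not the paper's formulation: the matrix M = I + lam * L (a generic screened graph Laplacian), the cosine-similarity edge weights, the 4-neighbour adjacency, and the names `diffusion_matrix`, `newton_schulz_inverse`, `tokens`, and `lam` are all hypothetical, and the learnable parameter and dynamic filtering are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def newton_schulz_inverse(m, iterations=20):
    """Approximate the inverse of a nonsingular square matrix m."""
    n = len(m)
    # Starting guess X0 = M^T / (||M||_1 * ||M||_inf), a standard choice
    # that makes the iteration converge for any nonsingular matrix.
    norm_1 = max(sum(abs(m[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in m)
    scale = 1.0 / (norm_1 * norm_inf)
    x = [[m[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        mx = matmul(m, x)
        # Newton-Schulz update: X <- X (2I - M X)
        correction = [[(2.0 if i == j else 0.0) - mx[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def diffusion_matrix(tokens, grid_w, lam=1.0):
    """Build M = I + lam * (D - W) over a patch grid (assumed form).

    W[i][j] is the non-negative cosine similarity of patch tokens i and j
    when the patches are 4-neighbours on the grid, and 0 otherwise.
    """
    n = len(tokens)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ri, ci = divmod(i, grid_w)
            rj, cj = divmod(j, grid_w)
            if abs(ri - rj) + abs(ci - cj) == 1:
                w[i][j] = max(0.0, cosine(tokens[i], tokens[j]))
    return [[(1.0 if i == j else 0.0)
             + lam * ((sum(w[i]) if i == j else 0.0) - w[i][j])
             for j in range(n)] for i in range(n)]


# Toy 2x2 grid: patches 0 and 1 look alike, patches 2 and 3 look alike.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
seed = [1.0, 0.0, 0.0, 0.0]  # initial activation on patch 0 only

m = diffusion_matrix(tokens, grid_w=2)
m_inv = newton_schulz_inverse(m)
diffused = [sum(m_inv[i][j] * seed[j] for j in range(len(seed)))
            for i in range(len(seed))]
print(diffused)
```

In this toy setup the activation seeded on patch 0 flows mostly to patch 1, which is both adjacent and semantically similar, and only weakly to patch 2, which is adjacent but dissimilar. The iteration uses only matrix products, which is the usual reason to prefer it over an explicit inverse inside a trainable network.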

Evaluation and Results

Combined with vision transformers, SCM yields substantial improvements on the standard WSOL benchmarks CUB-200 and ImageNet-1K. Localization accuracy improves markedly over the baseline method, TS-CAM, with a reported gain of 8.9% in GT-known localization on CUB-200. Because the module is removed after training, these improvements come at no extra computational cost at inference, which makes SCM attractive for real-world applications that need fine object delineation from coarse labels.
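
For readers unfamiliar with the GT-known metric, the sketch below shows how it is commonly computed in WSOL work: the class label is taken as given, and a predicted box counts as correct when its intersection over union (IoU) with the ground-truth box reaches a threshold, usually 0.5. The function names, the `(x1, y1, x2, y2)` box format, and the example boxes are assumptions made for illustration and are not taken from the paper's code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def gt_known_loc(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of images whose predicted box overlaps the ground truth enough.

    The threshold of 0.5 follows the usual convention and is an assumption here.
    """
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)


# Hypothetical predictions for three images that share the same ground-truth box.
gt = [(0, 0, 10, 10)] * 3
pred = [(0, 0, 10, 10), (5, 0, 15, 10), (1, 1, 10, 10)]
accuracy = gt_known_loc(pred, gt)
print(accuracy)
```

Because the class is assumed known, this metric isolates localization quality from classification accuracy.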

Implications and Future Directions

By enhancing the transformer architecture with a principled model for implicit spatial calibration, the work significantly broadens the application scope of transformers in WSOL tasks, traditionally dominated by CNN-based methods. This approach leverages the nuanced capabilities of transformers while simultaneously addressing their limitations in spatial coherence modeling.

In principle, the method could be explored in image-to-image translation tasks where retaining spatial detail is critical. In practice, because SCM adds nothing at inference, models trained with it are well suited to computationally constrained environments such as mobile or embedded systems.

Future work could automate the tuning of the ADB's iterative process, for example through adaptive mechanisms that learn the number of diffusion iterations or the diffusion parameters, which would make the module more versatile and efficient. Extending the approach to more complex scenes and multi-object environments, possibly with inter-object relational modeling, could also broaden the settings in which WSOL systems can be deployed.