Strip Pooling: Rethinking Spatial Pooling for Scene Parsing (2003.13328v1)

Published 30 Mar 2020 in cs.CV

Abstract: Spatial pooling has been proven highly effective in capturing long-range contextual information for pixel-wise prediction tasks, such as scene parsing. In this paper, beyond conventional spatial pooling that usually has a regular shape of NxN, we rethink the formulation of spatial pooling by introducing a new pooling strategy, called strip pooling, which considers a long but narrow kernel, i.e., 1xN or Nx1. Based on strip pooling, we further investigate spatial pooling architecture design by 1) introducing a new strip pooling module that enables backbone networks to efficiently model long-range dependencies, 2) presenting a novel building block with diverse spatial pooling as a core, and 3) systematically comparing the performance of the proposed strip pooling and conventional spatial pooling techniques. Both novel pooling-based designs are lightweight and can serve as an efficient plug-and-play module in existing scene parsing networks. Extensive experiments on popular benchmarks (e.g., ADE20K and Cityscapes) demonstrate that our simple approach establishes new state-of-the-art results. Code is made available at https://github.com/Andrew-Qibin/SPNet.

Authors (4)
  1. Qibin Hou (82 papers)
  2. Li Zhang (693 papers)
  3. Ming-Ming Cheng (185 papers)
  4. Jiashi Feng (295 papers)
Citations (449)

Summary

Strip Pooling: Rethinking Spatial Pooling for Scene Parsing

In the paper "Strip Pooling: Rethinking Spatial Pooling for Scene Parsing," the authors introduce a new methodology for scene parsing. Scene parsing, also known as semantic segmentation, is a core computer vision task that assigns a semantic label to every pixel in an image. The paper presents a new pooling approach, termed strip pooling, which employs strip-shaped kernels rather than conventional square ones to capture both global and local contextual information more efficiently.

Key Concepts and Methodology

The paper challenges traditional N×N spatial pooling by proposing strip pooling, which uses elongated 1×N or N×1 kernels. This design enables effective modeling of long-range dependencies while avoiding a drawback of large square kernels, which inevitably pool over irrelevant background regions.
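To make the kernel shapes concrete, the sketch below contrasts the two pooling directions on a toy feature map. This is a minimal NumPy illustration of the averaging pattern only (the function name `strip_pool` is mine, not from the paper's code):

```python
import numpy as np

def strip_pool(x, axis):
    # Average-pool a 2-D feature map along one full spatial axis:
    # axis=1 -> H x 1 kernel output (one value per row),
    # axis=0 -> 1 x W kernel output (one value per column).
    return x.mean(axis=axis, keepdims=True)

x = np.arange(12, dtype=float).reshape(3, 4)  # toy H=3, W=4 feature map
h_strip = strip_pool(x, axis=1)  # shape (3, 1): horizontal strips
v_strip = strip_pool(x, axis=0)  # shape (1, 4): vertical strips
```

Because each strip spans the entire row or column, a single pooled value summarizes context across the full image extent in that direction, which is what lets strip pooling capture long-range dependencies cheaply.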

The paper details several contributions, starting with the Strip Pooling Module (SPM). This module is strategically embedded within backbone networks to extend receptive fields, capturing global contexts while preserving local details. The SPM achieves this by executing pooling independently along horizontal and vertical axes, followed by feature modulation through one-dimensional convolutions.
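The SPM's data flow can be sketched as follows. This is a simplified single-channel NumPy version assuming the general pool-broadcast-gate pattern described above; it omits the learned 1-D convolutions that the actual module applies to each pooled vector, so it shows the data flow, not the trained module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def strip_pool_module(x):
    """Toy Strip Pooling Module for a single-channel map x of shape (H, W).

    The real SPM refines each pooled vector with a 1-D convolution before
    fusion; that learned step is omitted here for brevity.
    """
    h = x.mean(axis=1, keepdims=True)   # (H, 1) horizontal strip pooling
    v = x.mean(axis=0, keepdims=True)   # (1, W) vertical strip pooling
    fused = h + v                        # broadcast back to (H, W)
    gate = sigmoid(fused)                # attention-style modulation weights
    return x * gate                      # reweighted feature map

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
y = strip_pool_module(x)                 # same shape as the input
```

Note that each output position is modulated by the sum of its row and column summaries, so every pixel is influenced by the entire row and column it lies on.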

Moreover, the authors propose a Mixed Pooling Module (MPM). Unlike prior pyramid pooling modules, the MPM combines the novel strip pooling with conventional square pooling to capture contextual information at multiple scales, letting scene parsing networks gather informative representations across different spatial hierarchies.
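The mixing of square and strip contexts can be sketched as below. This toy NumPy version fuses the branches by plain averaging, whereas the paper's MPM uses learned convolutions for fusion; it only illustrates how local (square) and long-range (strip) contexts are combined:

```python
import numpy as np

def avg_pool_upsample(x, k):
    # Average-pool x (H, W) with a k x k window, then nearest-neighbour
    # upsample back to (H, W); assumes H and W are divisible by k.
    H, W = x.shape
    pooled = x.reshape(H // k, k, W // k, k).mean(axis=(1, 3))
    return pooled.repeat(k, axis=0).repeat(k, axis=1)

def mixed_pool(x):
    """Toy mixed pooling: average one square branch and two strip branches."""
    square = avg_pool_upsample(x, 2)                                 # local context
    h_strip = np.broadcast_to(x.mean(axis=1, keepdims=True), x.shape)  # row context
    v_strip = np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)  # column context
    return (square + h_strip + v_strip) / 3.0

x = np.arange(16, dtype=float).reshape(4, 4)
y = mixed_pool(x)   # same shape as the input, blended contexts
```

Each branch preserves the global mean of the feature map, so the fused output does too; the branches differ only in which neighborhoods they aggregate over.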

Performance Evaluation

The paper presents extensive experimental results across several benchmark datasets, including ADE20K, Cityscapes, and Pascal Context. The proposed architecture, termed SPNet, outperforms existing state-of-the-art models in mean Intersection over Union (mIoU) and pixel accuracy on these datasets:

  • On the ADE20K dataset, SPNet achieves an mIoU of 45.60% with a ResNet-101 backbone, surpassing previous methods.
  • On the Cityscapes test set, SPNet records an mIoU of 82.0%, demonstrating accurate parsing of urban scenes.
  • The gains carry over to the Pascal Context dataset, where SPNet attains an mIoU of 54.5%.

Implications and Future Directions

The introduction of strip pooling presents substantial practical implications for scene parsing. By efficiently leveraging structural dependencies along different spatial dimensions, strip pooling can potentially be integrated into other vision tasks, such as object detection or instance segmentation, for improved spatial contextualization. The methodology's lightweight and plug-and-play nature allows it to be straightforwardly incorporated into various existing architectures.

Future research may focus on refining the implementation of strip pooling, potentially exploring adaptive kernel shapes or improving computational efficiency. Moreover, extending this approach to other data modalities, such as 3D LiDAR scans or multispectral images, may open further avenues for interpreting complex scenes.

In summary, the paper delivers an innovative contribution to the field of scene parsing through strip pooling, advancing the understanding and capability of semantic segmentation networks. The adoption of such architectures signifies progression in model efficiency and accuracy in challenging and diverse environments.
