Expectation-Maximization Attention Networks for Semantic Segmentation

Published 31 Jul 2019 in cs.CV | (1907.13426v2)

Abstract: The self-attention mechanism has been widely used for various tasks. It is designed to compute the representation of each position by a weighted sum of the features at all positions. Thus, it can capture long-range relations for computer vision tasks. However, it is computationally expensive, since the attention maps are computed w.r.t. all other positions. In this paper, we formulate the attention mechanism into an expectation-maximization manner and iteratively estimate a much more compact set of bases upon which the attention maps are computed. By a weighted summation upon these bases, the resulting representation is low-rank and deprecates noisy information from the input. The proposed Expectation-Maximization Attention (EMA) module is robust to the variance of input and is also friendly in memory and computation. Moreover, we set up the bases maintenance and normalization methods to stabilize its training procedure. We conduct extensive experiments on popular semantic segmentation benchmarks including PASCAL VOC, PASCAL Context and COCO Stuff, on which we set new records.

Authors (6)

Citations (513)

Summary

The paper presents an EM-based attention module that reformulates self-attention into iterative EM steps to reduce computational and memory demands.
It employs a lightweight EMA Unit integrated within CNNs, achieving improved mIoU scores and setting new performance records on benchmarks like PASCAL VOC and COCO Stuff.
The method offers practical benefits for resource-constrained settings, inspiring future research in efficient attention mechanisms for real-time computer vision applications.

Expectation-Maximization Attention Networks for Semantic Segmentation

The paper "Expectation-Maximization Attention Networks for Semantic Segmentation" introduces a novel approach to the self-attention mechanism, which has been instrumental in capturing long-range dependencies in computer vision tasks. The authors propose the Expectation-Maximization Attention (EMA) module, which reformulates the traditional attention mechanism into an expectation-maximization (EM) framework. This reformulation aims to address the computational inefficiency and high memory demands of existing attention mechanisms.

Key Contributions

EM-Based Reformulation: The EMA module leverages the EM algorithm to iteratively estimate a compact set of bases for computing attention maps. This approach reduces the rank of resulting representations and efficiently denoises input data. By doing so, it manages to maintain robustness against input variability while optimizing for memory and computational resources.
Efficient Architecture: The EMA module is embedded into a modular neural network unit called the EMA Unit (EMAU). Compared to existing attention architectures, this unit is lightweight and integrates readily into existing convolutional neural networks (CNNs). Notably, the EMAU is constructed from common operators, which keeps it simple to implement.
Empirical Validation: The authors demonstrate the effectiveness of the EMA module through extensive experiments on established semantic segmentation benchmarks such as PASCAL VOC, PASCAL Context, and COCO Stuff. The results set new performance records on these datasets, showcasing its superiority over current state-of-the-art methods.
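
To make the memory argument behind the EM-based reformulation concrete, the sketch below counts attention-map entries for full self-attention (one weight per pair of positions) and for attention over a compact set of bases (one weight per position-basis pair). The 65 x 65 feature map and the 64 bases are illustrative assumptions, not figures taken from this summary, and the function name is hypothetical.

```python
def attention_map_entries(height, width, num_bases):
    """Return (full, compact) attention-map sizes for an H x W feature map.

    full    = N * N  entries: every position attends to every position.
    compact = N * K  entries: every position attends to K bases only.
    """
    n = height * width              # N: number of spatial positions
    full = n * n                    # self-attention map
    compact = n * num_bases         # bases-based attention map
    return full, compact


# Illustrative setting (assumed, not from the summary): 65 x 65 map, 64 bases.
full, compact = attention_map_entries(65, 65, 64)
print(f"full: {full:,}  compact: {compact:,}  ratio: {full / compact:.1f}x")
```

Under these assumed sizes the compact map is smaller by a factor of N / K, roughly 66 times, and the saving grows with resolution because N scales with the image area while K stays fixed.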

Technical Overview

The EMA module operates through iterative processes that mimic the EM algorithm. It consists of three primary operations:

Responsibility Estimation (E step): Computes the responsibility map $\mathbf{Z}$ for each pixel with respect to a compact set of bases $\bm{\mu}$.
Likelihood Maximization (M step): Updates the bases to maximize the likelihood of the complete data, ensuring the bases remain representative of significant semantic concepts while minimizing noise.
Data Re-estimation: After convergence, the module reconstructs the input data as a weighted sum of the learned bases, producing a low-rank, denoised output.
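
The three operations above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: responsibilities are taken as a softmax over feature-basis inner products, the M step is a responsibility-weighted mean of the features, and the bases maintenance and normalization mentioned in the abstract are omitted. The names `ema_attention`, `num_iters`, and `temperature` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ema_attention(X, mu, num_iters=3, temperature=1.0):
    """Sketch of EM attention.

    X  : (N, C) array, one C-dimensional feature per pixel.
    mu : (K, C) array, initial bases with K much smaller than N.
    Returns the re-estimated features (N, C), responsibilities (N, K),
    and updated bases (K, C).
    """
    for _ in range(num_iters):
        # E step: responsibility of each basis for each pixel, rows sum to 1.
        Z = softmax(temperature * (X @ mu.T), axis=1)
        # M step: each basis becomes the responsibility-weighted mean
        # of the pixel features (small epsilon guards against division by 0).
        mu = (Z.T @ X) / (Z.sum(axis=0)[:, None] + 1e-6)
    # Data re-estimation: every pixel is rebuilt as a weighted sum of the
    # K bases, so the output has rank at most K.
    X_hat = Z @ mu
    return X_hat, Z, mu


# Usage with random data (shapes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))      # 50 pixels, 8 channels
mu0 = rng.normal(size=(4, 8))     # 4 initial bases
X_hat, Z, mu = ema_attention(X, mu0)
print(X_hat.shape, Z.shape, np.linalg.matrix_rank(X_hat))
```

Because the output is the product of an N x K matrix and a K x C matrix, its rank cannot exceed K, which is the low-rank property the re-estimation step relies on.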

Performance and Comparisons

EMANet, the architecture incorporating the EMAU, was shown to outperform contemporary models such as DeepLabv3 and PSANet while requiring significantly less computation and memory. Detailed evaluations on the PASCAL VOC test set demonstrated a notable improvement in mean Intersection over Union (mIoU). The paper further highlights the adaptability of the proposed method across datasets, reporting leading performance on the PASCAL Context and COCO Stuff benchmarks.
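
For readers unfamiliar with the metric, mIoU averages the per-class ratio of intersection to union between predicted and ground-truth label maps. The sketch below computes it for a single pair of label arrays; the function name is hypothetical, and benchmark evaluations typically accumulate counts over the whole dataset rather than per image, which this simplified version does not do.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for one prediction/ground-truth pair.

    pred, target : integer arrays of the same shape holding class labels.
    Classes absent from both arrays are skipped rather than counted as 0.
    """
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class c does not appear in either array
        intersection = np.logical_and(p, t).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))


# Tiny example: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
print(mean_iou(pred, target, num_classes=2))
```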

Implications and Future Directions

The integration of EM iterations into attention mechanisms presents a promising direction for enhancing the efficiency and effectiveness of semantic segmentation tasks. This work opens potential avenues for further research, particularly in refining the compactness of attention mechanisms and exploring analogous applications in other domains, such as natural language processing and broader areas of computer vision.

Given the consistent performance improvements across various benchmarks, the EMA approach from this paper may inspire the development of more resource-efficient models that maintain high accuracy, potentially driving innovation in real-time applications and edge computing scenarios where computational resources are limited.
