Conformer: Local Features Coupling Global Representations for Visual Recognition

Published 9 May 2021 in cs.CV | (2105.03889v1)

Abstract: Within a Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but have difficulty capturing global representations. Within a visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer is rooted in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating great potential as a general backbone network. Code is available at https://github.com/pengzhiliang/Conformer.


Authors (7)

Citations (469)


Summary

The paper presents Conformer, a hybrid model that combines CNNs for local feature extraction with transformers for global representation learning.
It introduces a dual-branch architecture linked by Feature Coupling Units to harmonize multi-resolution local and global features.
Experimental results show Conformer outperforms DeiT-B by 2.3% on ImageNet and ResNet-101 by 3.7% and 3.6% mAP on MSCOCO object detection and instance segmentation, respectively.


The paper introduces "Conformer," a hybrid network architecture that integrates Convolutional Neural Networks (CNNs) with visual transformers to enhance representation learning for visual recognition tasks. The proposed model addresses two complementary weaknesses: CNNs struggle to capture global representations, and transformers tend to lose local feature details. The authors position it as a general-purpose backbone for visual recognition.

Overview

CNNs have been pivotal in computer vision, excelling in extracting local features but struggling with capturing global contextual information. Conversely, visual transformers have exhibited strength in global representation through self-attention mechanisms but tend to undermine local feature fidelity. The Conformer model synergizes these complementary strengths using a dual-branch structure composed of a CNN branch and a transformer branch. This dual structure is interconnected by the Feature Coupling Unit (FCU), which facilitates the fusion of local and global features across different resolutions.

Methodology

Network Structure:

The CNN branch is designed following a feature pyramid structure, akin to ResNet, which incrementally reduces spatial resolution while increasing channel depth. This branch excels in maintaining local details.
The transformer branch follows a ViT-like design, using self-attention to form global representations from non-overlapping patch embeddings extracted from the input image.
FCUs connect the two branches. They reconcile the dimensional and semantic mismatch between local feature maps and global patch embeddings through convolution, down/up sampling, and normalization.
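The shape alignment an FCU has to perform can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names are hypothetical, average pooling and nearest-neighbour upsampling are assumed choices for the down/up sampling steps, and normalization and class-token handling are omitted for brevity. Feature maps are nested lists indexed [channel][row][col], and patch embeddings are lists indexed [token][dim].

```python
# Hypothetical sketch of FCU-style shape alignment (not the paper's code).
# Assumptions: avg pooling for downsampling, nearest-neighbour for upsampling,
# no normalization layers, no class token.

def project_channels(fmap, weight):
    """1x1 convolution: mix channels at each pixel; weight is [out_ch][in_ch]."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(weight[o][i] * fmap[i][r][c] for i in range(len(fmap)))
          for c in range(w)] for r in range(h)]
        for o in range(len(weight))
    ]

def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling applied to every channel."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(ch[r * k + dr][c * k + dc] for dr in range(k) for dc in range(k)) / (k * k)
          for c in range(w // k)] for r in range(h // k)]
        for ch in fmap
    ]

def upsample_nearest(fmap, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return [
        [[row[c // k] for c in range(len(row) * k)] for row in ch for _ in range(k)]
        for ch in fmap
    ]

def map_to_tokens(fmap):
    """Flatten a [C][H][W] map into H*W tokens of dimension C (row-major)."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[ch][r][col] for ch in range(c)] for r in range(h) for col in range(w)]

def tokens_to_map(tokens, h, w):
    """Reshape H*W tokens of dimension C back into a [C][H][W] map."""
    c = len(tokens[0])
    return [[[tokens[r * w + col][ch] for col in range(w)] for r in range(h)]
            for ch in range(c)]

def cnn_to_tokens(fmap, weight, k):
    """CNN -> transformer: align channels, shrink resolution, flatten."""
    return map_to_tokens(avg_pool(project_channels(fmap, weight), k))

def tokens_to_cnn(tokens, h, w, weight, k):
    """Transformer -> CNN: reshape, enlarge resolution, align channels."""
    return project_channels(upsample_nearest(tokens_to_map(tokens, h, w), k), weight)

# Toy example: a 2-channel 4x4 feature map and a 3-dimensional token space.
fmap = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 2 channels -> 3 dims
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # 3 dims -> 2 channels

tokens = cnn_to_tokens(fmap, w_in, k=2)        # 4 tokens, each of dimension 3
restored = tokens_to_cnn(tokens, 2, 2, w_out, k=2)
print(len(tokens), len(tokens[0]))
print(len(restored), len(restored[0]), len(restored[0][0]))
```

In this toy case the round trip recovers the original map only because the input is constant per channel and the projection weights were chosen to invert each other. In a trained network the projections are learned and the two directions are not inverses.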

Learning and Inference:

During training, each branch has its own classifier, and each classifier is supervised by a separate cross-entropy loss, so the CNN branch learns local features while the transformer branch learns global representations. At inference, the predictions of the two branches are combined to produce the final output.
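The two-headed training and inference scheme can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the released code: the function names are hypothetical, the two loss terms are assumed to be weighted equally, and the branch outputs are assumed to be fused by summing their logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])

def training_loss(cnn_logits, trans_logits, target):
    # One cross-entropy term per branch; equal weighting is an assumption.
    return cross_entropy(cnn_logits, target) + cross_entropy(trans_logits, target)

def predict(cnn_logits, trans_logits):
    # Assumed fusion rule: sum the two branches' logits, then take the argmax.
    fused = [a + b for a, b in zip(cnn_logits, trans_logits)]
    return max(range(len(fused)), key=fused.__getitem__)

# Toy logits for a 3-class problem (made-up numbers for illustration only).
cnn_logits = [2.0, 1.0, 0.1]     # CNN head alone would pick class 0
trans_logits = [0.5, 2.5, 0.3]   # transformer head alone would pick class 1

loss = training_loss(cnn_logits, trans_logits, target=1)
label = predict(cnn_logits, trans_logits)
print(loss, label)
```

With these toy numbers the heads disagree, and the fused prediction follows the more confident transformer head. It shows how a combined output can differ from either branch taken alone.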

Experimental Results

Conformer demonstrates superior performance across different benchmarks. Notably, it achieves a 2.3% improvement over DeiT-B on ImageNet and outperforms ResNet-101 on MSCOCO by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively. These results, achieved under similar parameter complexity, underscore its effectiveness as a general-purpose backbone network.

Implications and Future Directions

This research matters for the evolution of hybrid model architectures in machine learning. By balancing local and global feature processing, Conformer offers a template for future neural network designs, with potential impact on areas such as data-efficient learning and real-time image processing. Future work could extend the Conformer framework to other domains such as natural language processing, where a similar local-global dichotomy exists.

The research posits a compelling argument for the integration of CNNs and transformers, highlighting enhancements in both convergence speed and generalization capability, particularly in invariance to image transformations. Future developments could involve exploring different configurations and depths of interaction between the two branches to further amplify performance without proportionally increasing computational costs.

Conformer exemplifies a sophisticated approach to melding distinct neural architectural strengths, driving forward the capabilities and applications of visual recognition systems. It provides a robust backbone for tackling increasingly complex visual tasks, signifying a meaningful convergence of CNN and transformer paradigms.
