Mobile-Former: Bridging MobileNet and Transformer

Published 12 Aug 2021 in cs.CV and cs.LG | (2108.05895v3)

Abstract: We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and transformer at global interaction. And the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformer, the transformer in Mobile-Former contains very few tokens (e.g. 6 or fewer tokens) that are randomly initialized to learn global priors, resulting in low computational cost. Combining with the proposed light-weight cross attention to model the bridge, Mobile-Former is not only computationally efficient, but also has more representation power. It outperforms MobileNetV3 at low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but saving 17% of computations. When transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in RetinaNet framework. Furthermore, we build an efficient end-to-end detector by replacing backbone, encoder and decoder in DETR with Mobile-Former, which outperforms DETR by 1.1 AP but saves 52% of computational cost and 36% of parameters.


Authors (7)

Citations (408)


Summary

The paper introduces Mobile-Former, a novel architecture that fuses the efficient local processing of MobileNet with the global representation capabilities of Transformers using a lightweight two-way bridge.
It decouples local and global feature extraction into parallel tracks, reaching 77.9% top-1 accuracy on ImageNet at 294M FLOPs, which is 1.3% higher than MobileNetV3 with 17% less computation.
In object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in the RetinaNet framework; as a replacement for the backbone, encoder, and decoder of DETR, it gains 1.1 AP while saving 52% of the computation and 36% of the parameters.

An Examination of Mobile-Former: Integrating MobileNet with Transformers

The paper "Mobile-Former: Bridging MobileNet and Transformer" presents a novel neural network architecture that addresses the trade-offs between computational efficiency and performance in vision-related tasks. The architecture, dubbed "Mobile-Former," synthesizes the local processing strength of MobileNet with the global representation capacity of Transformers. This is achieved through a parallel design structure that connects the two components with a two-way bridge, facilitating efficient bidirectional feature exchange.

Architecture and Design

At the core of Mobile-Former is a parallel framework that decouples local and global feature processing into two distinct tracks, MobileNet and Transformer. MobileNet processes image data locally through its efficient depthwise and pointwise convolutions, while the Transformer operates on a very small set of global tokens (6 or fewer, randomly initialized) to capture global interactions. This design diverges from conventional vision transformers, whose computation costs are typically higher because they attend over large token sets derived from image patches.
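To make the token-count argument concrete, the short sketch below compares how many token pairs a single self-attention layer has to score under patch-based tokenization versus a handful of global tokens. The 224x224 input and 16x16 patch size are illustrative assumptions (a common ViT-style configuration), not figures taken from the paper; only the 6-token count comes from the abstract.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(num_tokens: int) -> int:
    """Self-attention scores every token against every token: n * n pairs."""
    return num_tokens * num_tokens


# Assumed ViT-style setup (illustrative, not from the paper).
patch_tokens = num_patch_tokens(image_size=224, patch_size=16)
# Token count quoted in the abstract ("6 or fewer tokens").
global_tokens = 6

patch_pairs = attention_pairs(patch_tokens)
global_pairs = attention_pairs(global_tokens)

print(f"patch tokens:  {patch_tokens:>4}, attention pairs: {patch_pairs}")
print(f"global tokens: {global_tokens:>4}, attention pairs: {global_pairs}")
print(f"ratio: about {patch_pairs / global_pairs:.0f}x fewer pairs")
```

Because the pair count grows quadratically with the number of tokens, keeping the token set tiny is what makes the Transformer track cheap enough to run alongside MobileNet.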

A key innovation of Mobile-Former is its two-way bridge, implemented with a lightweight cross-attention mechanism. The bridge passes local features to the global tokens and global context back to the local feature map, so the two pathways enhance each other with minimal computational overhead. Two choices keep the bridge cheap: the key and value projection matrices are eliminated on the MobileNet side, and the bridge is placed at the bottleneck layers. Together they yield significant FLOP savings while still increasing representation power.
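The following is a minimal, single-head sketch of one direction of such a bridge (local features into global tokens), written in plain Python for readability. It is not the authors' implementation: the function and weight names (`mobile_to_former`, `w_q`, `w_o`), the scaling by the square root of the channel count, the residual update, and the toy sizes are all assumptions made for illustration. The one property it is meant to show is that the local features serve directly as keys and values, with no projection matrices applied to them.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mobile_to_former(tokens, features, w_q, w_o):
    """Hypothetical single-head cross-attention from local features to tokens.

    tokens:   M x d global tokens (queries after projection by w_q, d x C)
    features: N x C flattened local features, used as keys and values
              as-is, i.e. without key/value projection matrices
    w_o:      C x d output projection back to the token width
    """
    channels = len(features[0])
    queries = matmul(tokens, w_q)                    # M x C
    keys_t = [list(col) for col in zip(*features)]   # C x N
    scores = matmul(queries, keys_t)                 # M x N
    scale = math.sqrt(channels)
    weights = [softmax([s / scale for s in row]) for row in scores]
    pooled = matmul(weights, features)               # M x C
    update = matmul(pooled, w_o)                     # M x d
    # Assumed residual update: tokens are refined rather than replaced.
    new_tokens = [
        [t + u for t, u in zip(t_row, u_row)]
        for t_row, u_row in zip(tokens, update)
    ]
    return new_tokens, weights


def random_matrix(rows, cols, rng):
    """Small random matrix for the toy example."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions): 6 tokens of width 8, a 7x7 feature map with 4 channels.
rng = random.Random(0)
M, d, N, C = 6, 8, 49, 4
tokens = random_matrix(M, d, rng)
features = random_matrix(N, C, rng)
w_q = random_matrix(d, C, rng)
w_o = random_matrix(C, d, rng)

new_tokens, weights = mobile_to_former(tokens, features, w_q, w_o)
print(f"tokens: {len(new_tokens)} x {len(new_tokens[0])}")
print(f"attention map: {len(weights)} x {len(weights[0])}")
```

In this sketch the attention map has only M x N entries, so the cost of the bridge grows linearly with the number of spatial positions, and the only learned matrices act on the M tokens rather than on all N positions.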

Numerical Performance and Claims

The empirical assessment of Mobile-Former spans FLOP budgets from 25M to 500M. On ImageNet classification, Mobile-Former outperforms MobileNetV3 across this range; for instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, a gain of 1.3% over MobileNetV3 with 17% less computation. In the RetinaNet object detection framework, Mobile-Former outperforms MobileNetV3 by 8.6 AP, highlighting its potential in real-world applications.
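As a quick sanity check on the ImageNet figures, the snippet below back-computes the MobileNetV3 baseline implied by the reported numbers. The derived values are back-of-envelope estimates from rounded percentages, not figures quoted in this summary, and the helper name `implied_baseline_flops` is introduced here purely for illustration.

```python
def implied_baseline_flops(model_flops_m: float, saving_fraction: float) -> float:
    """Baseline cost implied by 'the model saves saving_fraction of the baseline'."""
    if not 0.0 <= saving_fraction < 1.0:
        raise ValueError("saving_fraction must be in [0, 1)")
    return model_flops_m / (1.0 - saving_fraction)


# Figures reported for Mobile-Former in the abstract.
mobile_former_flops_m = 294.0   # millions of FLOPs
mobile_former_top1 = 77.9       # percent
flops_saving = 0.17             # 17% fewer computations than MobileNetV3
top1_gain = 1.3                 # percentage points over MobileNetV3

# Derived estimates for the MobileNetV3 comparison point (approximate).
baseline_flops_m = implied_baseline_flops(mobile_former_flops_m, flops_saving)
baseline_top1 = mobile_former_top1 - top1_gain

print(f"implied MobileNetV3 cost:  ~{baseline_flops_m:.0f}M FLOPs")
print(f"implied MobileNetV3 top-1: ~{baseline_top1:.1f}%")
```

Reading the two claims together this way makes clear that the comparison is against a somewhat larger MobileNetV3 variant, so Mobile-Former is both cheaper and more accurate at that operating point.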

Additionally, replacing the backbone, encoder, and decoder of DETR with Mobile-Former yields an efficient end-to-end detector. It outperforms DETR by 1.1 AP while saving 52% of the computational cost and 36% of the parameters.

Theoretical and Practical Implications

Mobile-Former offers an efficient way to use transformers in settings that strict computational constraints had previously reserved for efficient CNNs. The design suggests that local feature processing and global interaction modeling can be separated within a modular network, allowing a customizable balance between performance and efficiency. This opens avenues for applying similar architectures on mobile and edge devices, where computational resources are at a premium.

From a theoretical perspective, the architecture invites further exploration of network designs that emphasize modularity and parallelism, and it adds to the debate between architectural purity and hybridization in AI model design.

Prospective Developments

Future work may focus on refining the components of the two-way bridge, optimizing implementations to further improve computational efficiency, or extending the approach to a broader range of visual tasks. Varying the number of global tokens or adjusting the tensor dimensions may also improve performance without degrading inference speed.

Through its hybrid design, Mobile-Former combines the efficiency of local feature extractors with the representation power of transformers and points toward new possibilities in the evolution of AI model architecture. The paper is a promising step in showing how disparate model architectures can be merged to deliver practical benefits rather than remaining theoretical constructs.
