- The paper introduces a hierarchical rearrangement strategy that integrates both inner-region and cross-region token mixing to enhance local and global feature extraction.
- The methodology leverages token reshaping and cyclic shifts, achieving impressive results such as 83.8% top-1 accuracy on ImageNet and superior object detection and segmentation metrics.
- The design eliminates heavy self-attention operations, offering an efficient balance between performance and computational cost across various vision tasks.
Hire-MLP: Vision MLP via Hierarchical Rearrangement
The paper presents Hire-MLP, a vision MLP architecture designed to address limitations inherent in existing models like MLP-Mixer and ResMLP. Those models treat an image as a sequence of linearly flattened patches, which limits their flexibility and their ability to capture spatial structure. Hire-MLP, by contrast, incorporates a hierarchical rearrangement strategy to harness both local and global spatial information effectively.
Hierarchical Rearrangement in Hire-MLP
The hierarchical rearrangement consists of two distinct methods: inner-region rearrangement and cross-region rearrangement. The inner-region rearrangement targets local features by reshaping and redistributing tokens within a given spatial area. It is implemented through rearrange and restore operations that allow localized token mixing via channel-mixing MLPs. The cross-region rearrangement extends these capabilities by realigning tokens across different spatial areas through cyclic shifts. This operation captures global context and inter-region token relationships, thus enhancing the model's understanding of large-scale structures.
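The two operations can be illustrated concretely. Below is a minimal NumPy sketch of the rearrange/restore and cyclic-shift steps described above, with regions taken along the height axis. The function names, the unbatched (H, W, C) layout, and the region/step values are illustrative assumptions on my part; the actual Hire-MLP implementation works on batched tensors and inserts a learned channel-mixing MLP between the rearrange and restore steps.

```python
import numpy as np

def inner_region_rearrange(x, region):
    """Fold `region` consecutive rows into the channel dimension so a
    channel-mixing MLP can mix tokens within each local region.
    x: (H, W, C) feature map; H must be divisible by `region`."""
    H, W, C = x.shape
    # (H, W, C) -> (H//region, region, W, C) -> (H//region, W, region*C)
    return (x.reshape(H // region, region, W, C)
             .transpose(0, 2, 1, 3)
             .reshape(H // region, W, region * C))

def inner_region_restore(y, region, C):
    """Inverse of inner_region_rearrange: unfold channels back into rows."""
    Hr, W, _ = y.shape
    return (y.reshape(Hr, W, region, C)
             .transpose(0, 2, 1, 3)
             .reshape(Hr * region, W, C))

def cross_region_shift(x, step):
    """Cyclic shift along height so that a subsequent inner-region
    rearrangement mixes tokens drawn from different regions."""
    return np.roll(x, shift=step, axis=0)

# Toy usage: H=4, W=2, C=3, regions of 2 rows.
x = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
y = inner_region_rearrange(x, region=2)        # shape (2, 2, 6)
x_back = inner_region_restore(y, region=2, C=3)
assert np.allclose(x, x_back)                  # restore inverts rearrange
shifted = cross_region_shift(x, step=1)        # row 0 now holds old last row
```

In the full model the same pattern is applied along both height and width, and the restore step puts tokens back in place so the spatial layout is preserved for the next block.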
The effectiveness of Hire-MLP is demonstrated across various vision tasks, including image classification, object detection, and semantic segmentation. Notable achievements include:
- Image Classification: Achieving top-1 accuracy of 83.8% on ImageNet using Hire-MLP-Large.
- Object Detection: Outperforming previous models with a box AP of 51.7% and mask AP of 44.8% on the COCO val2017 dataset when applying Hire-MLP-Small as a backbone.
- Semantic Segmentation: Attaining a mean IoU of 49.9% on ADE20K.
These results underscore Hire-MLP's capacity to provide a favorable balance between accuracy and computational cost. It offers substantial throughput improvements over contemporary models by eliminating the computationally heavy self-attention mechanisms characteristic of transformer-based architectures.
Technical Implications and Future Work
The Hire-MLP design adopts a pyramid-like architecture that aggregates feature representations efficiently. This structure is analogous to those found in modern CNNs and hierarchical transformers, making the model well suited as a backbone for diverse downstream tasks. Furthermore, because the hierarchical rearrangement operations are modular, the architecture adapts readily to varied input resolutions.
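To make the pyramid structure concrete, here is a small sketch tracing how spatial resolution shrinks while channel width grows across stages. The depths, channel counts, and downsampling factors below are illustrative placeholders, not the paper's actual Hire-MLP hyperparameters; only the overall pattern (a strided stem followed by stages that halve resolution) reflects the described design.

```python
# Hypothetical four-stage pyramid schedule; all numbers are illustrative,
# not Hire-MLP's published configuration.
stages = [
    {"depth": 2, "channels": 64,  "downsample": 4},  # stem: H/4 x W/4
    {"depth": 2, "channels": 128, "downsample": 2},  # H/8
    {"depth": 6, "channels": 256, "downsample": 2},  # H/16
    {"depth": 2, "channels": 512, "downsample": 2},  # H/32
]

def feature_map_sizes(h, w, stages):
    """Trace (height, width, channels) of the feature map after each stage."""
    sizes = []
    for s in stages:
        h, w = h // s["downsample"], w // s["downsample"]
        sizes.append((h, w, s["channels"]))
    return sizes

print(feature_map_sizes(224, 224, stages))
# -> [(56, 56, 64), (28, 28, 128), (14, 14, 256), (7, 7, 512)]
```

This coarse-to-fine schedule is what lets the backbone feed multi-scale features to detection and segmentation heads, as CNN and transformer pyramids do.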
Future work could explore further optimization of the hierarchical rearrangement itself, potentially integrating additional spatial or temporal context mechanisms. Such efforts could focus on improving computational efficiency without sacrificing the richness of feature representations, making Hire-MLP an even more competitive backbone among vision MLP architectures.