An Image Patch is a Wave: Phase-Aware Vision MLP (2111.12294v5)

Published 24 Nov 2021 in cs.CV

Abstract: In the field of computer vision, recent works show that a pure MLP architecture mainly stacked by fully-connected layers can achieve competing performance with CNN and transformer. An input image of vision MLP is usually split into multiple tokens (patches), while the existing MLP models directly aggregate them with fixed weights, neglecting the varying semantic information of tokens from different images. To dynamically aggregate tokens, we propose to represent each token as a wave function with two parts, amplitude and phase. Amplitude is the original feature and the phase term is a complex value changing according to the semantic contents of input images. Introducing the phase term can dynamically modulate the relationship between tokens and fixed weights in MLP. Based on the wave-like token representation, we establish a novel Wave-MLP architecture for vision tasks. Extensive experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation. The source code is available at https://github.com/huawei-noah/CV-Backbones/tree/master/wavemlp_pytorch and https://gitee.com/mindspore/models/tree/master/research/cv/wave_mlp.

Summary

The paper introduces a novel phase-aware token mixing module that treats each image patch as a wave with amplitude and phase components.
The proposed Wave-MLP architecture outperforms prior vision MLPs; for example, Wave-MLP-S reaches 82.6% ImageNet top-1 accuracy at 4.5 GFLOPs, ahead of models with similar computational cost.
The phase dynamics enable flexible, content-aware feature aggregation, offering promising avenues for extending wave representations to broader neural architectures.

An Image Patch is a Wave: Phase-Aware Vision MLP

The paper "An Image Patch is a Wave: Phase-Aware Vision MLP" introduces a new way to improve Vision Multi-Layer Perceptrons (MLPs) for computer vision tasks. Existing Vision MLP architectures split an image into patches (tokens) and aggregate them with fixed weights, regardless of what each image contains. The paper instead represents tokens as waves, so that the aggregation can adapt to the semantic content of the input.
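As a sketch of the idea in standard polar notation, a wave-like token and the superposition of two such tokens can be written as below. The symbols are chosen here for illustration ($|z_j|$ for the amplitude, $\theta_j$ for the phase, $\odot$ for element-wise multiplication) and may not match the paper's notation exactly.

```latex
\tilde{z}_j = |z_j| \odot e^{i\theta_j}
            = |z_j| \odot \cos\theta_j + i\,|z_j| \odot \sin\theta_j

|z_r| = \sqrt{|z_i|^2 + |z_j|^2 + 2\,|z_i| \odot |z_j| \odot \cos(\theta_j - \theta_i)}
```

The second line is the amplitude of the sum $\tilde{z}_i + \tilde{z}_j$. When the two phases coincide the cross term is largest and the amplitudes add; when they differ by $\pi$ the tokens cancel. This is how a phase term can change the effect of otherwise fixed mixing weights.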

Key Contributions

The authors propose representing each image token as a wave function with amplitude and phase components. The phase lets the model aggregate tokens whose semantic content varies from image to image, which traditional MLP approaches with fixed weights cannot accommodate. Key aspects of this research include:

Wave Representation: Each token is treated as a wave, with the amplitude representing the original feature and the phase offering a dynamic modulating factor. This dual representation introduces a complex-valued domain where the phase aids in dynamically adjusting the aggregation based on semantic content.
Phase-Aware Token Mixing Module (PATM): This module is the core building block of the architecture. It aggregates tokens while taking into account the semantic differences encoded in their phases: each wave-like token is unfolded into real and imaginary parts, which are mixed with fixed weights and summed. As a result, tokens with similar phases reinforce one another, while tokens with dissimilar phases contribute less.
Wave-MLP Architecture: The authors build the Wave-MLP architecture on top of PATM and report that it surpasses existing state-of-the-art Vision MLP models on image classification, object detection, and semantic segmentation. They attribute the gains to the phase-aware feature aggregation.
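The following is a minimal sketch of the mechanism described in the list above, not the authors' implementation. It uses scalar tokens and hard-coded phases, whereas in Wave-MLP the phases are produced from the input by learned layers. The function names `superpose` and `phase_aware_mix` and the weight arguments `w_real` and `w_imag` are hypothetical names introduced for illustration.

```python
import cmath
import math


def superpose(amp_a, phase_a, amp_b, phase_b):
    """Add two waves given in polar form; return (amplitude, phase) of the sum."""
    z = cmath.rect(amp_a, phase_a) + cmath.rect(amp_b, phase_b)
    return abs(z), cmath.phase(z)


def phase_aware_mix(amplitudes, phases, w_real, w_imag):
    """Mix scalar tokens with fixed weights, modulated by per-token phases.

    Each token k is unfolded into a real part a_k*cos(theta_k) and an
    imaginary part a_k*sin(theta_k). Output j is the weighted sum
        o_j = sum_k w_real[j][k] * a_k * cos(theta_k)
                  + w_imag[j][k] * a_k * sin(theta_k)
    (an illustrative simplification; the weights here are plain lists).
    """
    n = len(amplitudes)
    real = [a * math.cos(t) for a, t in zip(amplitudes, phases)]
    imag = [a * math.sin(t) for a, t in zip(amplitudes, phases)]
    return [
        sum(w_real[j][k] * real[k] + w_imag[j][k] * imag[k] for k in range(n))
        for j in range(len(w_real))
    ]


if __name__ == "__main__":
    # Two unit-amplitude tokens: in phase they reinforce, in anti-phase they cancel.
    print(superpose(1.0, 0.0, 1.0, 0.0)[0])
    print(superpose(1.0, 0.0, 1.0, math.pi)[0])

    # The same fixed weights give different outputs once the phases change.
    weights = [[0.5, 0.5]]
    print(phase_aware_mix([1.0, 1.0], [0.0, 0.0], weights, weights))
    print(phase_aware_mix([1.0, 1.0], [0.0, math.pi], weights, weights))
```

The design point is that the weights stay fixed across inputs, as in an ordinary token-mixing MLP, while the phases decide whether two tokens add up or cancel in the aggregated output.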

Numerical Results

The paper presents comprehensive evaluations of the Wave-MLP architecture. Noteworthy results include:

The Wave-MLP-S model achieved 82.6% top-1 accuracy on the ImageNet dataset at 4.5 GFLOPs, outperforming Swin Transformer and other models of similar computational cost.
In dense prediction tasks like object detection on the COCO dataset, the Wave-MLP backbones integrated with detectors such as RetinaNet and Mask R-CNN yielded considerable improvements in Average Precision (AP) compared to counterparts like Swin-T.
On the ADE20K dataset for semantic segmentation, Wave-MLP variants consistently surpassed existing models, reflecting the effectiveness of their dynamic token aggregation strategy.

Implications and Future Directions

The wave representation of tokens holds promise for simplifying and strengthening MLP models for vision tasks. Introducing a phase component expands the expressive capacity of MLPs and may open avenues for exploration in domains beyond computer vision.

Future research may extend phase-aware mechanisms to other neural structures and examine how broadly wave-like representations apply. It also raises the question of whether these phase dynamics can be integrated into traditional architectures such as CNNs or hybrid transformer models for complementary gains.

Conclusion

The proposed Wave-MLP architecture demonstrates that framing tokens as wave-like entities can significantly improve performance across a range of vision tasks. This perspective allows for more nuanced, content-aware feature aggregation and addresses a basic limitation of existing vision MLP models, namely their fixed aggregation weights. As the field progresses, these insights may prompt a re-evaluation of token processing strategies across different architectural paradigms.