Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation (2111.01236v2)

Published 1 Nov 2021 in cs.CV, cs.AI, and cs.LG

Abstract: Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. However, ViTs are mainly designed for image classification that generate single-scale low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for ViTs. Therefore, we propose HRViT, which enhances ViTs to learn semantically-rich and spatially-precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the model performance and efficiency of HRViT by various branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness. Those approaches enabled HRViT to push the Pareto frontier of performance and efficiency on semantic segmentation to a new level, as our evaluation results on ADE20K and Cityscapes show. HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction, demonstrating the potential of HRViT as a strong vision backbone for semantic segmentation.


Summary

The paper’s main contribution is HRViT, a multi-scale vision transformer that fuses high-resolution features via a multi-branch architecture and augmented local self-attention.
It employs novel mixed-scale convolutional feedforward networks and efficient patch embedding to enhance feature extraction while maintaining low computational costs.
HRViT reaches 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes while using 28% fewer parameters and 21% fewer FLOPs than MiT and CSWin backbones, which suits dense prediction workloads in applications such as AR/VR.

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

The paper "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation" presents HRViT, an innovative vision Transformer architecture designed to address challenges in semantic segmentation. Vision Transformers (ViTs) have shown remarkable performance in image classification tasks, overtaking traditional convolutional neural networks (CNNs) in expressiveness and flexibility. However, their single-scale, low-resolution representations pose a significant hurdle when applied to dense prediction tasks like semantic segmentation, which demands high spatial precision and multi-scale semantic understanding.

Key Contributions

HRViT improves upon existing ViT architectures by integrating high-resolution multi-branch architectures and various optimization techniques to enhance performance and efficiency. The paper outlines several critical innovations:

Multi-Branch Parallel Architecture: Inspired by HRNet, HRViT employs a multi-branch architecture that maintains high-resolution features throughout the network. The parallel branches exchange information across resolutions, so high-level semantics and fine spatial detail are fused consistently.
Augmented Local Self-Attention: The proposed attention block uses key-value sharing to remove redundancy and reduce computation, and adds a parallel convolutional path to strengthen local feature aggregation and the expressiveness of the learned features.
Mixed-Scale Convolutional Feedforward Networks (MixCFN): These networks leverage mixed-scale depth-wise convolutions to enrich local information extraction across different scales, further bolstering the model's capacity for nuanced feature representation.
Efficient Patch Embedding & Dense Fusion Layers: By simplifying the patch embedding and optimizing fusion layers, HRViT reduces overhead without compromising feature richness, favorably balancing model efficiency and performance.
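
To make the key-value sharing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "sharing" means a single projection produces one tensor that serves as both key and value. The function names (`matmul`, `softmax`, `shared_kv_attention`) and the weight names (`w_q`, `w_kv`) are hypothetical, and the sketch omits local windowing, multiple heads, and the parallel convolutional path. It uses plain Python lists so that it runs without dependencies.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def shared_kv_attention(x, w_q, w_kv):
    """Single-head attention in which one projection serves as key AND value.

    x: tokens as rows, shape (n, d_in); w_q and w_kv: shape (d_in, d).
    Assumption of this sketch: no windowing, no head split, no conv path.
    """
    q = matmul(x, w_q)
    kv = matmul(x, w_kv)                    # shared tensor: K = V = kv
    scale = 1.0 / math.sqrt(len(q[0]))
    kv_t = [list(col) for col in zip(*kv)]  # transpose so q @ kv^T is defined
    scores = matmul(q, kv_t)                # (n, n) similarity matrix
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, kv)              # weighted sum of the shared tensor


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(shared_kv_attention(tokens, identity, identity))

    # Illustrative count of projection weights only, for a hypothetical width d.
    d = 64
    print("separate Q, K, V projections:", 3 * d * d)
    print("Q plus shared K/V projection:", 2 * d * d)
```

Under this assumption the block needs two projection matrices instead of three, which is one way a shared key-value tensor can remove redundancy in the linear layers.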

Numerical Results and Implications

HRViT demonstrates strong empirical performance on standard benchmarks, reaching 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes. These results surpass state-of-the-art ViT backbones such as MiT (the SegFormer backbone) and CSWin, with an average improvement of +1.78 mIoU and gains of up to 2.26 mIoU over its best competitors. Relative to those backbones, HRViT also uses 28% fewer parameters and 21% fewer FLOPs, so the accuracy gains do not come at the cost of efficiency.
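
For readers unfamiliar with the metric, mIoU (mean intersection over union) averages the per-class ratio of correctly labeled pixels to the pixels labeled as that class in either the prediction or the ground truth. The sketch below computes it for a single flattened label map. The function name `mean_iou` is hypothetical, and skipping classes that are absent from both inputs is a convention assumed here. Benchmark toolkits typically accumulate intersections and unions over the whole dataset before averaging, so this illustrates the formula rather than reproducing any evaluation script.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length sequences of class ids."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # assumption: skip classes absent from both inputs
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class in range(num_classes) appears in the inputs")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU is 1/3, 2/3 and 1/2, so the mean is about 0.5.
    print(mean_iou(pred, target, num_classes=3))
```

A reported score such as 50.20% mIoU corresponds to a value of about 0.502 on this scale.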

The implications of these findings are substantial for both practical and theoretical pursuits in AI. Practically, HRViT offers a potent, efficient solution for semantic segmentation tasks in settings that require real-time processing, such as augmented reality (AR) and virtual reality (VR) applications. Theoretically, it reaffirms the potential of combining high-resolution architectures with attention mechanisms for dense prediction tasks, pushing the research boundaries of ViTs beyond image classification.

Future Directions

Future research may explore the adaptability of HRViT across other dense prediction tasks, such as object detection and instance segmentation. Additionally, investigating the scalability of HRViT in distributed and edge AI environments could yield insights into its real-world applications and limitations. The efficient architectural design principles espoused by HRViT may inspire further innovations in ViT frameworks stressing multi-resolution efficiency and high-quality feature extraction.