Scaling Vision with Sparse Mixture of Experts

Published 10 Jun 2021 in cs.CV, cs.LG, and stat.ML | (2106.05974v1)

Abstract: Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.

Summary

The paper introduces the V-MoE architecture, which replaces some dense feedforward layers of the Vision Transformer with sparse expert layers to scale vision models efficiently.
It matches state-of-the-art dense networks on image recognition while requiring as little as half of the compute at inference time.
An extension to the routing algorithm enables a smooth trade-off between compute and accuracy at test time, and a 15-billion parameter V-MoE attains 90.35% on ImageNet.

The paper "Scaling Vision with Sparse Mixture of Experts" adapts sparse Mixture of Experts (MoE) architectures to vision tasks as a way of scaling vision models. Sparse MoEs have already succeeded in NLP, where they provide large model capacity at reduced computational cost, but in computer vision almost all performant networks are still dense: every input is processed by every parameter. The paper proposes the Vision Mixture of Experts (V-MoE), a sparse variant of the Vision Transformer (ViT), and shows that it can rival the largest dense networks in performance while reducing computational requirements.

Key Contributions

The paper's contributions can be summarized as follows:

V-MoE Architecture: The proposed V-MoE replaces some of the dense feedforward layers in a ViT with sparse MoE layers, in which each image patch is routed to a subset of experts. This increases model capacity without a matching increase in per-patch compute.
Efficient Inference: V-MoEs match state-of-the-art dense networks on image recognition tasks while requiring as little as half of the compute at inference time.
Adaptive Compute: An extension to the routing algorithm prioritizes subsets of each input across the entire batch. This gives adaptive per-image compute and lets a trained model trade off performance against cost smoothly at test time.
Scalability: The authors train a 15-billion parameter model that attains 90.35% accuracy on ImageNet classification, demonstrating the potential of sparse models to scale vision networks.
Batch Prioritized Routing: This is the routing extension behind the adaptive-compute result. It prioritizes important image patches, so that compute on uninformative patches can be reduced and resources saved.
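
The routing described in the first contribution can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a softmax gate followed by top-k selection, uses ReLU experts for brevity, and ignores expert capacity limits. The names `moe_layer`, `router_w`, `expert_w1`, and `expert_w2` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, expert_w1, expert_w2, k=2):
    """Sparse MoE feedforward layer (illustrative sketch).

    tokens:    (n, d)    patch representations
    router_w:  (d, E)    router weights (assumed linear router)
    expert_w1: (E, d, h) first dense layer of each expert
    expert_w2: (E, h, d) second dense layer of each expert
    """
    gates = softmax(tokens @ router_w)           # (n, E) routing weights
    topk = np.argsort(-gates, axis=-1)[:, :k]    # (n, k) selected experts
    out = np.zeros_like(tokens)
    for e in range(router_w.shape[1]):
        mask = (topk == e).any(axis=-1)          # patches routed to expert e
        if not mask.any():
            continue                             # expert e does no work
        hidden = np.maximum(tokens[mask] @ expert_w1[e], 0.0)  # ReLU (assumption)
        out[mask] += gates[mask, e][:, None] * (hidden @ expert_w2[e])
    return out, topk

rng = np.random.default_rng(0)
n, d, h, E = 6, 4, 8, 3
tokens = rng.normal(size=(n, d))
out, topk = moe_layer(
    tokens,
    rng.normal(size=(d, E)),
    rng.normal(size=(E, d, h)),
    rng.normal(size=(E, h, d)),
    k=2,
)
```

Each patch is processed by only `k` of the `E` experts, so adding experts increases the parameter count while the work per patch stays roughly constant.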

Technical Insights

Conditional Computation and MoEs

The V-MoE uses conditional computation to improve efficiency, a technique that is well established in NLP but less explored in vision. Because each image patch is routed to only a subset of experts, only a fraction of the model's parameters is evaluated for any given patch. Total capacity can therefore grow with the number of experts while the compute per input stays nearly constant, which mirrors the strategy used by sparse MoE models in NLP.
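
To make the capacity-versus-compute argument concrete, the following sketch counts the parameters of one MoE feedforward layer and the parameters actually touched per patch. The layer sizes are illustrative and are not taken from the paper; the function name `moe_ffn_params` is hypothetical, and each expert is assumed to be a two-layer MLP with biases.

```python
def moe_ffn_params(d_model, d_hidden, num_experts, k):
    """Return (total, active-per-patch) parameter counts for one MoE layer."""
    # Two dense layers with biases per expert (assumption).
    per_expert = 2 * d_model * d_hidden + d_hidden + d_model
    router = d_model * num_experts               # linear router, no bias
    total = num_experts * per_expert + router
    active = k * per_expert + router             # only k experts run per patch
    return total, active

# Illustrative sizes only.
total, active = moe_ffn_params(d_model=4, d_hidden=8, num_experts=32, k=2)
fraction_active = active / total
```

With these toy numbers the layer holds 2560 parameters but each patch uses only 280 of them. The gap widens as more experts are added, since `active` grows only through the router term.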

Practical Implications

The introduction of V-MoEs marks a significant step in efficient large-scale vision modeling. Notably, Batch Prioritized Routing allows inference cost to be adjusted without further training, which is a compelling feature for practical deployment. The same adaptability can also lower the energy consumed at inference time, which is relevant given growing concern about the environmental cost of large models.
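
A rough sketch of how prioritized routing can trade compute for accuracy is given below. It is an illustration under stated assumptions rather than the paper's algorithm: the priority of a patch is taken to be its largest routing weight, every expert has the same fixed `capacity`, and patches that find their expert full are simply skipped. The function name and parameters are hypothetical.

```python
import numpy as np

def batch_prioritized_assign(gates, k, capacity):
    """Assign patches to experts in priority order (illustrative sketch).

    gates:    (n, E) routing weights for all n patches in the batch
    k:        number of experts selected per patch
    capacity: maximum number of patches each expert may process
    Returns a boolean (n, E) dispatch mask.
    """
    n, num_experts = gates.shape
    topk = np.argsort(-gates, axis=-1)[:, :k]        # (n, k) chosen experts
    priority = gates.max(axis=-1)                    # assumed priority score
    order = np.argsort(-priority, kind="stable")     # most important first
    load = np.zeros(num_experts, dtype=int)
    dispatch = np.zeros((n, num_experts), dtype=bool)
    for choice in range(k):          # all first choices before second choices
        for t in order:
            e = topk[t, choice]
            if load[e] < capacity:   # expert still has room
                dispatch[t, e] = True
                load[e] += 1
    return dispatch

gates = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.8, 0.2],
                  [0.3, 0.7]])
dispatch = batch_prioritized_assign(gates, k=1, capacity=1)
```

In this toy batch three patches compete for expert 0, and only the one with the highest routing weight is processed. Lowering `capacity` at test time reduces the number of patches processed, and therefore the compute, without retraining.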

Performance Analysis

V-MoE models outperformed dense equivalents on upstream and transfer learning tasks across several benchmarks. With careful architectural choices, including placing MoEs selectively and employing auxiliary loss functions for load balancing, V-MoEs demonstrate stable training dynamics and strong transfer capabilities.
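
The summary does not specify the auxiliary losses. One form commonly used in the sparse MoE literature penalizes the squared coefficient of variation of per-expert importance, where importance is the total routing weight an expert receives over the batch. The sketch below assumes that form; `importance_loss` is a hypothetical name.

```python
import numpy as np

def importance_loss(gates):
    """Squared coefficient of variation of per-expert importance.

    gates: (n, E) routing weights for a batch of n patches.
    The loss is 0 when every expert receives the same total weight.
    """
    importance = gates.sum(axis=0)                   # (E,) total weight per expert
    return (importance.std() / importance.mean()) ** 2

balanced = np.full((4, 2), 0.5)                      # both experts used equally
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])       # everything goes to expert 0
loss_balanced = importance_loss(balanced)
loss_collapsed = importance_loss(collapsed)
```

Adding such a term to the training objective discourages the router from collapsing onto a few experts, which is the failure mode load balancing is meant to prevent.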

Crucially, the study shows that V-MoEs are competitive on both axes: they reduce computational cost while matching state-of-the-art dense networks and outperforming their dense counterparts.

Future Directions

The exploration of V-MoEs opens several avenues for future research. Potential directions include refining the routing mechanisms for greater efficiency, extending the approach to other modalities such as video and multimodal data, and employing heterogeneous expert architectures. The work also encourages further study of sparse model designs that depend less on large-scale datasets and support more data-efficient training.

Conclusion

This paper demonstrates that sparse Mixture of Experts models can be applied successfully to computer vision, with clear gains in scalability and computational efficiency. The architectural and algorithmic ideas behind V-MoE, in particular sparse expert layers in a ViT and Batch Prioritized Routing, offer a practical basis for efficient large-scale vision modeling.
