Emergent Mind

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

(arXiv:2407.19985)
Published Jul 29, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining the same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, so that redundant tokens are processed through cheaper nested experts. Using this framework, we achieve performance equivalent to the baseline models while reducing inference-time compute by more than two-fold. We validate our approach on standard image and video datasets: ImageNet-21K, Kinetics-400, and Something-Something-v2. We further highlight MoNE's adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.

Figure: MoNE's token importance. Fewer image tokens are processed as the threshold on the router logits increases.

Overview

  • The paper introduces Mixture of Nested Experts (MoNE), an adaptive computational framework for processing visual tokens that leverages nested submodels to efficiently manage computational resources within a fixed parameter space.

  • MoNE utilizes the Expert Preferred Routing (EPR) algorithm to dynamically assign visual tokens to appropriate nested experts based on their computed priority, optimizing the use of more expensive models for important tokens and simpler submodels for less critical ones.

  • Experimental results on datasets such as ImageNet-21k and Kinetics-400 show that MoNE outperforms existing approaches like MatViT and Mixture of Depths, reducing inference-time compute by more than two-fold while matching baseline performance, demonstrating its effectiveness in both image and video classification tasks.

Introduction

Visual data, particularly images and videos, contain substantial redundancy, yet existing models such as Vision Transformers (ViTs) process every token with equal emphasis, leading to inefficiencies. The "Mixture of Experts" (MoE) framework addresses some scalability issues, but at the cost of a larger parameter footprint that increases storage and serving requirements. This paper introduces the "Mixture of Nested Experts" (MoNE) framework, which dynamically allocates computational resources within a fixed parameter space, leveraging nested submodels to process visual tokens more efficiently.

Methodology

The MoNE framework builds on two prevailing concepts: nested models and mixture of experts. Nested models, such as those used in MatFormer, enable the extraction of multiple representations from the same data by structured slicing of the parameter space. In MoNE, this nested architecture allows different experts, realized as submodels that share parameters but use progressively larger slices of them, to operate within the same overall model footprint. The distinct contribution of this framework lies in its adaptive processing capability: a router network dynamically assigns tokens to different nested experts based on their computed priority.
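
To make the nested-expert idea concrete, here is a minimal sketch, assuming the experts are obtained MatFormer-style by slicing a prefix of the full model's weight matrices. The NestedMLP class, its two-layer structure, and the widths are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of nested experts obtained by slicing shared weights
# (MatFormer-style). The NestedMLP class, its two-layer structure, and the
# widths below are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class NestedMLP(nn.Module):
    """One MLP whose first `dims[i]` hidden units act as the i-th nested expert."""

    def __init__(self, d_model: int = 512, dims=(128, 256, 512)):
        super().__init__()
        self.dims = dims                      # nested widths, smallest to largest
        self.fc1 = nn.Linear(d_model, dims[-1])
        self.fc2 = nn.Linear(dims[-1], d_model)

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        d = self.dims[expert_idx]
        # Smaller experts reuse a prefix of the full model's parameters,
        # so no additional parameters are stored.
        h = torch.relu(x @ self.fc1.weight[:d].T + self.fc1.bias[:d])
        return h @ self.fc2.weight[:, :d].T + self.fc2.bias


tokens = torch.randn(4, 512)          # a few visual tokens
mlp = NestedMLP()
cheap = mlp(tokens, expert_idx=0)     # smallest nested expert
full = mlp(tokens, expert_idx=2)      # full-width expert, same parameter set
```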

Expert Preferred Routing (EPR)

The adaptive assignment of tokens to nested experts in MoNE is driven by the Expert Preferred Routing (EPR) algorithm. EPR ranks tokens by importance using router predictions and greedily assigns them to experts, letting the larger experts pick the highest-ranked tokens first up to a capacity distribution that meets the computational constraints. This dynamic routing ensures that only the most informative tokens are processed by the more computationally expensive models, while less critical tokens are handled by smaller, cheaper submodels.
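
The following is a hedged sketch of greedy, capacity-constrained routing in the spirit of EPR: router scores rank the tokens, the largest expert takes the highest-scoring tokens up to its capacity, and the remaining tokens cascade down to smaller experts. The capacity fractions and the single-linear-layer router are illustrative assumptions.

```python
# Hedged sketch of greedy, capacity-constrained routing in the spirit of
# Expert Preferred Routing (EPR). The capacity fractions and the simple
# linear router are illustrative assumptions.
import torch
import torch.nn as nn


def route_tokens(router_logits: torch.Tensor, capacities=(0.5, 0.25, 0.25)):
    """Assign each token to a nested expert.

    router_logits: (num_tokens,) importance scores from the router.
    capacities:    fraction of tokens per expert, ordered from the smallest
                   (cheapest) to the largest (most expensive) expert.
    The most important tokens are given to the largest expert first.
    """
    n = router_logits.shape[0]
    order = torch.argsort(router_logits, descending=True)   # most important first
    assignment = torch.empty(n, dtype=torch.long)
    start = 0
    for expert_idx in reversed(range(len(capacities))):      # largest expert picks first
        count = n - start if expert_idx == 0 else round(capacities[expert_idx] * n)
        assignment[order[start:start + count]] = expert_idx
        start += count
    return assignment


router = nn.Linear(512, 1)                    # token-importance router
tokens = torch.randn(16, 512)
logits = router(tokens).squeeze(-1)
assignment = route_tokens(logits)             # e.g. top 25% of tokens -> largest expert
```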

Experimental Results

The practical benefits of MoNE were validated across standard image and video datasets, including ImageNet-21k, Kinetics-400, and Something-Something-v2. The results demonstrate that MoNE matches baseline performance while reducing inference-time compute by more than two-fold.

Image Classification

The framework was tested with three model sizes (S, B, and L) on ImageNet-21k. MoNE outperforms both MatViT's nested submodels and Mixture of Depths (MoD), especially at lower FLOP budgets. The S/16 model, in particular, benefited from isoFLOPs training, in which the number of training epochs is increased so that the total training FLOPs match those of the baseline models (for example, a model running at roughly half the per-epoch FLOPs can train for roughly twice as many epochs), yielding additional performance gains.

Video Classification

For video processing, MoNE was integrated with the Factorized Encoder architecture of ViViT. It achieves significant computational savings, reducing FLOPs by more than two-fold while maintaining baseline performance on Kinetics-400 and Something-Something-v2.

Adaptability and Capacity Distribution

MoNE's ability to adapt to varying inference-time budgets was examined. Models trained at specific capacities retained strong performance when evaluated at nearby capacities, solidifying the framework's applicability in dynamic environments. Moreover, training a single model with capacities sampled randomly ensured adaptability across a wide range of inference-time budgets without retraining.
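
As a rough illustration of this training recipe, the sketch below samples a fresh capacity split over the nested experts at every training step, so a single trained model can later be deployed at different inference-time budgets. The Dirichlet parameterization is an assumption for illustration, not the paper's exact sampling scheme.

```python
# Illustrative sketch of training with randomly sampled capacities: each
# step draws a different compute budget, so a single trained model can be
# run across a range of inference-time budgets. The Dirichlet
# parameterization is an assumption, not the paper's exact sampling scheme.
import torch


def sample_capacities(num_experts: int = 3, concentration: float = 1.0) -> torch.Tensor:
    """Draw a random fraction of tokens for each nested expert (sums to 1)."""
    alpha = torch.full((num_experts,), concentration)
    return torch.distributions.Dirichlet(alpha).sample()


for step in range(3):
    capacities = sample_capacities()
    # ... run the MoNE forward pass and loss with these per-expert fractions ...
    print(f"step {step}: capacities = {[round(c, 2) for c in capacities.tolist()]}")
```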

Analysis and Visualizations

Ablations on router placement and count indicated that placing a single router at the first layer and propagating its decisions to subsequent layers yields the best performance. Visualizations show that the tokens processed by the largest nested expert correspond to regions of interest, confirming that processing capacity is allocated to the most informative tokens.
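
Continuing the NestedMLP and route_tokens sketches above (both illustrative assumptions), the snippet below shows the route-once pattern this finding suggests: the router runs once on the input tokens, and every layer reuses that assignment instead of re-routing.

```python
# Sketch of "route once, reuse everywhere": a single router at the first
# layer picks an expert per token, and all subsequent layers reuse that
# assignment. Builds on the NestedMLP and route_tokens sketches above;
# layer count and structure are illustrative assumptions.
import torch
import torch.nn as nn

num_layers, d_model = 4, 512
layers = nn.ModuleList([NestedMLP(d_model) for _ in range(num_layers)])
router = nn.Linear(d_model, 1)

tokens = torch.randn(16, d_model)
assignment = route_tokens(router(tokens).squeeze(-1))   # decided once, at the first layer

x = tokens
for layer in layers:
    out = torch.zeros_like(x)
    for expert_idx in assignment.unique():
        mask = assignment == expert_idx
        # Every layer reuses the first-layer routing decision.
        out[mask] = layer(x[mask], expert_idx=int(expert_idx))
    x = x + out                                          # residual connection
```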

Conclusion

MoNE represents a significant step towards efficient processing in vision transformers by combining adaptive computation with nested submodels. The framework delivers computational efficiency within a fixed parameter budget, making it suitable for real-world applications where computational resources are constrained.

Future Directions

Future research could explore extending MoNE to tasks such as object detection and captioning. Furthermore, adapting MoNE for auto-regressive language models presents an intriguing yet non-trivial challenge, opening new avenues for optimization in large-scale language models. By providing a more energy-efficient inference method, MoNE also contributes to the broader goal of sustainable AI deployment.

Societal Impact

MoNE has the potential to reduce the carbon footprint associated with model deployment by dynamically allocating resources based on computational budgets. This can democratize access to high-performance models, fostering inclusivity in AI advancements.
