
Abstract

Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency. In MoE, each token in the input sequence activates a different subset of experts determined by a routing mechanism. However, the unchosen experts in MoE models do not contribute to the output, potentially leading to underutilization of the model's capacity. In this work, we first conduct exploratory studies to demonstrate that increasing the number of activated experts does not necessarily improve and can even degrade the output quality. Then, we show that output distributions from an MoE model using different routing strategies substantially differ, indicating that different experts do not always act synergistically. Motivated by these findings, we propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference. In SCMoE, the next-token probabilities are determined by contrasting the outputs from strong and weak activation using the same MoE model. Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding. Experiments on several benchmarks (GSM8K, StrategyQA, MBPP and HumanEval) demonstrate that SCMoE can consistently enhance Mixtral 8x7B's reasoning capability across various domains. For example, it improves the accuracy on GSM8K from 61.79 to 66.94. Moreover, combining SCMoE with self-consistency yields additional gains, increasing major@20 accuracy from 75.59 to 78.31.

Figure: SCMoE outperforms ensemble routing across various benchmarks.

Overview

  • Mixture-of-Experts (MoE) models activate only a subset of their parameters (experts) for each input token, so the unchosen experts contribute nothing to the output.

  • Self-Contrast Mixture-of-Experts (SCMoE) is a training-free strategy that leverages these unchosen experts at the inference stage to refine predictions without significant computational overhead.

  • SCMoE improves performance across various benchmarks by using a contrastive approach between strong and weak activation strategies, demonstrating enhanced efficiency and accuracy.

Exploring the Self-Contrast Mixture-of-Experts (SCMoE) Model

Introduction to Mixture-of-Experts (MoE) Models

In the world of LLMs, balancing the computational cost with model performance is paramount. Mixture-of-Experts (MoE) has become a widely recognized approach to address this challenge. Here's the basic idea: instead of utilizing all the parameters in a model for every input token, MoE models activate a subset of parameters, known as "experts," according to a routing mechanism.

Think of it like going to a hospital where specialists (experts) handle different tasks based on what you need. In the MoE model's case, though, some "specialists" are left idle during the diagnosis. This leads to the question: can we make better use of these unchosen experts?
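To make the routing mechanism concrete, here is a minimal sketch of a Mixtral-style top-2 softmax router for a single token's hidden state. It is illustrative only: the `router` and `experts` objects are assumed inputs, not the actual Mixtral implementation.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, top_k=2):
    """Route one token's hidden state x through the top_k highest-scoring experts.

    router:  a linear layer mapping hidden_dim -> num_experts (gating scores)
    experts: a list of per-expert feed-forward modules
    (Both are assumed to be provided by the surrounding model; this is a sketch,
    not the reference implementation.)
    """
    scores = router(x)                          # (num_experts,) gating logits
    top_vals, top_idx = torch.topk(scores, top_k)
    weights = F.softmax(top_vals, dim=-1)       # normalize over the chosen experts only
    # Weighted sum of the chosen experts' outputs; unchosen experts are never run.
    return sum(w * experts[i](x) for w, i in zip(weights, top_idx.tolist()))
```

The key point is the last line: only the selected experts are executed, which is exactly why the unchosen experts' capacity goes unused.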

What SCMoE Brings to the Table

Researchers have found that just activating more experts doesn't always result in better performance and can sometimes even degrade the output quality. This discovery led to the introduction of Self-Contrast Mixture-of-Experts (SCMoE). The essence of SCMoE is simple yet clever: take advantage of unchosen experts at the inference stage to refine predictions without increasing the complexity or computation time significantly.

How SCMoE Works

Method: The Nuts and Bolts

SCMoE leverages the discrepancies in the outputs from different routing strategies. Specifically, it contrasts the outputs from a strong activation (e.g., top-2 experts) with a weak activation (e.g., rank-2 expert). This method is both conceptually intuitive and computationally efficient.

Here's a step-by-step breakdown:

  1. MoE Model Structure:

    • Typical MoE models consist of a router and multiple experts. The router decides which experts to activate based on the input.
  2. Top-2 Routing:

    • The default routing strategy (strong activation) where the top 2 experts with the highest scores are activated.
  3. Rank-k Routing:

    • A weak activation strategy where the k-th ranked expert (based on initial scores) is activated.
  4. SCMoE Implementation:

    • During inference, SCMoE utilizes both the top-2 and rank-k routing outputs.
    • It contrasts these outputs by adjusting the next-token probabilities based on the difference between the strong and weak activation logits (a code sketch of this step follows the list).
  5. Performance Boost:

    • This contrastive approach significantly enhances reasoning capabilities across multiple benchmarks. For example, SCMoE improved the accuracy of the Mixtral 8x7B model on the GSM8K benchmark from 61.79 to 66.94, and further improvements were observed when it was combined with self-consistency strategies.
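The contrastive step in item 4 can be sketched as follows. This is one plausible formulation in the spirit of contrastive decoding, assuming the weak (rank-k) logits are penalized relative to the strong (top-2) logits with a contrast strength `beta` and a plausibility cutoff `alpha`; the exact scoring rule and hyperparameters are assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def scmoe_next_token(strong_logits, weak_logits, beta=0.5, alpha=0.1):
    """One greedy decoding step contrasting strong (top-2) and weak (rank-k) routing.

    strong_logits, weak_logits: (vocab_size,) logits from the same MoE model
    under the two routing strategies. beta and alpha are illustrative values.
    """
    log_p_strong = F.log_softmax(strong_logits, dim=-1)
    log_p_weak = F.log_softmax(weak_logits, dim=-1)

    # Restrict candidates to tokens the strong activation already finds plausible.
    plausible = log_p_strong >= log_p_strong.max() + torch.log(torch.tensor(alpha))

    # Contrastive score: reward tokens the strong routing prefers over the weak one.
    scores = log_p_strong - beta * log_p_weak
    scores = scores.masked_fill(~plausible, float("-inf"))
    return torch.argmax(scores).item()
```

Because both forward passes reuse the same model weights and differ only in which experts are activated, the extra cost stays small, which is consistent with the modest latency overhead reported below.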

Experimental Insights

Results that Stand Out

The experiments with the Mixtral 8x7B model using SCMoE presented some compelling results:

  • GSM8K: Accuracy improved from 61.79 to 66.94.
  • StrategyQA: Accuracy improved from 72.83 to 76.29.
  • MBPP (code generation): Pass@1 accuracy increased from 46.20 to 48.80.
  • HumanEval: Pass@1 accuracy jumped from 33.54 to 41.46.

These improvements might seem modest at first glance, but they are meaningful gains, especially since they were achieved without any additional training and with only a small increase in inference cost.

Practical Implications

Beyond proving efficacy, SCMoE also incurs minimal latency overhead (about 1.30x that of greedy decoding), making it a feasible option for real-world applications where both efficiency and accuracy are critical.

Beyond SCMoE: Implications and Future Directions

SCMoE demonstrates that unchosen experts can indeed be leveraged effectively, contradicting the assumption that more activation equals better performance. This has several important implications:

  1. Enhanced Efficiency:

    • SCMoE provides a way to utilize existing model capacity more efficiently without needing additional computation resources.
  2. Model Adaptability:

    • It presents a potential pathway for adapting similar self-contrast techniques to other MoE-based models, such as DeepSeekMoE-16B.
  3. Future Research:

    • Further exploration into optimizing strong activation strategies or combining SCMoE with self-consistency could yield even better results.
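As a rough illustration of the last point, self-consistency can be layered on top of SCMoE by sampling several SCMoE-decoded answers and taking a majority vote (maj@n). The helper `sample_answer_with_scmoe` below is hypothetical and stands in for a sampling-based SCMoE generation loop.

```python
from collections import Counter

def self_consistency(question, sample_answer_with_scmoe, n_samples=20):
    """Majority voting over answers sampled with SCMoE decoding.

    sample_answer_with_scmoe: hypothetical callable that runs one sampled
    SCMoE generation for `question` and returns the extracted final answer.
    """
    answers = [sample_answer_with_scmoe(question) for _ in range(n_samples)]
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer
```

With `n_samples=20`, this corresponds to the major@20 setting reported in the abstract, where accuracy rose from 75.59 to 78.31.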

Conclusion

Self-Contrast Mixture-of-Experts (SCMoE) offers an innovative, training-free way to improve MoE model performance by making smart use of unchosen experts. It has shown promising results across various benchmarks with little added computational cost, highlighting its practical relevance. As MoE models continue to evolve, techniques like SCMoE can play a crucial role in shaping more efficient and powerful AI systems.
