
Abstract

By increasing the number of model parameters while activating only a sparse subset of them for each input, the Mixture-of-Experts (MoE) architecture significantly improves the performance of LLMs without increasing the inference cost. However, the memory consumption due to the growing number of experts presents a challenge to the deployment of these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method that groups and prunes similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning two state-of-the-art MoE models, Mixtral-8x7B and Mixtral-8x22B. Evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. To facilitate future research, we will release our code and the pruned MoE models.

MoE layer pruning by merging experts improves efficiency, preserves knowledge, and reduces computation and storage.

Overview

  • The paper introduces a task-agnostic pruning method for Mixture-of-Experts (MoE) models to enhance parameter efficiency and reduce memory costs in LLMs.

  • The proposed method involves two main stages: using Centered Kernel Alignment (CKA) for expert similarity estimation, and grouping and merging similar experts to minimize redundancy.

  • Extensive experiments on state-of-the-art models like Mixtral-8x7B and Mixtral-8x22B show that the method outperforms existing pruning techniques, maintaining high performance while reducing the number of experts.

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

The paper under review presents a novel methodology for pruning Mixture-of-Experts (MoE) models in a task-agnostic manner, addressing a critical challenge in the deployment of LLMs. The authors propose a method to enhance parameter efficiency by grouping and pruning redundant experts within MoE layers. This approach is empirically validated on state-of-the-art models such as Mixtral-8x7B and Mixtral-8x22B, demonstrating superior performance over existing pruning techniques.

Introduction

LLMs have shown significant advancements by scaling parameters through architectures like the sparsely-activated MoE. Despite their high performance, the large number of experts in MoE models incurs substantial memory costs, impeding their practicality in real-world applications. This paper introduces a pruning method that does not rely on task-specific information, making it more versatile and broadly applicable.

Methodology

The core of the proposed method revolves around identifying and pruning redundant experts in a task-agnostic fashion. The approach comprises two main stages:

  1. Expert Similarity Estimation:

    • Use Centered Kernel Alignment (CKA) to quantify the similarity between experts within the same MoE layer. This metric captures how similarly different experts respond to the same input data.
  2. Pruning and Merging Experts:

    • Group similar experts into clusters based on a graph partitioning algorithm. Each group of similar experts is then merged into a single expert, along with their corresponding routing weights.

This two-step strategy retains as much of the original knowledge encoded in the experts as possible while reducing redundancy and memory usage; a code sketch of the pipeline follows below.
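The summary does not include the authors' reference code, so the following is only a minimal sketch of the two-stage idea under simple assumptions: linear CKA is computed on each expert's outputs for a shared calibration batch, experts are grouped by a greedy average-linkage clustering (a stand-in for the paper's graph partitioning algorithm), and each group is merged by uniformly averaging expert parameters. All function names, the greedy grouping, and the averaging-based merge are illustrative choices, not the authors' exact implementation.

```python
import torch


def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, d)."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    xty = (X.T @ Y).norm(p="fro") ** 2  # HSIC-style numerator (linear kernel)
    xtx = (X.T @ X).norm(p="fro") ** 2
    yty = (Y.T @ Y).norm(p="fro") ** 2
    return (xty / (xtx.sqrt() * yty.sqrt())).item()


def expert_similarity(expert_outputs: list[torch.Tensor]) -> torch.Tensor:
    """Pairwise CKA between experts' outputs on the same calibration inputs."""
    n = len(expert_outputs)
    sim = torch.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(expert_outputs[i], expert_outputs[j])
    return sim


def greedy_groups(sim: torch.Tensor, n_groups: int) -> list[list[int]]:
    """Toy stand-in for graph partitioning: repeatedly merge the two most
    similar clusters (average linkage) until n_groups clusters remain."""
    clusters = [[i] for i in range(sim.shape[0])]
    while len(clusters) > n_groups:
        best, pair = -1.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sim[clusters[a]][:, clusters[b]].mean().item()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters


def merge_experts(experts: list[torch.nn.Module], groups: list[list[int]]):
    """Merge each group into one expert by averaging parameters (the paper
    additionally merges the corresponding routing weights)."""
    merged = []
    for group in groups:
        base = experts[group[0]]
        avg_state = {
            k: torch.stack([experts[i].state_dict()[k] for i in group]).mean(dim=0)
            for k in base.state_dict()
        }
        base.load_state_dict(avg_state)
        merged.append(base)
    return merged
```

In practice, `expert_outputs` would be collected by forwarding a calibration set through one MoE layer and recording each expert's output on the same tokens; the merged experts then replace the originals in that layer, shrinking its parameter count while keeping the router's top-k behavior intact.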

Experimental Results

The authors conducted extensive experiments to validate their method. The main evaluation metric was zero-shot performance on standard benchmarks such as MMLU, BoolQ, OpenBookQA, and RTE. The key results are summarized as follows:

  • Mixtral-8x7B: The proposed method outperforms existing pruning strategies by an average margin of 1.5%, maintaining competitive performance despite the reduced number of experts.
  • Mixtral-8x22B: The approach using surrogate weight representations achieves the best results, with only a 2.8% average performance drop compared to the full model.

Empirical Analysis

The paper also provides a detailed empirical analysis of expert behavior before and after pruning. By comparing how frequently tokens are routed to each expert, the authors illustrate that their pruning method reduces expert redundancy while preserving the diversity of task-specific knowledge.
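As a rough illustration of this kind of analysis (not the authors' code), the snippet below counts how often a top-k router selects each expert over a calibration batch; comparing these histograms before and after merging gives a picture of expert redundancy. The router-logits shape and the top-2 routing are assumptions based on Mixtral-style MoE layers, and the random logits stand in for real gate scores.

```python
import torch


def expert_visit_counts(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Count how many tokens are routed to each expert.

    router_logits: (num_tokens, num_experts) gate scores for one MoE layer,
    collected while running the model on a calibration set.
    """
    num_experts = router_logits.shape[-1]
    top_experts = router_logits.topk(top_k, dim=-1).indices  # (num_tokens, top_k)
    return torch.bincount(top_experts.flatten(), minlength=num_experts)


# Example: compare routing distributions before and after pruning.
num_tokens = 10_000
logits_before = torch.randn(num_tokens, 8)  # 8 experts (e.g. a Mixtral-8x7B layer)
logits_after = torch.randn(num_tokens, 6)   # 6 experts after merging two groups
freq_before = expert_visit_counts(logits_before) / num_tokens
freq_after = expert_visit_counts(logits_after) / num_tokens
print(freq_before, freq_after)
```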

Implications and Future Work

The implications of this research are significant for the deployment of LLMs in resource-constrained environments. By efficiently pruning redundant experts, this method paves the way for more practical and scalable applications of LLMs without substantial performance degradation. Future research could explore adaptive pruning strategies that dynamically adjust the number of experts based on task requirements and computational constraints.

Conclusion

The proposed task-agnostic pruning method effectively addresses the challenge of memory consumption in sparse MoE architectures. By discovering and merging similar experts, this approach not only reduces memory usage but also maintains high performance across various tasks. This contribution is valuable for enhancing the practicality of deploying large-scale models in diverse settings.
