- The paper introduces an adaptive serving strategy that uses partial expert quantization to tune performance and reduce memory usage.
- It demonstrates a tunable trade-off between token generation throughput and output quality, with only minimal perplexity increases, using the Mixtral 8x7B MoE model.
- The proposed method partitions inference between GPU and CPU, enabling dynamic resource adaptation in constrained environments.
Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
Introduction
Mixture-of-Experts (MoE) architectures have become a key ingredient for improving performance on NLP tasks. These models route each token through a small subset of many parallel feed-forward (FF) layers, known as experts, which lets them scale to very large parameter counts and outperform dense models through increased specialization. However, deploying these models for inference is challenging because of their large memory and computational demands. This paper addresses these issues by introducing an adaptive serving approach for deploying MoE models in resource-constrained environments. The method leverages partial quantization of the experts and adapts to varying user-defined constraints and fluctuating resources. It is demonstrated on the Mixtral 8x7B MoE model, showing adjustable token generation throughput with minimal increases in perplexity across several language modeling datasets.
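For readers unfamiliar with the architecture, the following is a minimal PyTorch sketch of a sparse MoE layer: a learned router selects the top-k of several parallel FF experts for each token. The dimensions, activation, and top-k value are illustrative defaults (Mixtral itself uses 8 experts per layer with top-2 routing), not the exact Mixtral implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sparse MoE layer: a learned router sends each token to the
    top-k of several parallel feed-forward experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.router(x)                 # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e           # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Toy usage: 16 tokens of width 512.
layer = MoELayer()
print(layer(torch.randn(16, 512)).shape)        # torch.Size([16, 512])
```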
Adaptive Inference Partitioner and Planner
The adaptive serving approach introduces a partitioning and planning system that adjusts how MoE models are deployed in multi-tenant environments, where resource allocation is dynamic and constraints fluctuate. The system manages GPU memory by combining partial expert quantization with dynamic allocation of experts between GPU and CPU. This makes it possible to trade computational throughput against output quality, as evidenced by the ability to tune token generation rates from 0.63 to 13.00 tokens per second, depending on available memory and user preferences.
Figure 1: An adaptive inference partitioner and planner for deployment of MoE models.
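The paper's planner is not reproduced here; the sketch below shows one possible greedy policy in the same spirit: given a GPU memory budget, decide how many experts stay in 16-bit on the GPU, how many are quantized to 4-bit, and how many are offloaded to the CPU. The per-expert sizes and the fp16-first preference are assumptions made for illustration, not the authors' actual algorithm.

```python
from dataclasses import dataclass

# Rough per-expert weight sizes for a Mixtral 8x7B-scale model
# (illustrative assumptions, not measurements from the paper).
BYTES_FP16_EXPERT = 0.176e9 * 2.0   # ~176M params per expert at 2 bytes each
BYTES_INT4_EXPERT = 0.176e9 * 0.5   # same expert quantized to 4 bits

@dataclass
class Plan:
    gpu_fp16: int     # experts kept on the GPU in 16-bit
    gpu_int4: int     # experts quantized to 4-bit and kept on the GPU
    cpu_offload: int  # experts left on the CPU (the slowest path)

def plan_experts(total_experts: int, gpu_budget_bytes: float) -> Plan:
    """Greedy policy: fill the GPU budget with 16-bit experts first, then with
    4-bit experts, and offload whatever does not fit to the CPU."""
    fp16 = min(total_experts, int(gpu_budget_bytes // BYTES_FP16_EXPERT))
    remaining = gpu_budget_bytes - fp16 * BYTES_FP16_EXPERT
    int4 = min(total_experts - fp16, int(remaining // BYTES_INT4_EXPERT))
    return Plan(gpu_fp16=fp16, gpu_int4=int4,
                cpu_offload=total_experts - fp16 - int4)

# Example: 40 GB of GPU memory reserved for the 256 experts.
print(plan_experts(256, 40e9))
```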
Evaluation
Experimental Setup
Evaluation uses the Mixtral 8x7B MoE model and benchmarks perplexity on three datasets: WikiText2, PTB, and C4. Tests run on an NVIDIA A100 GPU, with particular emphasis on memory management and expert quantization, to assess how well the approach preserves output quality while minimizing memory usage.
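A perplexity evaluation along these lines could be sketched as below, using the Hugging Face transformers and datasets libraries with non-overlapping 2048-token windows; the checkpoint name, window size, and windowing scheme are simplifying assumptions, not the paper's exact protocol.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mixtral-8x7B-v0.1"   # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the WikiText2 test split and tokenize it once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

window, nll_sum, n_tokens = 2048, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start:start + window].to(model.device)
        if chunk.size(1) < 2:                   # nothing left to predict
            break
        out = model(chunk, labels=chunk)        # labels are shifted internally
        n = chunk.size(1) - 1                   # tokens actually predicted
        nll_sum += out.loss.item() * n
        n_tokens += n

print(f"WikiText2 perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```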
Results
The results, demonstrated across various configurations, illustrate the trade-off between memory usage and model output quality. Partial quantization yields a significant reduction in memory footprint with only small losses in output quality, as measured by perplexity. Perplexity increases gradually with the number of 4-bit experts, quantifying the cost of partial expert quantization and showing how far it can be pushed before model effectiveness degrades noticeably.


Figure 2: Perplexity of the expert-only partially quantized model across varying numbers of 4-bit experts (out of a total of 256 experts).
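To make the memory side of this trade-off concrete, the following back-of-the-envelope sketch estimates the weight footprint as a function of how many experts are quantized to 4-bit. The per-expert and shared-parameter sizes are rough approximations for a Mixtral 8x7B-scale model, not numbers taken from the paper.

```python
# Back-of-the-envelope weight footprint as a function of 4-bit experts
# (all sizes are rough assumptions for a Mixtral 8x7B-scale model).
TOTAL_EXPERTS  = 256
PARAMS_PER_EXP = 0.176e9          # ~45B expert parameters spread over 256 experts
SHARED_BYTES   = 1.6e9 * 2.0      # ~1.6B non-expert parameters kept in 16-bit

def footprint_gb(n_int4_experts: int) -> float:
    """Approximate weight memory when n_int4_experts use 4-bit weights and the
    remaining experts stay in 16-bit."""
    fp16_bytes = (TOTAL_EXPERTS - n_int4_experts) * PARAMS_PER_EXP * 2.0
    int4_bytes = n_int4_experts * PARAMS_PER_EXP * 0.5
    return (SHARED_BYTES + fp16_bytes + int4_bytes) / 1e9

for n in (0, 64, 128, 192, 256):
    print(f"{n:3d} experts in 4-bit -> ~{footprint_gb(n):.0f} GB of weights")
```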
Moreover, the throughput analysis shows how efficient expert offloading can substantially improve performance under constrained GPU memory. Across the tested configurations, throughput grows roughly hyperbolically as available memory and the number of quantized experts increase, because fewer experts have to be served from the CPU.

Figure 3: Throughput of an expert-only partially quantized Mixtral 8x7B MoE model running on an NVIDIA A100 GPU under different amounts of available memory.
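The shape of that relationship can be illustrated with a toy latency model: if per-token latency grows roughly linearly with the number of experts served from the CPU, throughput falls off hyperbolically as more experts are offloaded. The constants below are assumptions chosen only so the curve roughly spans the 0.63 to 13.00 tokens-per-second range reported above; they are not measurements.

```python
# Toy latency model for expert offloading (constants are assumptions chosen only
# so the curve roughly spans the 0.63-13.00 tokens/s range reported above).
GPU_MS_PER_TOKEN  = 77.0   # assumed per-token compute with all experts on the GPU
CPU_MS_PER_EXPERT = 25.0   # assumed extra per-token cost for each CPU-resident expert

def tokens_per_second(experts_offloaded: int) -> float:
    """If per-token latency grows linearly with the number of offloaded experts,
    throughput falls off hyperbolically as offloading increases."""
    latency_ms = GPU_MS_PER_TOKEN + CPU_MS_PER_EXPERT * experts_offloaded
    return 1000.0 / latency_ms

for n in (0, 4, 16, 64):
    print(f"{n:3d} offloaded experts -> ~{tokens_per_second(n):.2f} tokens/s")
```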
Conclusion
This paper presents an adaptive serving strategy that addresses key deployment challenges for MoE models in constrained environments. By employing a mixture of precisions through partial expert quantization, it enables fine-grained control over performance metrics such as throughput and perplexity, which is valuable in dynamic settings with shifting resource constraints. These findings are relevant for practical applications with limited computational resources and point to further research and refinement in deployment methodologies for large-scale NLP systems.