Mixtral of Experts (2401.04088v1)

Published 8 Jan 2024 in cs.LG and cs.CL

Abstract: We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) LLM. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

Citations (737)

Summary

  • The paper introduces Mixtral 8x7B, an SMoE model with 47B total parameters of which only 13B are active per token at inference.
  • The paper employs a Transformer architecture with dynamic Top-K routing that processes each token through a small subset of expert feedforward modules.
  • The paper demonstrates superior performance on mathematics, code generation, and multilingual tasks, along with strong long-range retrieval and instruction-following capabilities.

Mixtral of Experts: A Comprehensive Analysis

This paper introduces Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) LLM that outperforms or matches prominent models such as Llama 2 70B and GPT-3.5 across a wide range of benchmarks, excelling in particular on mathematics, code generation, and multilingual tasks. The SMoE architecture enables fast inference and efficient parameter utilization, pointing to promising directions for future model architectures.

Architectural Overview

Mixtral employs a Mixture of Experts (MoE) layer within its Transformer-based architecture, which improves computational efficiency by dynamically routing input tokens through selected expert modules. At each layer, a router network assigns each token to two of eight experts, giving the model access to 47B parameters while activating only 13B parameters per inference step.

Figure 1: Mixture of Experts Layer. Each input vector is assigned to 2 of the 8 experts by a router.

Each expert is a feedforward block akin to those in a standard Transformer. The gating network computes one logit per expert with a linear layer, selects the Top-K (here K=2) experts for each token, and applies a softmax over the selected logits to weight the experts' outputs, so only a small subset of experts is activated per token. This topology allows the model to scale its parameter count significantly without a commensurate increase in per-token compute.
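
As a rough illustration of this routing scheme, the following is a minimal PyTorch-style sketch (not the released implementation; the dimensions, module layout, and plain-MLP experts are simplifying assumptions, whereas the actual model uses gated SwiGLU-style feedforward blocks). A linear gating layer keeps the top-2 logits per token, normalizes them with a softmax, and sums the weighted outputs of the selected experts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal sketch of a top-2 sparse mixture-of-experts feedforward layer."""

    def __init__(self, dim=4096, hidden=14336, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for every token.
        self.gate = nn.Linear(dim, n_experts, bias=False)
        # Experts shown as plain two-layer MLPs for brevity (an assumption).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (n_tokens, dim)
        logits = self.gate(x)                      # (n_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)      # softmax over the selected logits only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            chosen = top_idx[:, slot]              # expert index per token for this slot
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = chosen == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

Because the top-2 experts can differ from token to token and layer to layer, the total parameter count grows with the number of experts while the per-token compute stays close to that of a dense model with two feedforward blocks.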

Performance Evaluation

Mixtral's efficacy is demonstrated across various competitive benchmarks, where it either matches or exceeds competitors such as Llama 2 70B, especially excelling in mathematics and code generation tasks. The model achieves this with a reduced number of active parameters, effectively lowering the computational burden.

Figure 2: Performance of Mixtral and different Llama models on a wide range of benchmarks.

Detailed performance analysis illustrates Mixtral's advantages in multilingual capabilities and context retention, maintaining a high degree of accuracy across diverse tasks and languages. This was further substantiated through comparisons with GPT-3.5 and Llama 2 70B, where Mixtral showed superior efficiency in parameter utilization and task execution.

Long-Range and Instructional Abilities

Mixtral's long-range capabilities were analyzed using the passkey retrieval task, yielding 100% retrieval accuracy regardless of the passkey's position or sequence length. This highlights the architecture's effective context management over extensive input sequences.

Figure 3: Long range performance of Mixtral.
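
The general shape of such a passkey evaluation can be sketched as follows (a hedged illustration of the task setup, not the paper's exact protocol; the filler sentences, prompt wording, and `model.generate` call are assumptions):

```python
def build_passkey_prompt(passkey: str, n_filler: int, insert_at: int) -> str:
    """Hide a passkey inside long filler text and ask the model to recall it."""
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    lines = [filler] * n_filler
    lines.insert(insert_at, f"The pass key is {passkey}. Remember it. ")
    context = "".join(lines)
    return (
        "There is a pass key hidden in the following text. Memorize it.\n"
        + context
        + "\nWhat is the pass key?"
    )

prompt = build_passkey_prompt(passkey="71432", n_filler=2000, insert_at=500)
# response = model.generate(prompt)   # hypothetical call to the model under test
# success = "71432" in response       # retrieval counts as correct if the key is reproduced
```

Varying `insert_at` and `n_filler` probes whether accuracy depends on the passkey's position and on the total sequence length, which is the axis along which Mixtral reportedly stays at 100%.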

Moreover, Mixtral 8x7B - Instruct, a version fine-tuned to follow instructions, is placed ahead of models such as Claude-2.1 and GPT-3.5 Turbo on human evaluation benchmarks. This was achieved through Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO), improving the model's ability to follow directives.
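
For context, DPO optimizes the model directly on preference pairs rather than training a separate reward model. Below is a minimal sketch of the standard DPO loss, assuming summed per-sequence log-probabilities under the policy and a frozen reference model are already available (the variable names and beta value are illustrative, not taken from the paper):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on summed per-sequence log-probabilities."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```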

Expert Routing and Specialization

In-depth analyses reveal that the expert routers demonstrate a surprising lack of domain specialization, with experts being assigned based on syntactic rather than semantic content, as evidenced through various test datasets.

Figure 4: Proportion of tokens assigned to each expert on different domains from The Pile dataset for layers 0, 15, and 31.

The observed patterns suggest structured syntactic behavior rather than thematic specialization, with the same experts repeatedly handling particular token sequences based on their structural properties.
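
An analysis of this kind can be approximated by tallying, for each domain, which experts the router selects per token. The sketch below assumes the router logits of a single layer are available as a tensor and reuses the top-2 convention described above:

```python
import torch

def expert_assignment_proportions(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts) for one layer and one domain.
    Returns the fraction of routed token slots handled by each expert."""
    n_experts = router_logits.shape[-1]
    top_idx = router_logits.topk(top_k, dim=-1).indices   # (n_tokens, top_k)
    counts = torch.bincount(top_idx.flatten(), minlength=n_experts).float()
    return counts / counts.sum()

# Comparing these per-expert proportions across domains (e.g. code vs. prose)
# indicates whether particular experts are favored for particular content.
```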

Conclusion

Mixtral 8x7B represents a notable advancement in SMoE architectures, achieving competitive performance with reduced active parameters and efficient routing strategies. Its open-source release under the Apache 2.0 license is expected to foster broader application and further research innovation in sparse computation techniques. This paper underscores the potential these models hold for the development of scalable, efficient machine learning solutions capable of addressing diverse and demanding AI challenges.
