
Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

DeepSeek-V2 architecture uses MLA for efficient inference and DeepSeekMoE for cost-effective, strong model training.

Overview

  • DeepSeek-V2 introduces a sophisticated language model with a focus on reducing training costs and improving inference efficiencies using a Mixture-of-Experts (MoE) framework. The model manages a generous context length of 128K tokens while activating only 21 billion out of 236 billion parameters per token.

  • The model features an innovative Multi-head Latent Attention (MLA) mechanism that reduces memory requirements during inference, thus enhancing the system's throughput and scalability. Additionally, the DeepSeekMoE architecture optimizes expert segmentation and routing in Feed-Forward Networks (FFNs) for more economical and potent training.

  • DeepSeek-V2 significantly surpasses its predecessor and other open-source competitors across various benchmarks, cutting training costs by 42.5% and boosting maximum generation throughput to 5.76 times that of DeepSeek 67B, with strong results in both English and Chinese.

Deep Dive into DeepSeek-V2: A Boost in Model Efficiency and Performance

Introduction to DeepSeek-V2

DeepSeek-V2 introduces a sophisticated advancement in language models, specifically tackling the challenges around training costs and inference efficiency that many existing LLMs face. This model, harboring a whopping 236 billion parameters, of which only 21 billion are activated for each token, leverages a Mixture-of-Experts (MoE) framework to offer not just economical training but also efficient inference, all while supporting a generous context length of 128K tokens.
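To make the sparse-activation claim concrete, here is a rough back-of-the-envelope calculation. The "~2 FLOPs per active parameter per token" rule of thumb is a standard approximation for transformer forward passes, not a figure from the paper:

```python
# Rough back-of-the-envelope: why activating 21B of 236B parameters matters.
# The ~2 FLOPs per active parameter per token rule is a common approximation,
# not a number reported in the DeepSeek-V2 paper.

TOTAL_PARAMS = 236e9       # total parameters in DeepSeek-V2
ACTIVE_PARAMS = 21e9       # parameters activated per token via MoE routing

activation_ratio = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token_sparse = 2 * ACTIVE_PARAMS   # approx. forward-pass compute
flops_per_token_dense = 2 * TOTAL_PARAMS     # hypothetical dense model of the same size

print(f"activation ratio: {activation_ratio:.1%}")                                  # ~8.9%
print(f"approx. compute saving vs dense: {flops_per_token_dense / flops_per_token_sparse:.1f}x")
```

In other words, each token pays roughly the compute bill of a ~21B dense model while the network retains the capacity of a much larger one.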

Architectural Innovations

Multi-head Latent Attention (MLA)

The standout feature of DeepSeek-V2 is its innovative attention mechanism known as Multi-head Latent Attention (MLA). This mechanism significantly reduces the inference-time Key-Value (KV) cache, a notorious bottleneck for traditional Multi-Head Attention (MHA) systems. MLA employs low-rank key-value joint compression, meaning it needs less memory for keys and values during inference, effectively boosting the maximum batch size and throughput:

  • Attention Cache Efficiency: MLA cuts the required KV cache by 93.3% relative to DeepSeek 67B, a significant stride toward making large models manageable and practical to deploy; a minimal sketch of the compression idea appears below.
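To illustrate the idea behind MLA's low-rank joint compression, here is a minimal PyTorch sketch. It is not DeepSeek-V2's implementation: the decoupled rotary position embeddings, causal masking, and exact dimensions from the paper are omitted, and the layer names and sizes are illustrative. The point is that only the small latent `c_kv` needs to be cached per token rather than full per-head keys and values.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Sketch of low-rank joint key-value compression (MLA-style).

    Only the latent c_kv is cached per token; keys and values are
    re-expanded from it at attention time. Causal masking and the
    paper's decoupled rotary embeddings are omitted for brevity.
    """

    def __init__(self, d_model=5120, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress to latent
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand keys
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand values
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, h, kv_cache=None):
        B, T, _ = h.shape
        q = self.w_q(h).view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        c_kv = self.w_down_kv(h)                       # (B, T, d_latent): this is all that gets cached
        if kv_cache is not None:
            c_kv = torch.cat([kv_cache, c_kv], dim=1)  # append new latents to the cache

        k = self.w_up_k(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.w_o(out), c_kv                     # return output and updated latent cache
```

At decode time the cache grows by `d_latent` floats per token instead of `2 * n_heads * d_head`, which is where the order-of-magnitude reduction in KV memory comes from.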

DeepSeekMoE: Economical and Potent Training

The model adopts the DeepSeekMoE architecture for its Feed-Forward Networks (FFNs), which emphasizes expert segmentation for refined knowledge specialization and optimizes routing to balance training loads efficiently. The architecture allows DeepSeek-V2 to outperform other MoE models significantly:

  • Expert Utilization: With finely segmented experts and load-balancing mechanisms that control how tokens are routed, the model avoids the wasted capacity that unbalanced routing often causes in complex MoE systems; a simplified routing sketch appears below.
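The sketch below shows the general shape of a shared-plus-routed expert FFN with top-k routing, written in plain PyTorch. Expert counts, hidden sizes, and the auxiliary load-balancing losses DeepSeekMoE relies on are simplified or omitted; all numbers and names here are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    """Sketch of a shared-plus-routed expert FFN with top-k routing.

    A few always-on shared experts process every token, while each token
    is additionally dispatched to its top-k routed experts. Auxiliary
    load-balancing losses and efficient dispatch kernels are omitted.
    """

    def __init__(self, d_model=5120, d_expert=1536, n_shared=2, n_routed=160, top_k=6):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                                 nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        out = sum(e(x) for e in self.shared)     # shared experts see every token
        scores = F.softmax(self.router(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):           # dispatch tokens to their selected experts
            idx = topk_idx[:, slot]
            gate = topk_scores[:, slot].unsqueeze(-1)
            for e_id in idx.unique():
                mask = idx == e_id
                out[mask] += gate[mask] * self.routed[e_id](x[mask])
        return out
```

In a production system the Python routing loop would be replaced by batched gather/scatter kernels, and the router would be trained with auxiliary losses that keep the token load even across experts and devices.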

Surpassing Benchmarks

DeepSeek-V2 doesn't just impress in theory; it also outshines competitors empirically. It demonstrates top-tier performance across various benchmarks in both English and Chinese, clearly surpassing its predecessor DeepSeek 67B while saving 42.5% in training costs, and its maximum generation throughput rises to 5.76 times that of the earlier model.

Implications and Future Directions

The introduction of DeepSeek-V2 opens several pathways and considerations for future AI developments:

  • Balancing Cost and Performance: The techniques utilized in DeepSeek-V2, from sparse activation to efficient attention mechanisms, provide a blueprint for developing powerful yet cost-effective LLMs.
  • Cross-Linguistic Capabilities: Its prowess in handling both English and Chinese languages at scale indicates a promising direction for creating multilingual models without compromising on performance.
  • Potential in Real-World Applications: The remarkable context length support and the reduced computational overhead make DeepSeek-V2 a robust candidate for integration into complex AI systems, from automated chatbots to intricate analytical tools.

Concluding Thoughts

DeepSeek-V2 is a compelling iteration in the evolution of language models, emphasizing efficiency without sacrificing the breadth and depth of linguistic understanding. While it stands as a milestone, the ongoing challenge remains in further refining these systems to balance performance, cost, and energy consumption, which are critical in the scalable deployment of AI technologies.
