
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

(arXiv: 2406.11931)
Published Jun 17, 2024 in cs.SE, cs.AI, and cs.LG

Abstract

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

Figure: DeepSeek-Coder-V2's performance on math and code benchmarks.

Overview

  • DeepSeek-Coder-V2 is a state-of-the-art Mixture-of-Experts (MoE) code language model, further pre-trained on an additional 6 trillion tokens, that demonstrates marked improvements in coding and mathematical reasoning.

  • The model posts strong results on standard benchmarks, including HumanEval and MBPP, rivaling closed-source models such as GPT4-Turbo and Claude 3 Opus.

  • Publicly released under a permissive license, DeepSeek-Coder-V2 offers extensive multilingual support, enhancing both academic and commercial applications in code intelligence.

DeepSeek-Coder-V2: Advancing Open-Source Models in Code Intelligence

Overview

The paper introduces DeepSeek-Coder-V2, a cutting-edge Mixture-of-Experts (MoE) code language model that aims to bridge the performance gap between open-source and closed-source models in code intelligence. DeepSeek-Coder-V2 is further pre-trained from an intermediate DeepSeek-V2 checkpoint on an additional 6 trillion tokens, yielding substantial gains in coding and mathematical reasoning while maintaining competitive performance on general language tasks.

Contributions and Key Findings

DeepSeek-Coder-V2, particularly its 236B-parameter variant, exhibits performance comparable to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro. Key components and contributions outlined in the paper include:

Enhanced Pre-Training Dataset:

  • The pre-training corpus consists of 60% source code, 10% math-related tokens, and 30% natural-language text, sourced from GitHub, CommonCrawl, and other relevant datasets (a sampling sketch follows this list).
  • The corpus expands language support from 86 to 338 programming languages, increasing the model's versatility.
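The exact data pipeline is not described in this summary; below is a minimal sketch, in Python, of how a 60/10/30 code/math/natural-language mixture could be sampled during training. The weights come from the bullet above; the function name and source labels are illustrative assumptions, not the paper's implementation.

```python
import random

# Mixture weights reported above: 60% source code, 10% math, 30% natural language.
MIXTURE = {
    "source_code": 0.60,
    "math": 0.10,
    "natural_language": 0.30,
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from,
    proportionally to the mixture weights."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_source(rng) for _ in range(10_000)]
    for name in MIXTURE:
        print(name, round(draws.count(name) / len(draws), 3))  # ~0.6 / 0.1 / 0.3
```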

Technical Specifications:

  • DeepSeek-Coder-V2 is built on the DeepSeekMoE framework and is released in two sizes: 16B total parameters (2.4B active) and 236B total parameters (21B active); a sketch of the top-k routing behind these total/active counts follows this list.
  • The context length is extended to 128K tokens, facilitating the handling of more complex code-related tasks.
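The paper's DeepSeekMoE routing internals are not reproduced here; the following is a minimal, generic sketch of top-k expert routing, which is the mechanism behind the "total vs. active parameters" distinction, since only the selected experts run for each token. The dimensions, names, and routing function are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def top_k_route(hidden: np.ndarray, router_w: np.ndarray, k: int = 2):
    """Generic top-k MoE routing sketch.

    hidden:   (d_model,) token representation
    router_w: (n_experts, d_model) router projection
    Returns the selected expert indices and their normalized gate weights.
    Because only k experts execute per token, a model with a large total
    parameter count activates only a small fraction of it per token.
    """
    logits = router_w @ hidden                   # (n_experts,) routing scores
    top = np.argsort(logits)[-k:]                # indices of the k largest scores
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                  # softmax over the selected experts only
    return top, gates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, n_experts = 16, 8                   # toy sizes for illustration
    hidden = rng.standard_normal(d_model)
    router_w = rng.standard_normal((n_experts, d_model))
    experts, gates = top_k_route(hidden, router_w, k=2)
    print("experts:", experts, "gates:", np.round(gates, 3))
```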

Key Performance Metrics:

  • DeepSeek-Coder-V2 achieves 90.2% on HumanEval, 76.2% on MBPP, and 43.4% on LiveCodeBench, setting new marks for open-source code models (the pass@k metric behind such scores is sketched after this list).
  • In mathematical reasoning, it scores 75.7% on MATH, nearly matching the state of the art set by GPT-4o, and performs strongly on the competition-level AIME 2024 benchmark.
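The coding numbers above are pass rates; a standard way to compute such scores is the unbiased pass@k estimator introduced with the HumanEval benchmark. The sketch below shows that estimator in its general form as an illustration, not the paper's exact evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (introduced with HumanEval).

    n: total samples generated for a problem
    c: number of samples that pass the unit tests
    k: number of samples we are allowed to submit
    Returns the probability that at least one of k samples passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    # Illustrative values: 200 samples per problem, 63 of them pass.
    print(round(pass_at_k(n=200, c=63, k=1), 3))  # 0.315
```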

Implications

The research has several important implications:

Practical Developments:

  • By publicly releasing DeepSeek-Coder-V2 models under a permissive license, the paper makes significant contributions to the accessibility of high-performance code intelligence tools for both academic and commercial purposes.
  • The extensive support for a wide array of programming languages and the increased context length enhance the applicability across several programming scenarios, from simple script completion to complex, multi-file repository management.

Theoretical Advancements:

  • The paper discusses the architectural features and training strategies that contribute to the model's performance, offering insights that may inform the design and training of future MoE models.
  • The practical use of reinforcement learning via Group Relative Policy Optimization (GRPO), together with a Fill-In-the-Middle (FIM) training objective, demonstrates effective methods for aligning LLMs with human preferences and improving their code completion capabilities (minimal sketches of both ideas follow this list).
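Neither technique's training loop is spelled out in this summary; the sketch below isolates the two core ideas: GRPO's group-relative advantage (each sampled response's reward is standardized against its own group, with no learned critic) and a PSM-style Fill-In-the-Middle data transform. The sentinel token strings and function names are placeholders, not the model's actual vocabulary or the paper's code.

```python
import random
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages in the spirit of GRPO: score a group of
    sampled responses to the same prompt, then standardize each reward
    against the group mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0      # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Placeholder sentinel strings; the real tokenizer uses dedicated special tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_fim_example(code: str, rng: random.Random) -> str:
    """PSM-style Fill-In-the-Middle transform: split a document into
    prefix / middle / suffix and train the model to produce the middle
    after seeing both the prefix and the suffix."""
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

if __name__ == "__main__":
    print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))
    print(to_fim_example("def add(a, b):\n    return a + b\n", random.Random(0)))
```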

Future Directions

While DeepSeek-Coder-V2 demonstrates substantial improvements, the paper identifies areas for further research:

  • Enhancement of Instruction-Following Capabilities: Despite its strong benchmark performance, DeepSeek-Coder-V2 has room for improvement on instruction-following tasks, which are essential for more complex real-world programming scenarios.
  • Balanced Training for General Language Tasks: Emphasis on maintaining or improving natural language understanding and generation capabilities alongside specialized code intelligence.

Conclusion

DeepSeek-Coder-V2 represents a significant advancement in open-source code intelligence models, underscoring the potential for open-source initiatives to rival, and even surpass, their closed-source counterparts in certain specialized tasks. The continued focus on refining training datasets, architectural decisions, and reinforcement learning alignment methods holds promise for future developments in this domain, bringing us closer to robust, general-purpose AI systems capable of tackling a broad spectrum of programming and reasoning tasks. Finally, the implications for commercial and academic use cases are profound, opening up new avenues for innovation and efficiency in software development practices.
