- The paper demonstrates that continued pre-training of a Mixture-of-Experts model to a cumulative 10.2 trillion tokens significantly boosts open-source code intelligence performance.
- It details a dual-objective training strategy using Next-Token-Prediction and Fill-In-the-Middle (FIM), achieving high scores on benchmarks such as HumanEval and MBPP+.
- The model leverages a diverse corpus of source code, math, and language with extended context lengths, narrowing the gap with top closed-source models.
DeepSeek-Coder-V2 is an open-source Mixture-of-Experts (MoE) LLM designed for code intelligence, aiming to bridge the performance gap with state-of-the-art closed-source models like GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro. The model is built upon the DeepSeek-V2 architecture and undergoes further pre-training on an additional 6 trillion tokens, bringing the cumulative pre-training corpus to 10.2 trillion tokens. This continued pre-training significantly boosts its capabilities in coding and mathematical reasoning while preserving general language performance.
The pre-training dataset for DeepSeek-Coder-V2 consists of a multi-source corpus with a composition of 60% source code, 10% math, and 30% natural language. The source code corpus comprises 1,170 billion tokens collected from GitHub and CommonCrawl, expanding the coverage from 86 to 338 programming languages compared to the previous DeepSeek-Coder model (2401.14196). The math corpus includes 221 billion tokens from CommonCrawl. The natural language data is sampled from the DeepSeek-V2 training corpus (2405.04434).
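As a rough illustration of how a fixed-ratio corpus mixture like this can be sampled during pre-training, the sketch below draws each training sequence's domain according to the reported 60/10/30 weights. The sampler itself (the function names and the use of Python's `random.choices`) is a hypothetical illustration, not the paper's data pipeline.

```python
import random

# Reported corpus composition for DeepSeek-Coder-V2 pre-training.
MIXTURE = {"code": 0.60, "math": 0.10, "natural_language": 0.30}

def sample_domain(rng: random.Random) -> str:
    """Pick which corpus the next training sequence is drawn from,
    proportionally to the mixture weights (hypothetical sampler)."""
    domains, weights = zip(*MIXTURE.items())
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {d: 0 for d in MIXTURE}
for _ in range(10_000):
    counts[sample_domain(rng)] += 1
print(counts)  # roughly 6000 code / 1000 math / 3000 natural_language
```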
DeepSeek-Coder-V2 comes in two sizes based on the DeepSeekMoE framework (2401.06066): a 16 billion total parameter version (Lite) with 2.4 billion active parameters and a 236 billion total parameter version with 21 billion active parameters. The MoE architecture allows for efficient inference by activating only a subset of parameters per token.
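A minimal sketch of generic top-k expert routing, the mechanism that lets a sparse MoE layer activate only a few experts per token. The layer sizes, `top_k` value, and router design below are illustrative assumptions; they do not reproduce the DeepSeekMoE routing, which additionally uses shared experts and fine-grained expert segmentation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k Mixture-of-Experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is routed to its top-k experts only,
        # so most expert parameters stay inactive for any given token.
        scores = F.softmax(self.router(x), dim=-1)             # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)         # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512])
```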
The training strategy involves two objectives for the 16B Lite model: Next-Token-Prediction and Fill-In-Middle (FIM) using the PSM (Prefix, Suffix, Middle) mode at a rate of 0.5. The 236B model uses only the Next-Token-Prediction objective. Training utilizes the AdamW optimizer (1711.05101) with cosine learning rate decay and warm-up steps. The context length is extended from 16K to 128K tokens using the Yarn method (2309.00071), involving a two-stage training process with increasing sequence lengths. Evaluations using the Needle In A Haystack (NIAH) test confirm strong performance across the extended context window.
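A hedged sketch of how a PSM-mode FIM training example is typically constructed at a 0.5 rate: a document is split into prefix, middle, and suffix, then rearranged so the model learns to generate the middle given the surrounding context. The sentinel token strings and split logic below are illustrative assumptions, not the model's actual special tokens or data pipeline.

```python
import random

# Illustrative sentinel tokens; the actual special tokens are model-specific.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"
FIM_RATE = 0.5  # fraction of documents converted to FIM examples

def to_training_example(document: str, rng: random.Random) -> str:
    """With probability FIM_RATE, rewrite a document into PSM order
    (Prefix, Suffix, Middle); otherwise keep plain next-token text."""
    if rng.random() >= FIM_RATE:
        return document
    # Split the document into prefix / middle / suffix at two random points.
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # PSM layout: the model sees prefix and suffix, then learns to emit the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(1)
print(to_training_example("def add(a, b):\n    return a + b\n", rng))
```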
Alignment to human preferences and instruction following is achieved through a two-phase process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The SFT dataset combines code, math, and general instruction data. For RL, the Group Relative Policy Optimization (GRPO) algorithm (2402.03300, 2401.06066) is employed. Preference data for RL includes compiler feedback and test cases for code, ground-truth labels for math, and general instruction data. A reward model is trained on the compiler feedback data to provide a more robust training signal than raw compiler output.
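A minimal sketch of the group-relative advantage computation that gives GRPO its name: each sampled completion for a prompt is scored against the other completions in its group, which removes the need for a separate critic network. The reward values and group size below are made up, and the full GRPO objective (clipped policy ratio, KL regularization) is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each completion relative to its group:
    advantage = (reward - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 completions of one prompt
# (e.g. from a reward model trained on compiler/test-case feedback).
rewards = [1.0, 0.0, 0.5, 0.0]
print(group_relative_advantages(rewards))
```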
DeepSeek-Coder-V2 demonstrates competitive performance across various benchmarks:
- Code Generation: The 236B Instruct model achieves a 90.2% score on HumanEval (2107.03374) and 76.2% on MBPP+ [evalplus], positioning it competitively with top closed-source models and setting a new state-of-the-art for open-source models on MBPP+. It also performs strongly on multilingual HumanEval, LiveCodeBench (tying GPT-4o's overall score of 43.4%), and USACO. The 16B Lite Instruct model also performs well, often surpassing larger open-source counterparts.
- Code Completion: Evaluated on the December subset of RepoBench v1.1 (2312.08932) and Single-Line Infilling tasks. The 16B Lite Base model, despite having only 2.4B active parameters, shows code completion capabilities comparable to much larger models like DeepSeek-Coder-Base 33B on Python and DeepSeek-Coder-Base 7B on Java. Its FIM training contributes to a high mean score (86.4%) on Single-Line Infilling, comparable to or better than other larger models.
- Code Fixing: Tested on Defects4J, SWE-Bench (2310.06770), and Aider benchmarks. The 236B Instruct model shows strong results, achieving 21.0% on Defects4J, 12.7% on SWE-Bench, and an impressive 73.7% on Aider, surpassing all other models tested on Aider.
- Code Understanding and Reasoning: Assessed using CRUXEval (2401.03065). The 236B Instruct model is the top open-source performer but shows a performance gap compared to the best closed-source models, potentially linked to its lower number of active parameters.
- Mathematical Reasoning: Evaluated on GSM8K [gsm8k], MATH (2103.03874), AIME 2024 [AIME], and Math Odyssey [netmindmath] using zero-shot chain-of-thought prompting (see the sketch after this list). The 236B Instruct model achieves 75.7% on MATH and 53.7% on Math Odyssey, comparable to GPT-4o, and solves more AIME 2024 problems than the other tested models, highlighting strong mathematical capabilities.
- General Natural Language: Maintains strong general language performance, often outperforming DeepSeek-V2 on reasoning-heavy benchmarks like BBH (2210.09261) and Arena-Hard [arenahard2024], although it may trail slightly on knowledge-intensive tasks due to corpus differences.
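As a rough illustration of the zero-shot chain-of-thought setup used for the math benchmarks above, the sketch below builds a CoT prompt and extracts a boxed final answer from the model's output. The prompt wording and the `\boxed{}` extraction regex are assumptions, not the paper's exact evaluation harness.

```python
import re

def build_cot_prompt(problem: str) -> str:
    """Zero-shot chain-of-thought prompt: no worked examples, just an
    instruction to reason step by step and box the final answer."""
    return (
        f"{problem}\n\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )

def extract_boxed_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of the model's reasoning."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

prompt = build_cot_prompt("What is the sum of the first 10 positive integers?")
fake_completion = "1 + 2 + ... + 10 = 10 * 11 / 2 = 55, so the answer is \\boxed{55}."
assert extract_boxed_answer(fake_completion) == "55"
```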
DeepSeek-Coder-V2 is released publicly under a permissive license, supporting research and unrestricted commercial use. While achieving performance comparable to top closed-source models on many benchmarks, the paper notes a remaining gap in instruction-following for complex real-world programming tasks like SWE-Bench, identifying this as a key area for future improvement.