DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (2401.14196v2)

Published 25 Jan 2024 in cs.SE, cs.CL, and cs.LG

Abstract: The rapid development of LLMs has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Authors (13)

Daya Guo (37 papers)
Qihao Zhu (27 papers)
Dejian Yang (11 papers)
Zhenda Xie (51 papers)
Kai Dong (15 papers)
Wentao Zhang (261 papers)
Guanting Chen (19 papers)
Xiao Bi (8 papers)
Y. Wu (640 papers)
Y. K. Li (16 papers)
Fuli Luo (23 papers)
Yingfei Xiong (33 papers)
Wenfeng Liang (9 papers)

Citations (397)

View on Semantic Scholar

Summary

The paper presents state-of-the-art open-source code LLMs that bridge the gap with proprietary models through extensive pre-training on 2 trillion tokens.
It employs rigorous data curation with dependency parsing, near-deduplication, and multi-lingual code sources across 87 programming languages to ensure high-quality training data.
Instruction tuning and advanced Transformer techniques such as RoPE and GQA enable robust performance on benchmarks like HumanEval, MBPP, and LeetCode contests.

This paper introduces DeepSeek-Coder, a series of open-source LLMs specifically designed for code-related tasks. The models range in size from 1.3B to 33B parameters and are trained from scratch on a massive 2 trillion token dataset. The primary motivation is to bridge the performance gap between powerful but proprietary closed-source models (like GPT-4) and existing open-source code models, making state-of-the-art code intelligence more accessible for research and commercial use.

Key Innovations and Training Details:

Data Curation: The training data comprises 87% source code (across 87 programming languages), 10% English code-related text (GitHub Markdown, StackExchange), and 3% Chinese text. A rigorous data processing pipeline was employed:
- Crawling & Filtering: Initial collection from public GitHub repositories followed by rule-based filtering (line length, alphabetic character ratio, etc.) similar to StarCoder, significantly reducing data volume.
- Dependency Parsing: A novel approach using topological sort (Algorithm 1) based on inferred dependencies (e.g., import, include) to structure code at the repository level. Files are concatenated in an order respecting dependencies, aiming to improve the model's understanding of project-level context. File path information is included as comments.
- Repo-Level Deduplication: Near-deduplication is applied to entire concatenated repositories (rather than individual files) to preserve structural integrity while removing redundancy.
- Quality Screening & Decontamination: Further filtering using compilers, quality models, and heuristics. N-gram filtering is used to remove potential contamination from benchmark datasets like HumanEval and MBPP.
Training Objectives:
- Next Token Prediction: Standard LLMing objective.
- Fill-in-the-Middle (FIM): Employed to enhance code completion and infilling capabilities. Ablation studies (Figure 3) led to choosing a 50% FIM rate using the PSM (Prefix-Suffix-Middle) mode over SPM or MSP, balancing infilling performance and code completion ability. FIM is applied at the document level before packing, using special sentinel tokens.
Model Architecture: Based on the DeepSeek LLM architecture, these are decoder-only Transformers using Rotary Position Embeddings (RoPE) and SwiGLU activations. The 33B model incorporates Grouped-Query Attention (GQA) for efficiency. FlashAttention v2 is used for faster attention computation. Specific hyperparameters (hidden size, layers, heads) are provided for each model size (Table 3).
Long Context: RoPE parameters were adjusted using linear scaling (factor 4, base frequency 100k) and additional training (1000 steps) to extend the effective context window to 16K tokens, beneficial for repository-level understanding.
Instruction Tuning: Base models were fine-tuned on high-quality, Alpaca-formatted instruction data to create the DeepSeek-Coder-Instruct variants, improving their ability to follow instructions and engage in dialogue (Figure 4).

Evaluation and Results:

DeepSeek-Coder models were evaluated extensively across various benchmarks:

Code Generation:
- HumanEval (Multilingual) & MBPP: DeepSeek-Coder-Base 33B achieved state-of-the-art results among open-source models, significantly outperforming CodeLlama-34B (Table 4). DeepSeek-Coder-Instruct 33B surpassed GPT-3.5-Turbo on HumanEval.
- DS-1000: Demonstrated strong performance in generating code for practical data science tasks across libraries like Pandas, NumPy, PyTorch, etc. (Table 5).
- LeetCode Contest: A new benchmark created from recent LeetCode problems. DeepSeek-Coder-Instruct 33B outperformed GPT-3.5-Turbo, showing strong problem-solving skills. Chain-of-Thought (CoT) prompting further improved performance on complex problems (Table 6).
Fill-in-the-Middle (FIM): On single-line infilling benchmarks, DeepSeek-Coder models (even the 1.3B version) showed competitive or superior performance compared to StarCoder and CodeLlama, validating the FIM training strategy (Table 7).
Cross-File Code Completion: Using the CrossCodeEval benchmark, DeepSeek-Coder-Base 6.7B outperformed similar-sized models, especially when augmented with retrieval. An ablation paper confirmed the benefit of repository-level pre-training (Table 8).
Program-based Math Reasoning: Using the PAL method on benchmarks like GSM8K and MATH, DeepSeek-Coder models showed strong mathematical reasoning capabilities when generating code to solve problems (Table 9).

DeepSeek-Coder-v1.5:

The paper also introduces DeepSeek-Coder-v1.5 7B, created by continuing the pre-training of a general DeepSeek-LLM 7B base model on a mix rich in code (70%), math, and code-related natural language. This version used only next-token prediction with a 4K context. Compared to the original DeepSeek-Coder 6.7B, v1.5 showed significantly improved natural language understanding and math reasoning performance, with only a minor dip in coding benchmarks (Table 11), suggesting that grounding code models in strong general-purpose LLMs is beneficial.

Conclusion:

DeepSeek-Coder models represent a significant advancement in open-source code LLMs, achieving state-of-the-art performance through careful data curation (especially repository-level processing), optimized training strategies (including FIM), and large-scale training (2T tokens). The models excel in code generation, infilling, cross-file understanding, and math reasoning. The instruction-tuned versions are highly capable dialogue agents for coding tasks. The development of v1.5 highlights the value of building specialized models on strong generalist foundations. The models are released under a permissive license for both research and commercial use.