DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (2401.02954v1)

Published 5 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The rapid development of open-source LLMs has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source LLMs with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Summary

  • The paper presents DeepSeek LLM which scales open-source language models using scaling laws derived from non-embedding FLOPs/token metrics.
  • It employs a novel multi-step learning rate scheduler and advanced techniques like Rotary Embedding and Grouped-Query Attention to optimize performance.
  • The study demonstrates that balancing model and data scaling via fitted IsoFLOP curves yields superior benchmark performance compared to models such as LLaMA-2 70B.

DeepSeek LLM: Scaling Open-Source LLMs with Longtermism

The paper presents "DeepSeek LLM," a project focused on the scaling of open-source LLMs, with a specific emphasis on models of two different scales: 7 billion (7B) and 67 billion (67B) parameters. The paper is centered around the derivation and application of scaling laws intended to guide the effective scaling of LLMs within computational and data constraints.

Introduction to Scaling Laws and DeepSeek LLM

The paper begins by addressing the variability in conclusions from prior research on scaling laws, which describe the relationship between model performance, model size (N), dataset size (D), and compute budget (C). These laws guide developers in allocating a growing compute budget between model and data scale, with the long-term goal of AGI in view.
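
For orientation, the conventional formulation from the earlier scaling-law literature is sketched below. This is background notation rather than a quotation of the paper's own fit, and the exponents a and b are quantities to be estimated empirically.

```latex
% Conventional compute-optimal scaling (background sketch, not the paper's final fit):
C \approx 6\,N D
\qquad\text{(compute budget from model size $N$ and training tokens $D$)}

N_{\mathrm{opt}} \propto C^{\,a}, \qquad
D_{\mathrm{opt}} \propto C^{\,b}, \qquad
a + b \approx 1
```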

DeepSeek LLM contributes to this field by exploring these scaling behaviors to build models that surpass existing open-source baselines such as LLaMA across a range of benchmarks. By pre-training on a dataset of 2 trillion tokens and applying targeted fine-tuning, DeepSeek constructs models that excel at diverse tasks, outperforming LLaMA-2 70B in particular on code, math, and reasoning benchmarks.

Pre-Training and Model Architecture

Data Processing

Deduplication, filtering, and remixing are the key stages for ensuring data quality and diversity in the training corpus. The deduplication strategy operates across multiple data dumps at once, which removes duplicates far more effectively than deduplicating each dump in isolation and thereby maximizes the uniqueness and relevance of training instances.
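
As a minimal sketch of why cross-dump deduplication is stronger than per-dump deduplication, the snippet below hashes lightly normalized documents over all dumps together; the normalization and function names are illustrative assumptions, not the paper's actual pipeline.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def doc_key(text: str) -> str:
    """Hash a lightly normalized document so near-identical copies collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_across_dumps(dumps: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Keep only the first occurrence of each document across *all* dumps.

    Deduplicating each dump separately would miss copies that recur across
    dumps, which is the effect cross-dump deduplication targets.
    """
    seen = set()
    kept: List[Tuple[str, str]] = []
    for dump_name, docs in dumps.items():
        for doc in docs:
            key = doc_key(doc)
            if key not in seen:
                seen.add(key)
                kept.append((dump_name, doc))
    return kept


# A duplicate that appears in two different dumps is kept only once.
dumps = {
    "dump-2023-06": ["the quick brown fox", "unique article A"],
    "dump-2023-14": ["The quick  brown fox", "unique article B"],
}
print(len(dedup_across_dumps(dumps)))  # 3, not 4
```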

Model Architecture

DeepSeek LLM adopts an architecture close to LLaMA's Pre-Norm design, using RMSNorm for normalization and SwiGLU as the activation function, along with rotary position embeddings and, for the larger model, Grouped-Query Attention (GQA) to reduce inference cost. The main departure lies in the macro design: rather than simply widening layers, DeepSeek scales network depth, tuning layer counts to balance capability against training and inference cost.

Detailed Model Specifications:

The 7B model uses 30 transformer layers and the 67B model uses 95 layers, trading extra depth for capability while keeping inference costs manageable.
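
For concreteness, a configuration sketch in this spirit is shown below. Only the layer counts come from the text above; the hidden sizes, head counts, context length, and vocabulary size are illustrative assumptions rather than quoted specifications.

```python
from dataclasses import dataclass


@dataclass
class DecoderConfig:
    """Illustrative decoder-only transformer configuration.

    Only n_layers reflects the text above; every other value is an assumption
    included to make the sketch concrete.
    """
    n_layers: int
    d_model: int            # hidden size (assumed)
    n_heads: int            # query heads (assumed)
    n_kv_heads: int         # < n_heads implies Grouped-Query Attention
    vocab_size: int = 102_400   # assumed tokenizer vocabulary
    context_len: int = 4_096    # assumed pre-training context length
    norm: str = "RMSNorm"
    activation: str = "SwiGLU"
    pos_embedding: str = "rotary"


# 7B-like: standard multi-head attention (kv heads == query heads).
deepseek_7b_like = DecoderConfig(n_layers=30, d_model=4096, n_heads=32, n_kv_heads=32)

# 67B-like: deeper rather than wider, with GQA to shrink the inference KV cache.
deepseek_67b_like = DecoderConfig(n_layers=95, d_model=8192, n_heads=64, n_kv_heads=8)
```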

Hyperparameter Optimization

The paper emphasizes the choice of hyperparameters, highlighting the use of a novel multi-step learning rate scheduler over the conventional cosine scheduler.

Figure 1: Training loss curves with different learning rate schedulers or different parameters for schedulers. The model size is 1.6 billion parameters, trained on a dataset of 100 billion tokens.

The multi-step scheduler matches the cosine scheduler's final performance while allowing the long constant-rate first stage to be reused when training is extended, which makes it well suited to continual and incremental training without retraining from scratch.
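
A minimal sketch of such a schedule is given below: a linear warmup followed by a constant rate that is dropped in discrete steps at fixed fractions of training. The stage boundaries, decay factors, and warmup length used here are assumptions for illustration, not the paper's exact settings.

```python
def multi_step_lr(step: int, total_steps: int, max_lr: float,
                  warmup_steps: int = 2000,
                  boundaries=(0.8, 0.9),          # training fractions (assumed)
                  factors=(1.0, 0.316, 0.1)) -> float:
    """Piecewise-constant learning rate with linear warmup.

    Unlike a cosine schedule, the long first stage does not depend on the
    total token budget, so a run can be extended or continued without
    redoing that stage.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = step / total_steps
    stage = sum(progress >= b for b in boundaries)   # 0, 1, or 2
    return max_lr * factors[stage]


# Example with a placeholder peak learning rate:
print(multi_step_lr(step=50_000, total_steps=100_000, max_lr=4.2e-4))  # constant stage
print(multi_step_lr(step=85_000, total_steps=100_000, max_lr=4.2e-4))  # ~0.316 * max
```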

Scaling Laws Evaluation

Deriving Optimal Model/Data Scaling Strategies

DeepSeek LLM represents model scale with non-embedding FLOPs per token rather than raw parameter counts, which yields more precise scaling predictions by excluding embedding parameters while still accounting for the computational overhead of attention.

Figure 2: IsoFLOP curve and optimal model/data allocation. The metric in the IsoFLOP curve is bits-per-byte on the validation set. The dotted lines in the optimal model/data scaling curves represent the power law fitting the smaller models (grey circles).
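
In symbols, the model-scale term and the resulting allocation fits take roughly the following form, where M denotes non-embedding FLOPs per token; the constant factors follow standard FLOP accounting and should be read as indicative, and the exponents a and b are fitted from the IsoFLOP experiments rather than fixed in advance.

```latex
% Model scale measured in non-embedding FLOPs per token (constants indicative):
M \approx 72\, n_{\mathrm{layer}}\, d_{\mathrm{model}}^{2}
      \;+\; 12\, n_{\mathrm{layer}}\, d_{\mathrm{model}}\, l_{\mathrm{seq}},
\qquad C = M \cdot D

% Compute-optimal allocation fitted from IsoFLOP profiles:
M_{\mathrm{opt}} \propto C^{\,a}, \qquad
D_{\mathrm{opt}} \propto C^{\,b}, \qquad
a + b \approx 1
```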

The IsoFLOP method allows efficient fitting of scaling curves, providing insight into optimal allocation strategies between model size and data scale.

Figure 3: Performance scaling curve. The metric is the bits-per-byte on the validation set. The dotted line represents the power law fitting the smaller model (grey circles). The blue stars represent DeepSeek LLM 7B and 67B. Their performance is well-predicted by the scaling curve.

The measured performance of the DeepSeek LLM 7B and 67B models aligns with the predictions of these scaling laws, demonstrating their usefulness for guiding large-scale training.

Impact of Data Quality

Analysis across datasets of differing quality indicated that, as data quality improves, the compute-optimal allocation shifts toward model scaling rather than data expansion.
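
Read through the fits above, this means the model-scaling exponent grows at the expense of the data-scaling exponent as data quality improves; schematically (the concrete exponent values depend on the dataset and are not reproduced here):

```latex
% Higher-quality data shifts the compute-optimal split toward the model:
a_{\text{high-quality}} > a_{\text{lower-quality}},
\qquad
b_{\text{high-quality}} < b_{\text{lower-quality}},
\qquad\text{with } M_{\mathrm{opt}} \propto C^{a},\; D_{\mathrm{opt}} \propto C^{b}
```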

Alignment and Fine-Tuning

The DeepSeek project applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to strengthen the models' capabilities in dialogue and open-ended task settings. This phase carries the models from pre-training into chat applications, where DeepSeek LLM 67B Chat outperforms GPT-3.5 in open-ended evaluations.
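
For reference, DPO optimizes the standard preference objective below, training the policy directly on chosen/rejected response pairs without a separate reward model; this is the published DPO formulation rather than a DeepSeek-specific variant.

```latex
% Direct Preference Optimization over preference pairs (x, y_w, y_l),
% with reference policy \pi_{ref} and temperature \beta:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```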

Evaluation Results

DeepSeek LLM outperforms models such as LLaMA-2 70B across multiple benchmarks, with the largest gains coming from the 67B model, suggesting that capability improvements grow markedly with scale. This is further confirmed by evaluations on standardized exams, math reasoning benchmarks, and held-out in-house test sets.

Conclusion

DeepSeek LLM marks a significant step toward refining open-source LLMs through scalable, scaling-law-guided improvements. Its findings on data and model scaling not only improve benchmark performance but also make more efficient use of compute on the path toward AGI. Future iterations aim to improve dataset scale and quality and to explore more advanced alignment strategies, indicating ongoing contributions to open-source AI research.
