
StarCoder: may the source be with you!

(2305.06161)
Published May 9, 2023 in cs.CL , cs.AI , cs.PL , and cs.SE

Abstract

The BigCode community, an open-scientific collaboration working on the responsible development of LLMs for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

Overview

  • StarCoder and StarCoderBase are 15.5B-parameter LLMs trained on code, with an 8K token context length, infilling capability, and multi-query attention for efficient large-batch inference (see the sketch after this list).

  • The Stack, a collection of GitHub repositories, provided a 1 trillion token corpus for StarCoderBase, while StarCoder was fine-tuned on 35B Python tokens.

  • StarCoderBase excels in multi-language support, matching or outperforming OpenAI's code-cushman-001 model, while StarCoder excels in Python and retains multi-language proficiency.

  • The developers have emphasized responsible AI development, with a PII redaction pipeline and an attribution tool that traces code generations back to training data, supporting license compliance.

  • Evaluation strategies for StarCoder cover language understanding, reasoning, and safety aspects, with the model performing well across various benchmarks.
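Multi-query attention, mentioned above, shares a single key/value head across all query heads, which shrinks the KV cache and speeds up large-batch decoding. Below is a minimal, illustrative PyTorch sketch of the idea; dimensions and weights are made up, causal masking is omitted, and this is not StarCoder's actual implementation.

```python
# Minimal sketch of multi-query attention (MQA): all query heads share one
# key/value head, reducing KV-cache size during decoding. Illustrative only.
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_kv, n_heads):
    B, T, D = x.shape
    d_head = D // n_heads
    q = (x @ w_q).view(B, T, n_heads, d_head).transpose(1, 2)  # (B, H, T, d)
    kv = x @ w_kv                                               # (B, T, 2*d)
    k, v = kv.split(d_head, dim=-1)
    k = k.unsqueeze(1)                                          # (B, 1, T, d), shared by all heads
    v = v.unsqueeze(1)
    att = F.softmax((q @ k.transpose(-2, -1)) / d_head**0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, D)

B, T, D, H = 2, 16, 256, 8
x = torch.randn(B, T, D)
w_q = torch.randn(D, D) * 0.02
w_kv = torch.randn(D, 2 * (D // H)) * 0.02
print(multi_query_attention(x, w_q, w_kv, H).shape)  # torch.Size([2, 16, 256])
```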

Introduction

The BigCode community has unveiled StarCoder and StarCoderBase, large language models trained on code. Featuring 15.5B parameters and an 8K token context length, these models support infilling and efficient large-batch inference via multi-query attention. The training corpus for StarCoderBase amounts to 1 trillion tokens sourced from a diverse collection of permissively licensed GitHub repositories known as The Stack. StarCoder is StarCoderBase's fine-tuned counterpart, trained further on 35B Python tokens. A comprehensive evaluation shows that StarCoderBase surpasses all other open Code LLMs with multi-language support and matches or outperforms OpenAI's code-cushman-001 model. Moreover, StarCoder outperforms other open models fine-tuned on Python while maintaining proficiency in other programming languages.
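Infilling means the model can complete a span between a given prefix and suffix rather than only continuing left-to-right. A minimal sketch of prompting this capability through Hugging Face transformers follows; it assumes the fill-in-the-middle special tokens documented for the StarCoder family (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`), which should be verified against the model card before use.

```python
# Sketch of fill-in-the-middle (infilling) prompting with StarCoder via transformers.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder"  # gated model: accepting the license on the Hub is required
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The model generates the middle span that fits between prefix and suffix.
prefix = "def fibonacci(n):\n    "
suffix = "\n    return a\n"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```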

Model Development

Development of the StarCoder models reflects a commitment to responsible practices: respect for copyright, protection of privacy, and community involvement throughout the process. To support license and privacy compliance, the PII redaction pipeline was improved and an attribution tool was built to trace code generations back to their training data. Open access is central to BigCode's community-driven approach: The Stack serves as a transparent pre-training dataset, with governance tools that let developers check whether their code is included and an opt-out process for those who want it removed. This openness enables external audits, invites contributions to model improvements, and serves as a model for open scientific collaboration.
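For intuition, PII redaction replaces detected personal identifiers in source files with placeholder tokens before training. The snippet below is a toy, regex-only sketch of that idea; the actual BigCode pipeline uses trained detectors and covers more PII categories, so treat this purely as illustration.

```python
import re

# Illustrative only: a regex pass over source text, not the BigCode pipeline.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(source: str) -> str:
    """Replace detected emails and IPv4 addresses with placeholder tokens."""
    source = EMAIL.sub("<EMAIL>", source)
    source = IPV4.sub("<IP_ADDRESS>", source)
    return source

print(redact_pii("# maintainer: jane.doe@example.com, host 192.168.0.1"))
```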

Empirical Analysis

Benchmark evaluation is at the core of Code LLM assessment. The evaluation strategy for StarCoder spans a diverse array of benchmarks covering language understanding, reasoning, and toxicity. Performance on GSM8K shows StarCoderBase's reasoning ability surpassing that of similarly sized Code LLMs, while MMLU and CoQA results attest to its natural-language understanding. RealToxicityPrompts helps detect potential bias and toxicity in generated text, an essential safety check. Strong results across these benchmarks place StarCoder and StarCoderBase among the leading open Code LLMs.
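Code-generation benchmarks such as HumanEval are typically scored with the unbiased pass@k estimator of Chen et al. (2021): sample n completions per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws is correct. A short sketch with made-up counts (not figures from the paper):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 80 of which pass the unit tests.
print(round(pass_at_k(n=200, c=80, k=1), 3))  # 0.4, i.e. 40% pass@1
```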

Tools for Safe Deployment

The StarCoder models are released under an OpenRAIL-M license, which attaches use restrictions intended to avert misuse in critical scenarios while improving transparency and encouraging ethical use. To further support responsible deployment, the release includes a membership-checking tool and a BM25 index search that let users link model output back to the training set. These tools are early steps toward safeguarding responsible AI deployment, curbing misuse, and strengthening accountability for model-generated code.
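To make the attribution idea concrete, a BM25 index ranks training documents by lexical similarity to a generated snippet, so a user can inspect the closest matches. The sketch below uses the `rank_bm25` package over a toy corpus; it illustrates the concept only and is not the BigCode search tool itself.

```python
# Sketch of BM25-based attribution: retrieve the training document most similar
# to a generated snippet. Toy corpus; illustrative only.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "def add(a, b):\n    return a + b",
    "def quicksort(xs):\n    ...",
    "class LinkedList:\n    ...",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

generated = "def add(x, y):\n    return x + y"
top_match = bm25.get_top_n(generated.split(), corpus, n=1)[0]
print(top_match)  # the closest training document to the generated code
```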

In conclusion, the BigCode community's release of StarCoder and StarCoderBase marks a significant step toward the effective and safe application of Code LLMs. With open access, thorough evaluation, and tools for responsible use, these models advance the state of open Code LLMs while encouraging community engagement and collaboration.

