The BigCode community, an open-scientific collaboration working on the responsible development of LLMs for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities, and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens to create StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
StarCoder and StarCoderBase are 15.5B-parameter LLMs trained on code, with an 8K token context length and multi-query attention for efficient large-batch inference.
The Stack, a collection of permissively licensed GitHub repositories, provided a 1 trillion token corpus for StarCoderBase, while StarCoder was fine-tuned on a further 35B Python tokens.
StarCoderBase outperforms every open multi-language Code LLM and matches or outperforms OpenAI's code-cushman-001 model, while StarCoder excels at Python yet retains its multi-language proficiency.
The developers emphasize responsible AI development, pairing a PII redaction pipeline with tools that trace code generations back to the training data to support legal compliance.
Evaluation strategies for StarCoder cover language understanding, reasoning, and safety aspects, with the model performing well across various benchmarks.
The BigCode community has unveiled StarCoder and StarCoderBase, large language models trained on code. Featuring 15.5B parameters and an 8K token context length, these models offer infilling capabilities and efficient large-batch inference via multi-query attention. The training corpus for StarCoderBase amounts to 1 trillion tokens sourced from The Stack, a diverse collection of permissively licensed GitHub repositories. StarCoder is StarCoderBase's fine-tuned counterpart, trained on a further 35B Python tokens. A comprehensive evaluation reveals that StarCoderBase surpasses every open Code LLM that supports multiple programming languages and matches the performance of OpenAI's code-cushman-001 model. Moreover, StarCoder outperforms models fine-tuned on Python while maintaining proficiency in other programming languages.
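The efficiency gain from multi-query attention comes from all query heads sharing a single key/value head. The sketch below (a minimal NumPy illustration, not the paper's implementation; the function and weight names are hypothetical) shows the mechanism for one causal attention layer.

```python
import numpy as np

def multi_query_attention(x, W_q, W_k, W_v, n_heads):
    """Minimal multi-query attention: n_heads query heads share one K/V head.

    x:   (seq_len, d_model) input activations
    W_q: (d_model, n_heads * d_head) per-head query projections
    W_k: (d_model, d_head) single shared key projection
    W_v: (d_model, d_head) single shared value projection
    """
    seq_len, _ = x.shape
    d_head = W_k.shape[1]
    q = (x @ W_q).reshape(seq_len, n_heads, d_head)  # one Q per head
    k = x @ W_k                                      # shared across heads
    v = x @ W_v                                      # shared across heads
    # scores[h] = q[:, h] @ k.T: every head attends against the same K
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    # causal mask: position i attends only to positions <= i
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,kd->qhd", weights, v)       # same V for all heads
    return out.reshape(seq_len, n_heads * d_head)
```

During autoregressive decoding, the K/V cache then stores one head's worth of keys and values instead of n_heads, shrinking cache memory and bandwidth roughly n_heads-fold, which is what makes large-batch inference fast.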
The StarCoder models demonstrate a commitment to responsible development, encompassing respect for copyright, protection of privacy, and community involvement in the development process. To support legal compliance, the PII redaction pipeline has been enhanced and an attribution tool developed that traces code generations back to the training data. Open access is pivotal to the community-driven approach of the BigCode project: The Stack provides a transparent pre-training dataset with governance tools that let developers verify whether their code is included, along with an opt-out process for those who wish to exclude it. This openness facilitates external audits and contributions to model improvements, and serves as a model of open scientific collaboration.
Evaluation benchmarks form the core of Code LLM assessment. The evaluation strategy for StarCoder integrates a diverse array of benchmarks covering language understanding, reasoning, and toxicity. Performance on GSM8K demonstrates the reasoning capabilities of StarCoderBase, which surpasses Code LLMs of similar parameter count. Metrics from MMLU and CoQA gauge its language understanding, while RealToxicityPrompts helps detect potential biases and toxicity in generated text, an essential safety aspect. Strong performance across these benchmarks cements the standing of StarCoder and StarCoderBase among current Code LLMs.
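Code-generation scores such as the 40% pass@1 on HumanEval are typically computed with the unbiased pass@k estimator: draw n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k random draws passes. A minimal sketch (the function names are illustrative, not from the paper's evaluation harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct.

    Probability that at least one of k randomly drawn samples passes:
    1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failing samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average per-problem estimates; results: (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For pass@1 this reduces to the fraction of samples that pass, averaged over problems; sampling n > k completions per problem just lowers the variance of the estimate.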
The StarCoder models are released under an OpenRAIL-M license, which stipulates use restrictions intended to avert misuse in critical scenarios. This addresses liability concerns by improving transparency and encouraging ethical usage. Complementing the responsible-deployment effort, new tools for membership checking and a BM25 index search have been published, enabling users to link model output back to the training set. These tools are pioneering steps towards responsible AI deployment, curbing misuse and bolstering accountability for model-generated code.
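To see how a BM25 index supports attribution, here is a minimal sketch of the standard BM25 ranking function over a tokenized corpus (this is the generic formula with the common defaults k1=1.5, b=0.75, not the published tool's implementation):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with BM25.

    query_tokens: list of tokens from a model generation
    docs:         list of tokenized training documents
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if df[t] == 0:
                continue
            # rare terms get higher inverse document frequency
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # term frequency, saturated by k1 and length-normalized by b
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```

Ranking training files by this score against a generated snippet surfaces the most lexically similar sources, which users can then inspect as candidate attributions.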
In conclusion, the BigCode community's contribution of StarCoder and StarCoderBase represents a significant stride towards the effective and safe application of Code LLMs. With open access, meticulous evaluation, and tools to ensure responsible use, these models stand as beacons of progress while galvanizing community engagement and collaboration.
Unified pre-training for program understanding and generation. In Proceedings of NAACL, 2021. https://aclanthology.org/2021.naacl-main.211.
BBC. ChatGPT accessible again in Italy. https://www.bbc.com/news/technology-65431914
A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, December 2022.
A neural probabilistic language model. In T. Leen, T. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. https://proceedings.neurips.cc/paper_files/paper/2000/hash/728f206c2a01bf572b5940d7d9a8fa4c-Abstract.html.
BigScience Workshop. BLOOM (revision 4ab0472), 2022. https://huggingface.co/bigscience/bloom.
Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858–867, Prague, Czech Republic, June 2007. Association for Computational Linguistics. https://aclanthology.org/D07-1090.
N-gram counts and language models from the Common Crawl. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 3579–3584, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf.
Matthew Butterick. This CoPilot is stupid and wants to kill me. https://matthewbutterick.com/chron/this-copilot-is-stupid-and-wants-to-kill-me.html
BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. https://aclanthology.org/N19-1423.
Euronews. Microsoft attracting users to its code-writing, generative AI software. https://www.euronews.com/next/2023/01/25/microsoft-results-ai
European Council. The general data protection regulation. https://www.consilium.europa.eu/en/policies/data-protection/data-protection-regulation/
A framework for few-shot language model evaluation, September 2021b. https://doi.org/10.5281/zenodo.5371628.
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301. https://aclanthology.org/2020.findings-emnlp.301.
Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690–696, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. https://aclanthology.org/P13-2121.
The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. https://openreview.net/forum?id=rygGQyrFvH.
Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5491–5501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.487. https://aclanthology.org/2020.acl-main.487.
Bradley M. Kuhn. If software is my copilot, who programmed my software? https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/
Measuring bias in contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 166–172, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3823. https://www.aclweb.org/anthology/W19-3823.
Fair learning. Tex. L. Rev., 99:743, 2020. https://texaslawreview.org/fair-learning/.
Natasha Lomas. Unpicking the rules shaping generative AI. https://techcrunch.com/2023/04/13/generative-ai-gdpr-enforcement/
On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 622–628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. https://www.aclweb.org/anthology/N19-1063.
An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1878–1898, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.132. https://aclanthology.org/2022.acl-long.132.
Recurrent neural network based language model. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura (eds.), INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048. ISCA, 2010. http://www.isca-speech.org/archive/interspeech_2010/i10_1045.html.
huggingface/tokenizers: Rust 0.13.2, November 2022. https://doi.org/10.5281/zenodo.7298413.
StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5356–5371, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. https://aclanthology.org/2021.acl-long.416.
CodeGen: an open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=iaYcJKpY2B_.
In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
OpenAI. GPT-4 system card. https://cdn.openai.com/papers/gpt-4-system-card.pdf, 2023b.
CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019. doi: 10.1162/tacl_a_00266. https://aclanthology.org/Q19-1016.
Copyright implications of the use of code repositories to train a machine learning model. https://www.fsf.org/licensing/copilot/copyright-implications-of-the-use-of-code-repositories-to-train-a-machine-learning-model
Arfon Smith. Making open source data more available. https://github.blog/2016-06-29-making-open-source-data-more-available/
Clive Thompson. How an AI became my code-writing genie, March 2022. https://www.wired.com/story/openai-copilot-autocomplete-for-code/.
Learning from the worst: Dynamically generated datasets to improve online hate detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1667–1682, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.132. https://aclanthology.org/2021.acl-long.132.
CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. https://aclanthology.org/2021.emnlp-main.685.
Chain-of-thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=_VjQlMeSB_J.
Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.emnlp-demos.6.
World Economic Forum. Future of jobs report. https://www3.weforum.org/docs/WEF_Future_of_Jobs_2023.pdf