
StarCoder: may the source be with you!

(2305.06161)
Published May 9, 2023 in cs.CL , cs.AI , cs.PL , and cs.SE

Abstract

The BigCode community, an open-scientific collaboration working on the responsible development of LLMs for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

Overview

  • StarCoder and StarCoderBase are 15.5B-parameter LLMs trained on code, with an 8K token context length, infilling capability, and multi-query attention for efficient large-batch inference (see the sketch after this list).

  • The Stack, a collection of GitHub repositories, provided a 1 trillion token corpus for StarCoderBase, while StarCoder was fine-tuned on 35B Python tokens.

  • StarCoderBase excels in multi-language support, matching or outperforming OpenAI's code-cushman-001 model, while StarCoder excels in Python and retains multi-language proficiency.

  • The developers have emphasized responsible AI development, with a PII redaction pipeline and an attribution tool that traces code generations back to training data, supporting license compliance.

  • Evaluation strategies for StarCoder cover language understanding, reasoning, and safety aspects, with the model performing well across various benchmarks.
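Multi-query attention, mentioned above, shares a single key/value head across all query heads, which shrinks the KV cache and speeds up large-batch decoding. Below is a minimal, illustrative PyTorch sketch of the idea; dimensions and weights are made up, causal masking is omitted, and this is not StarCoder's actual implementation.

```python
# Minimal sketch of multi-query attention (MQA): all query heads share one
# key/value head, reducing KV-cache size during decoding. Illustrative only.
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_kv, n_heads):
    B, T, D = x.shape
    d_head = D // n_heads
    q = (x @ w_q).view(B, T, n_heads, d_head).transpose(1, 2)  # (B, H, T, d)
    kv = x @ w_kv                                               # (B, T, 2*d)
    k, v = kv.split(d_head, dim=-1)
    k = k.unsqueeze(1)                                          # (B, 1, T, d), shared by all heads
    v = v.unsqueeze(1)
    att = F.softmax((q @ k.transpose(-2, -1)) / d_head**0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, D)

B, T, D, H = 2, 16, 256, 8
x = torch.randn(B, T, D)
w_q = torch.randn(D, D) * 0.02
w_kv = torch.randn(D, 2 * (D // H)) * 0.02
print(multi_query_attention(x, w_q, w_kv, H).shape)  # torch.Size([2, 16, 256])
```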

Introduction

The BigCode community has unveiled StarCoder and StarCoderBase, large language models trained on code. Featuring 15.5B parameters and an 8K token context length, these models support infilling and efficient large-batch inference via multi-query attention. The training corpus for StarCoderBase amounts to 1 trillion tokens sourced from a diverse collection of permissively licensed GitHub repositories known as The Stack. StarCoder is StarCoderBase's fine-tuned counterpart, trained further on 35B Python tokens. A comprehensive evaluation shows that StarCoderBase surpasses all other open Code LLMs with multi-language support and matches or outperforms OpenAI's code-cushman-001 model. Moreover, StarCoder outperforms other open models fine-tuned on Python while maintaining proficiency in other programming languages.
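Infilling means the model can complete a span between a given prefix and suffix rather than only continuing left-to-right. A minimal sketch of prompting this capability through Hugging Face transformers follows; it assumes the fill-in-the-middle special tokens documented for the StarCoder family (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`), which should be verified against the model card before use.

```python
# Sketch of fill-in-the-middle (infilling) prompting with StarCoder via transformers.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder"  # gated model: accepting the license on the Hub is required
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The model generates the middle span that fits between prefix and suffix.
prefix = "def fibonacci(n):\n    "
suffix = "\n    return a\n"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```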

Model Development

Development of the StarCoder models reflects a commitment to responsible practices: respect for copyright, protection of privacy, and community involvement throughout the process. To support license and privacy compliance, the PII redaction pipeline was improved and an attribution tool was built to trace code generations back to their training data. Open access is central to BigCode's community-driven approach: The Stack serves as a transparent pre-training dataset, with governance tools that let developers check whether their code is included and an opt-out process for those who want it removed. This openness enables external audits, invites contributions to model improvements, and serves as a model for open scientific collaboration.
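For intuition, PII redaction replaces detected personal identifiers in source files with placeholder tokens before training. The snippet below is a toy, regex-only sketch of that idea; the actual BigCode pipeline uses trained detectors and covers more PII categories, so treat this purely as illustration.

```python
import re

# Illustrative only: a regex pass over source text, not the BigCode pipeline.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(source: str) -> str:
    """Replace detected emails and IPv4 addresses with placeholder tokens."""
    source = EMAIL.sub("<EMAIL>", source)
    source = IPV4.sub("<IP_ADDRESS>", source)
    return source

print(redact_pii("# maintainer: jane.doe@example.com, host 192.168.0.1"))
```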

Empirical Analysis

Benchmark evaluation is at the core of Code LLM assessment. The evaluation strategy for StarCoder spans a diverse array of benchmarks covering language understanding, reasoning, and toxicity. Performance on GSM8K shows StarCoderBase's reasoning ability surpassing that of similarly sized Code LLMs, while MMLU and CoQA results attest to its natural-language understanding. RealToxicityPrompts helps detect potential bias and toxicity in generated text, an essential safety check. Strong results across these benchmarks place StarCoder and StarCoderBase among the leading open Code LLMs.
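Code-generation benchmarks such as HumanEval are typically scored with the unbiased pass@k estimator of Chen et al. (2021): sample n completions per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws is correct. A short sketch with made-up counts (not figures from the paper):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 80 of which pass the unit tests.
print(round(pass_at_k(n=200, c=80, k=1), 3))  # 0.4, i.e. 40% pass@1
```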

Tools for Safe Deployment

The StarCoder models are released under an OpenRAIL-M license, which attaches use restrictions intended to avert misuse in critical scenarios while improving transparency and encouraging ethical use. To further support responsible deployment, the release includes a membership-checking tool and a BM25 index search that let users link model output back to the training set. These tools are early steps toward safeguarding responsible AI deployment, curbing misuse, and strengthening accountability for model-generated code.
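To make the attribution idea concrete, a BM25 index ranks training documents by lexical similarity to a generated snippet, so a user can inspect the closest matches. The sketch below uses the `rank_bm25` package over a toy corpus; it illustrates the concept only and is not the BigCode search tool itself.

```python
# Sketch of BM25-based attribution: retrieve the training document most similar
# to a generated snippet. Toy corpus; illustrative only.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "def add(a, b):\n    return a + b",
    "def quicksort(xs):\n    ...",
    "class LinkedList:\n    ...",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

generated = "def add(x, y):\n    return x + y"
top_match = bm25.get_top_n(generated.split(), corpus, n=1)[0]
print(top_match)  # the closest training document to the generated code
```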

In conclusion, the BigCode community's release of StarCoder and StarCoderBase marks a significant step toward the effective and safe application of Code LLMs. With open access, thorough evaluation, and tools for responsible use, these models advance the state of open Code LLMs while encouraging community engagement and collaboration.

