StarCoder 2 and The Stack v2: The Next Generation

(arXiv:2402.19173)
Published Feb 29, 2024 in cs.SE and cs.AI

Abstract

The BigCode project, an open-scientific collaboration focused on the responsible development of LLMs for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

Figure: Distribution of the top 20 programming languages in the collected documentation dataset.

Overview

  • StarCoder2 represents a significant evolution in code generation LLMs, offering models with 3B, 7B, and 15B parameters trained on a dataset four times the size of its predecessor.

  • The Stack v2, developed in partnership with Software Heritage, is a comprehensive dataset four times larger than the original Stack, incorporating a wide range of sources, including GitHub pull requests and Kaggle notebooks.

  • Evaluations of StarCoder2 models demonstrate superior performance in tasks such as code completion, code editing, and mathematical reasoning, with even the smallest model outperforming others of similar size.

  • The advancements in StarCoder2 and The Stack v2 reflect the BigCode project's commitment to open science, ethical data sourcing, and the acceleration of responsible AI development for code generation.

StarCoder 2 and The Stack v2: Advancing the Frontiers of Code Generation LLMs

The BigCode project, an open scientific collaboration focused on the responsible development of LLMs for Code (Code LLMs), recently introduced StarCoder2. This initiative marks a significant advancement in the field of code generation LLMs, extending the foundational work done on the initial StarCoder and The Stack datasets. In partnership with Software Heritage, the project has developed The Stack v2, a vastly expanded corpus for training code generation models. This blog post presents a comprehensive overview of StarCoder2, the development of The Stack v2, and the evaluations performed to gauge the models' capabilities.

Introduction to StarCoder 2

StarCoder2 encompasses a family of models with 3B, 7B, and 15B parameters, pushing the boundaries of what's possible in code generation. These models were trained on a dataset approximately four times larger than the one used for the original StarCoder, resulting in significant performance improvements. The training set, rooted in the Software Heritage archive and supplemented with other high-quality datasets, spans 619 programming languages.
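
Because the model weights are released under an OpenRAIL license, the checkpoints can be loaded with standard tooling. Below is a minimal sketch of generating a completion with the Hugging Face transformers library, assuming the weights are published under the bigcode organization on the Hub (e.g., bigcode/starcoder2-3b); substitute the 7B or 15B variant as needed.

    # Minimal sketch: code completion with a StarCoder2 checkpoint.
    # "bigcode/starcoder2-3b" is the assumed Hub model ID.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "bigcode/starcoder2-3b"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    prompt = "def fibonacci(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))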

The Development of The Stack v2

The Stack v2 builds upon the digital commons of Software Heritage’s source code archive, enhanced with additional data sources like GitHub pull requests, Kaggle notebooks, and extensive documentation. This meticulously curated and cleaned dataset is four times larger than the first version of The Stack, facilitating the training of more nuanced and powerful models.
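
Since transparency about the training data is a stated goal, the dataset itself can be inspected. A hedged sketch using the Hugging Face datasets library follows, assuming the release is hosted as bigcode/the-stack-v2; note that the published rows may carry SWHIDs and metadata rather than raw file contents, so check the dataset card for the exact schema.

    # Hedged sketch: streaming records from The Stack v2 without a full
    # download. "bigcode/the-stack-v2" is the assumed dataset ID; rows may
    # contain SWHIDs and metadata rather than raw source files.
    from datasets import load_dataset

    ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)
    for i, record in enumerate(ds):
        print(record)  # inspect fields such as the SWHID, language, license
        if i >= 2:
            break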

Evaluation and Benchmarks

StarCoder2 models were evaluated against a suite of benchmarks designed to test code completion, code fixing and editing, mathematical reasoning, and more. These evaluations show that, in many instances, the smaller StarCoder2-3B model outperforms other models of similar size and even surpasses StarCoderBase-15B, the largest model of the previous generation. The largest in the family, StarCoder2-15B, sets new standards by matching or outperforming models more than twice its size, such as CodeLlama-34B, on several benchmarks.
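
For context, code completion benchmarks such as HumanEval typically report pass@k: the probability that at least one of k sampled completions passes the unit tests. The standard unbiased estimator (from the Codex paper, not specific to this paper's evaluation harness) is sketched below.

    # Unbiased pass@k estimator: n samples drawn, c of them correct,
    # k = evaluation budget. pass@k = 1 - C(n-c, k) / C(n, k).
    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Probability that at least one of k draws (without replacement)
        from n samples, c of which are correct, passes the tests."""
        if n - c < k:
            return 1.0  # too few failing samples to fill all k slots
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    print(pass_at_k(n=200, c=37, k=1))  # 0.185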

Repository-Level Code Completion

Focusing on practical applications, the models were assessed on their capability to perform code completion at the repository level, demonstrating significant improvements over earlier models. These improvements are credited to the methodology employed in creating The Stack v2 and the robust training approach that leveraged this expansive dataset.
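
To make the setting concrete, repository-level completion feeds the model context from several files of the same repository, not just the file being edited. The sketch below assembles such a prompt using the <repo_name> and <file_sep> sentinel tokens described for StarCoder2's training format; treat the exact token names and layout as assumptions to be verified against the released tokenizer.

    # Hedged sketch: concatenating a repository's files into one prompt.
    # The <repo_name> and <file_sep> sentinels follow the training format
    # described for StarCoder2; verify them against the released tokenizer.

    def build_repo_prompt(repo: str, files: dict[str, str]) -> str:
        """Join file paths and contents into a single context string."""
        parts = [f"<repo_name>{repo}"]
        for path, code in files.items():
            parts.append(f"<file_sep>{path}\n{code}")
        return "".join(parts)

    prompt = build_repo_prompt(
        "example/calculator",
        {
            "calculator/ops.py": "def add(a, b):\n    return a + b\n",
            "calculator/cli.py": "from calculator.ops import add\n\ndef main():",
        },
    )
    # The model continues main() with cross-file context: it can see the
    # imported add() helper defined in ops.py.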

Advancements and Social Impact

The development of StarCoder2 and The Stack v2 encapsulates the BigCode project’s commitment to open science, ethical data sourcing, and the acceleration of research in the development of Code LLMs. By ensuring transparency in the training data and providing open access to model weights, the project aids in democratizing AI advancements and fostering an environment of responsible AI development. Furthermore, the project addresses challenges in privacy, security, and societal and representational biases, underscoring the importance of balanced and mindful technological progress.

Conclusion

StarCoder2 represents a leap forward in the domain of code generation with LLMs, supported by the extensive dataset provided by The Stack v2. These advancements showcase the potential of collaborative, open scientific projects in pushing the boundaries of AI and providing the groundwork for future innovations. As the BigCode project continues to evolve, it remains centered on the pillars of responsible development, open access, and community engagement, paving the way for more inclusive and ethically considered advancements in AI.

Acknowledgements

This work is a testament to the collaborative spirit of the BigCode community, Software Heritage, and all contributors across the globe. It is a powerful example of what can be achieved when the scientific community comes together in pursuit of open, responsible technological advancement.
