
To Code, or Not To Code? Exploring Impact of Code in Pre-training

(2408.10914)
Published Aug 20, 2024 in cs.CL

Abstract

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there is anecdotal consensus among practitioners that code data plays a vital role in general LLM performance, only limited work has analyzed the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask: "What is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results that code is a critical building block for generalization far beyond coding tasks, and that improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code yields relative improvements of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, and 6.6% in generative win-rates, as well as a 12x boost in code performance. Our work suggests that investing in code quality and preserving code during pre-training have positive impacts.

Figure: Impact of varying code factors on model performance across the evaluated tasks.

Overview

  • The paper investigates the impact of incorporating code in the pre-training mixtures of LLMs on multiple downstream tasks that extend beyond code generation.

  • Key findings include a significant performance boost on non-code tasks from including code data, an optimal code proportion of around 25% of the pre-training mix, and the beneficial role of high-quality, synthetically generated code.

  • The authors use a robust experimental framework, evaluating models of varying sizes across tasks such as natural language reasoning, world knowledge, and code benchmarks, and propose practical insights for future LLM designs.

Overview of "To Code, or Not To Code? Exploring Impact of Code in Pre-training"

The paper "To Code, or Not To Code? Exploring Impact of Code in Pre-training" by Viraat Aryabumi et al. provides a thorough investigation of the role of code in the pre-training mixtures of LLMs, specifically focusing on its impact on downstream tasks that extend beyond code generation. The study is predicated on anecdotal consensus among LLM practitioners about the importance of code data for improving general performance, but it aims to systematically and empirically analyze this impact across various tasks and model sizes.

Key Findings and Contributions

The authors address multiple aspects of code data utilization in pre-training through a series of well-defined, large-scale experiments. These include examining the initialization strategies, varying proportions of code, the quality and properties of code datasets, and the introduction of code in the pre-training cooldown phase. The results confirm several significant points:

Importance of Code in Pre-training:

  • The inclusion of code data significantly boosts performance in non-code tasks. The best model configuration, balanced→text followed by cooldown with code, outperformed the text-only pre-training baseline with a relative increase of 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, a 6.6% improvement in generative win-rates, and a 12x boost in code performance.

Impact of Initialization:

  • Models initialized from code-pretrained checkpoints (code→text and balanced→text) generally outperformed those whose initialization included no code, underlining the benefit of starting from mixed pre-training even when the focus is on non-code tasks.

Optimal Proportion of Code:

  • The study found that a proportion of 25% code data in the pre-training mix maximized performance across NL reasoning tasks, with higher proportions improving code generation performance linearly but potentially degrading world knowledge task performance.
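
As an illustrative aside (not the authors' actual pipeline), a 25%-code mixture can be expressed as per-source sampling weights; the source names and sampler below are placeholders, with only the 25% code share taken from the paper.

```python
import random

# Hypothetical source names; only the 25% code share comes from the paper's ablations.
MIXTURE_WEIGHTS = {
    "text": 0.75,  # natural-language documents
    "code": 0.25,  # code share that maximized NL reasoning performance
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source of the next training document according to the mixture."""
    sources = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 7,500 text draws and 2,500 code draws
```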

Quality and Type of Code Data:

  • Including high-quality, synthetically generated code data, even in small proportions, had a significant positive impact. The authors report a 9% improvement in NL reasoning and a 44.9% increase in code performance when synthetic code data was included.

Role of Cooldown Phase:

  • Enhancements were observed when code data was included in the cooldown phase, a stage where high-quality datasets are up-weighted, with improvements of 3.6% in NL reasoning, 10.1% in world knowledge, and 20% in code performance relative to the model before the cooldown phase.

Methodology and Evaluation

The experimental framework of this study is robust, involving models ranging from 470M to 2.8B parameters, evaluated across a wide spectrum of tasks including natural language reasoning, world knowledge, code benchmarks, and generative quality assessed via LLM-as-a-judge win-rates.

Pre-training Data:

  • The authors used a variety of code data sources including web-based code, markup-style data, synthetic code data, and code-adjacent datasets. Text data was drawn from the SlimPajama dataset, excluding code-related documents to isolate the effect of code data.
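
As a hedged illustration of what "excluding code-related documents" can look like in practice, the heuristic filter below uses made-up field names and markers; it is not the paper's actual SlimPajama preprocessing.

```python
# Illustrative text-only filter. The "source"/"text" fields and the code markers
# are assumptions for this sketch, not the paper's actual preprocessing rules.
CODE_MARKERS = ("def ", "#include", "public class", "SELECT * FROM")

def looks_like_code(doc: dict) -> bool:
    """Heuristically flag documents that are, or embed, source code."""
    if doc.get("source") in {"github", "stackexchange"}:
        return True
    return any(marker in doc.get("text", "") for marker in CODE_MARKERS)

docs = [
    {"source": "commoncrawl", "text": "The capital of France is Paris."},
    {"source": "github", "text": "def add(a, b):\n    return a + b"},
]
text_only = [d for d in docs if not looks_like_code(d)]
print(len(text_only))  # 1
```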

Training Strategy:

  • Models underwent continued pre-training and a dedicated cooldown phase, with controlled variations in learning-rate schedules and data weightings to assess the specific contribution of each stage.
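
The exact schedule is not reproduced in this summary; as a minimal sketch, the cooldown can be modeled as a final phase that linearly anneals the learning rate while up-weighting high-quality text and code sources. Every numeric value below is illustrative rather than taken from the paper.

```python
def learning_rate(step: int, total_steps: int, peak_lr: float = 3e-4,
                  warmup_steps: int = 2_000, cooldown_frac: float = 0.1) -> float:
    """Illustrative schedule: linear warmup, constant plateau, linear cooldown to zero.

    The peak LR, warmup length, and cooldown fraction are placeholders,
    not the paper's hyperparameters.
    """
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < cooldown_start:
        return peak_lr
    remaining = total_steps - step
    return peak_lr * remaining / max(1, total_steps - cooldown_start)

# During cooldown, high-quality sources (including code) are up-weighted relative
# to the main pre-training mixture; these weights are illustrative only.
PRETRAIN_WEIGHTS = {"text": 0.75, "code": 0.25}
COOLDOWN_WEIGHTS = {"high_quality_text": 0.5, "high_quality_code": 0.3, "text": 0.2}

print(learning_rate(step=95_000, total_steps=100_000))  # mid-cooldown: 0.00015
```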

Evaluation Suite:

  • The evaluation suite consisted of benchmarks that tested world knowledge (e.g., TriviaQA, Natural Questions Open), NL reasoning (e.g., BoolQ, PiQA, HellaSwag), and code performance (e.g., HumanEval-Python, MBPP). Generative performance was also assessed using win-rates from LLM-as-a-judge evaluations.
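
As a minimal sketch of how pairwise LLM-as-a-judge results are commonly aggregated (the judge model, prompt set, and tie handling are not specified in this summary, so the tie-as-half-win rule here is an assumption):

```python
from typing import Iterable, Literal

Judgment = Literal["model_a", "model_b", "tie"]  # judge's verdict for one prompt

def win_rate(judgments: Iterable[Judgment]) -> float:
    """Win-rate of model A over model B, counting ties as half a win (assumed convention)."""
    judgments = list(judgments)
    wins = sum(j == "model_a" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    return (wins + 0.5 * ties) / len(judgments)

print(win_rate(["model_a", "tie", "model_b", "model_a"]))  # 0.625
```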

Implications and Future Directions

The outcomes of this study support the significant role of code in pre-training mixtures, beyond the domain of code generation. This not only suggests a reassessment of current pre-training data compositions but also points to strategic investments in quality-controlled and synthetic code data.

From a practical perspective, the insights from this work could guide the design of more versatile LLMs that are capable of excelling across diverse tasks, including those requiring sophisticated reasoning and general knowledge. Future research could expand on these insights by exploring larger model scales, investigating the role of code in safety and ethical considerations, and dynamically adjusting the proportion of code data during different pre-training phases.

Conclusion

This paper provides a comprehensive and empirical basis for the inclusion of code data in LLM pre-training, highlighting its multifaceted benefits and offering pragmatic guidelines on optimizing pre-training recipes. The thorough experimental methodology and extensive evaluation strengthen the credibility of the findings, which collectively advance our understanding of the critical role of code in augmenting LLM performance.
