
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

(2407.16607)
Published Jul 23, 2024 in cs.CL and cs.LG

Abstract

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent to which tokenizer training data is representative of the pretraining data, we indirectly learn about pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.

Figure: Two tokenizers trained on mixed English and Python data, illustrating the data mixture inference task.

Overview

  • The paper introduces a task called data mixture inference and an attack that exploits byte-pair encoding (BPE) tokenizers to infer the distribution of training data categories for language models.

  • The authors formulate the analysis as a linear programming problem, leveraging the ordered list of merge rules in BPE tokenizers to estimate the proportions of different training data categories.

  • Controlled experiments and applications to commercial tokenizers demonstrate the method's high accuracy and its implications for security, privacy, auditing, and fairness in AI models.

Data Mixture Inference: What Do BPE Tokenizers Reveal About Their Training Data?

The paper "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" explore the task of uncovering the distributional makeup of the pretraining datasets used for language models (LMs). By leveraging information embedded within byte-pair encoding (BPE) tokenizers, the authors propose an innovative attack method they term "data mixture inference."

Key Insights and Methodology

The foundational insight driving this work is the observation that the ordered list of merge rules learned by a BPE tokenizer inherently reflects token frequency information from the training data. Each step in the BPE training process merges the most frequent pair of bytes or tokens, ordered from the highest frequency to the lowest. Therefore, the sequence of merges effectively encodes information about the distribution of different data categories within the underlying dataset.
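
To make this concrete, the following minimal BPE trainer is an illustrative sketch (not the paper's implementation; the toy corpus and `num_merges` are made up) showing why the merge list comes out ordered by pair frequency:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Minimal BPE training sketch: at every step the single most frequent
    adjacent symbol pair is merged, so the resulting merge list is ordered
    by pair frequency in the training data."""
    # Represent the corpus as symbol sequences with their counts.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)                # recorded in frequency order
        # Apply the merge everywhere before the next iteration.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# Example: the earliest merges reflect the most common pairs in the corpus.
print(train_bpe(["low", "lower", "newest", "widest"] * 10, num_merges=5))
```

Because the most frequent remaining pair is merged at every step, a category that dominates the training mixture will dominate the early merges, which is exactly the signal the attack exploits.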

The authors formalize this into a linear programming (LP) problem. The task involves solving for the proportions of different categories (e.g., natural languages, programming languages, or data domains) that make up the tokenizer's training set. By applying the BPE tokenizer to data samples from each category and measuring the resulting pair frequencies, they set up constraints that reflect the relative frequencies of pairs in each category. The LP solver then estimates the proportions that best satisfy these constraints.
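
A simplified linear program in this spirit is sketched below, assuming hypothetical precomputed inputs (`merge_counts` and `rival_counts`, per-category pair-frequency tables measured after applying earlier merges); the paper's actual formulation and relaxations are more involved:

```python
import numpy as np
from scipy.optimize import linprog

def infer_mixture(merge_counts, rival_counts):
    """Sketch of a mixture-inference LP (not the authors' released code).

    merge_counts[t]    : length-n array, count of the t-th merged pair in each
                         of the n category samples (after applying merges < t).
    rival_counts[t][p] : length-n array, count of a competing pair p at step t.
    Solve for mixture weights alpha such that, at every step, the merged pair
    is at least as frequent as each rival in the alpha-weighted mixture,
    allowing non-negative slack; minimize total slack.
    """
    n = len(merge_counts[0])
    rows = []                                    # one inequality per (step, rival)
    for t, m in enumerate(merge_counts):
        for r in rival_counts[t]:
            rows.append(np.asarray(r) - np.asarray(m))   # want alpha·(m - r) + s >= 0
    k = len(rows)
    # Variables: [alpha_1..alpha_n, s_1..s_k]; objective: minimize sum of slacks.
    c = np.concatenate([np.zeros(n), np.ones(k)])
    # Inequalities in linprog form A_ub @ x <= b_ub:  (r - m)·alpha - s <= 0.
    A_ub = np.hstack([np.vstack(rows), -np.eye(k)])
    b_ub = np.zeros(k)
    # Equality: proportions sum to 1.
    A_eq = np.concatenate([np.ones(n), np.zeros(k)])[None, :]
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n + [(0, None)] * k
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]                             # estimated category proportions

# Toy usage: two categories, two merge steps, one rival pair per step.
alpha = infer_mixture(
    merge_counts=[[9.0, 1.0], [6.0, 2.0]],
    rival_counts=[[[3.0, 4.0]], [[2.0, 5.0]]],
)
print(alpha)  # a feasible estimate; real merge lists supply thousands of constraints
```

The toy example is deliberately under-constrained; real tokenizers contribute thousands of merge steps, and the resulting constraints pin the proportions down tightly.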

Controlled Experiments

In controlled experiments, tokenizers were trained on known mixtures of natural languages, programming languages, and data sources. The findings are compelling:

  • The attack recovers mixture proportions three to six orders of magnitude more accurately than random guessing.
  • For mixtures of natural languages, which have inherently distinct vocabularies, the method achieves its highest success rate.
  • For mixtures of English-language domains, accuracy remains far better than random despite subtler vocabulary differences.

Application to Commercial Tokenizers

Applying their method to off-the-shelf commercial tokenizers, the authors confirm publicly disclosed details and make several new quantitative inferences:

  • GPT-4o's tokenizer is markedly more multilingual than its predecessors, trained on 39% non-English data.
  • GPT-3.5's and Claude's tokenizers are trained predominantly on code, which makes up roughly 60% of their training sets.
  • Llama 3 extends GPT-3.5's tokenizer primarily for multilingual use, with 48% non-English data.

Practical and Theoretical Implications

This research has several implications:

  1. Security and Privacy: Revealing the distributional properties of training data can leak proprietary information or enable targeted poisoning attacks.
  2. Auditing and Fairness: Understanding data mixtures aids in auditing models for biases, highlighting over- or under-represented languages and domains.
  3. Technical Insight: This method offers a tool for indirectly inferring pretraining data distribution when direct access to data is restricted.

Future Developments

Future research directions include:

  • Extending the method to tokenization schemes other than BPE.
  • Enhancing robustness against distribution shifts between the data used for tokenization and the actual pretraining set.
  • Exploring practical defenses that model producers could employ to mitigate such inference attacks.

Conclusion

The method proposed in this paper provides a powerful tool for inferring the distribution of training data from the properties of BPE tokenizers. By reverse engineering the merge lists, the study illuminates otherwise opaque aspects of model training data, contributing significantly to the discourse on model transparency, security, and fairness. As AI systems continue to integrate into critical societal functions, such research is pivotal in ensuring equitable and secure AI deployments.
