
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

(2403.06265)
Published Mar 10, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Although compression is the cornerstone of BPE, the most common tokenization algorithm, its importance in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, which can be viewed as 0-gram language modeling where equal probability is assigned to all tokens. We also demonstrate the empirical importance of compression for the downstream success of pre-trained language models. We control the compression ability of several BPE tokenizers by varying the number of documents available during their training: from 1 million documents down to a character-based tokenizer, equivalent to no training data at all. We then pre-train English language models based on those tokenizers and fine-tune them on several tasks. We show that there is a correlation between tokenizers' compression and models' downstream performance, suggesting that compression is a reliable intrinsic indicator of tokenization quality. These correlations are more pronounced for generation tasks (over classification) and for smaller models (over larger ones). We replicated a representative part of our experiments on Turkish and found similar results, confirming that our findings hold for languages with typological characteristics dissimilar to English. We conclude that building better-compressing tokenizers is a fruitful avenue for further research and for improving overall model performance.

Figure: Relationship between Turkish word abundance and subword count in unseen documents.

Overview

  • The paper examines the role of text compression in tokenization and its impact on the performance of pre-trained language models, finding a correlation between a tokenizer's compression ability and language model success.

  • A methodical comparison of tokenizers trained with varying amounts of data reveals that more training data improves a tokenizer's text compression ability, which enhances model performance in downstream tasks.

  • The study's findings are consistent across different languages, as demonstrated by experiments in both English and Turkish, highlighting the universal importance of effective tokenization.

  • Evaluating a tokenizer's intrinsic quality through its compression efficiency offers new avenues for future research, emphasizing the need for better-compressing tokenizers to improve language model performance.

Unpacking Tokenization: A Close Look at Text Compression and Model Performance

Introduction

The paper "Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance" explores the significance of text compression in the tokenization process and its correlation with the downstream success of pre-trained LLMs. The authors argue that text compression can be viewed as a form of $0$-gram language modeling where all tokens are assigned equal probability. By manipulating the compression ability of Byte Pair Encoding (BPE) tokenizers through varying the amount of training data—ranging from a character-level tokenizer (equivalent to zero training data) to tokenizers trained on 1 million documents—the authors endeavor to elucidate the intrinsic quality of tokenizers and their extrinsic impact on model performance across several tasks and languages.

Methodology

The authors compared tokenizers by controlling their "support," i.e., the amount of training data available to them. This allowed them to examine how a tokenizer's compression ability affects language model performance across different tasks. English was the primary focus, with models pre-trained on the C4 corpus and fine-tuned on a combination of classification and generation tasks. For intrinsic evaluation, the tokenizers' ability to compress text was measured; extrinsic evaluation focused on performance across the selected NLP tasks. Additionally, Turkish was used for a subset of the experiments to check whether the findings hold for languages with different typological characteristics.
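As a rough illustration of this setup (not the authors' exact implementation), the sketch below trains BPE tokenizers on increasing numbers of documents using the HuggingFace tokenizers library. The vocabulary size and the load_c4_documents helper are assumptions made for the example.

```python
# Sketch: train BPE tokenizers with varying "support" (number of training documents).
# Assumes the HuggingFace `tokenizers` library; the vocabulary size and the
# load_c4_documents() helper are illustrative, not the paper's exact setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(documents, vocab_size=32_000):
    """Train a BPE tokenizer from an iterable of document strings."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(documents, trainer=trainer)
    return tokenizer

corpus = load_c4_documents()  # hypothetical helper returning a list of document strings
tokenizers_by_support = {
    n_docs: train_bpe(corpus[:n_docs])
    for n_docs in (10, 1_000, 100_000, 1_000_000)  # from almost no support up to 1M docs
}
```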

Findings

Compression Ability: The study found a direct correlation between a tokenizer's compression ability and the amount of supporting data it was trained on. Tokenizers trained with minimal data produced significantly longer token sequences than those trained with adequate data: the more supporting data, the better the compression.
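One way to quantify this intrinsic property (a common formulation, not necessarily the paper's exact metric) is the average number of tokens a tokenizer needs per character of held-out text. Continuing the sketch above, with held_out_docs standing in for unseen documents:

```python
# Sketch: an intrinsic compression measure on held-out text (fewer tokens per
# character means better compression). This is one common formulation, not
# necessarily the paper's exact definition; held_out_docs is a hypothetical
# list of unseen document strings.
def tokens_per_character(tokenizer, held_out_docs):
    total_tokens = sum(len(tokenizer.encode(doc).ids) for doc in held_out_docs)
    total_chars = sum(len(doc) for doc in held_out_docs)
    return total_tokens / total_chars

for n_docs, tok in tokenizers_by_support.items():
    ratio = tokens_per_character(tok, held_out_docs)
    print(f"{n_docs:>9} training docs -> {ratio:.3f} tokens per character")
```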

Extrinsic Performance: The experiments demonstrated a monotonic relationship between the amount of supporting data a tokenizer had and the downstream performance of models built on it. This correlation was stronger for generation tasks and more pronounced in smaller models.
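A minimal sketch of how such a correlation could be computed across the tokenizers, assuming a hypothetical evaluate_downstream() that stands in for pre-training and fine-tuning a model with a given tokenizer and returning its task score; Spearman correlation is used here for illustration and may differ from the paper's exact statistic:

```python
# Sketch: correlate intrinsic compression with downstream task scores across tokenizers.
# evaluate_downstream() is a hypothetical stand-in for pre-training and fine-tuning a
# model with the given tokenizer and returning its score on a downstream task.
from scipy.stats import spearmanr

compression = [tokens_per_character(t, held_out_docs) for t in tokenizers_by_support.values()]
downstream = [evaluate_downstream(t) for t in tokenizers_by_support.values()]

rho, p_value = spearmanr(compression, downstream)
print(f"Spearman correlation between compression and downstream score: {rho:.2f} (p={p_value:.3f})")
```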

Language Generalization: The patterns observed in English held true when tested on Turkish, suggesting that the importance of text compression in tokenization is not language-specific.

Analysis

The paper breaks new ground by quantitatively demonstrating the effect of tokenization, and specifically of its compression capability, on the performance of language models. The results suggest that tokenization is especially consequential for generative tasks and for smaller models. This stands to reason: generative tasks require extensive use of the tokenizer, and smaller models have less capacity to compensate for poor tokenization.

Interestingly, the intrinsic and extrinsic evaluations of tokenization quality presented in this study reveal a clear path for future research and development: creating better-compressing tokenizers could lead to improved overall model performance. The authors also note that a tokenizer's support directly affects its compression efficiency, pointing to the potential benefits of increasing the size of the dataset used to train the tokenizer.

Conclusion

This paper contributes a novel perspective on the crucial role of tokenization in the development of LLMs by showcasing the intrinsic value of compression as an indicator of tokenizer quality and its correlation with downstream task performance. The findings across English and Turkish emphasize the importance of compression in tokenization and suggest beneficial directions for future tokenizer development. As larger and more complex models continue to evolve, understanding the foundational elements, such as tokenization, becomes imperative for improving efficiency and effectiveness in natural language processing tasks.

Future Work

While this study provides significant insights, it also opens avenues for future research, including expanding the experiments to other languages and exploring other intrinsic measures of tokenization quality. Additionally, investigating the impact of tokenization on larger models could further refine our understanding of its role in the performance of LLMs.
