
Tokenization Is More Than Compression

(arXiv:2402.18376)
Published Feb 28, 2024 in cs.CL and cs.AI

Abstract

Tokenization is a foundational step in NLP tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer tokens lead to better downstream performance by introducing PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary. Through extensive experimentation we find this hypothesis not to be the case, casting doubt on the understanding of the reasons for effective tokenization. To examine which other factors play a role, we evaluate design decisions across all three phases of tokenization: pre-tokenization, vocabulary construction, and segmentation, offering new insights into the design of effective tokenizers. Specifically, we illustrate the importance of pre-tokenization and the benefits of using BPE to initialize vocabulary construction. We train 64 language models with varying tokenization, ranging in size from 350M to 2.4B parameters, all of which are made publicly available.

Figure: Pre-tokenization process of PathPieceL $n$-gram with a parameter $p$ value of 0.54.

Overview

  • The paper introduces PathPiece, a tokenizer developed to minimize Corpus Token Count (CTC), and explores tokenization's effects on NLP performance.

  • It challenges the assumption that fewer tokens correlate with better downstream task performance, providing evidence to the contrary.

  • The study conducts experiments across different tokenization stages with 64 language models, revealing no direct correlation between reduced CTC and enhanced performance.

  • Findings suggest the importance of pre-tokenization, vocabulary construction methods, and segmentation strategies over mere token count reduction.

Unraveling the Intricacies of Tokenization in NLP: A Study Through the Lens of PathPiece

Introduction

Tokenization, a pivotal preprocessing stage in NLP, translates human-readable text into tokens for subsequent use by statistical models. Bridging the gap between raw text and language models, tokenization significantly impacts the effectiveness of NLP applications. This paper scrutinizes tokenization beyond the conventional view, questioning whether reducing the number of tokens actually improves downstream performance. By introducing PathPiece, a tokenizer designed to minimize the Corpus Token Count (CTC), the research provides a detailed examination of tokenization, shedding light on the factors crucial to its effectiveness and challenging several prevailing assumptions in the field.
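
To ground the discussion, the Corpus Token Count (CTC) is simply the total number of tokens a tokenizer produces over a corpus. The sketch below shows one way such a count might be computed; the `tokenize` callable and the toy corpus are illustrative stand-ins, not the paper's code.

```python
# Minimal sketch (not from the paper): Corpus Token Count (CTC) is the total
# number of tokens a tokenizer emits over a corpus. `tokenize` is a hypothetical
# callable mapping a document to a list of tokens.
from typing import Callable, Iterable, List


def corpus_token_count(docs: Iterable[str], tokenize: Callable[[str], List[str]]) -> int:
    return sum(len(tokenize(doc)) for doc in docs)


# Toy example with a whitespace tokenizer standing in for BPE, PathPiece, etc.
docs = ["tokenization is more than compression", "fewer tokens are not always better"]
print(corpus_token_count(docs, str.split))  # 11
```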

Related Work and Background

Tokenization is traditionally divided into three stages: pre-tokenization, vocabulary construction, and segmentation. Each plays a distinct role in splitting text into units that models can manipulate. While Byte-Pair Encoding (BPE) and related methods such as WordPiece and Unigram have dominated the scene, their design has focused primarily on compression efficiency, under the hypothesis that fewer tokens lead to better model performance. The introduction of PathPiece challenges this notion by directly comparing CTC with downstream task performance across varied tokenization strategies.
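
For intuition, the bottom-up merge loop at the heart of BPE vocabulary construction can be sketched in a few lines. The word list below is hypothetical, and the sketch omits the byte-level fallback, pre-tokenization rules, and vocabulary-size target used in practice; it is not the paper's implementation.

```python
# Toy sketch of BPE vocabulary construction: repeatedly merge the most frequent
# adjacent pair of symbols in the corpus.
from collections import Counter


def bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    corpus = Counter(tuple(w) for w in words)  # each word as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


print(bpe_merges(["lower", "lowest", "low", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```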

The PathPiece Experiment

PathPiece emerged from examining whether a tokenizer that inherently minimizes token count could outperform traditional methods on downstream tasks. The tokenization process was dissected into its foundational stages, allowing a detailed analysis of how changes in each phase affect overall performance. By training 64 language models with varying tokenization strategies, at sizes ranging from 350 million to 2.4 billion parameters, the study produced an extensive basis for comparison. The experiments altered pre-tokenization rules, vocabulary construction mechanisms, and segmentation methods, offering rich insights into the tokenization process. Notably, PathPiece's top-down vocabulary construction approach provided an ideal setting in which to test the hypothesized benefits of reduced CTC.
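
The idea at the heart of PathPiece, segmenting text into the fewest possible tokens for a given vocabulary and maximum token width, can be framed as a shortest-path dynamic program over the text. The sketch below illustrates that idea under the assumption that every single character is in the vocabulary; the toy vocabulary and the tie-breaking rule (preferring the longest token, in the spirit of the PathPieceL variant) are assumptions for illustration, not the authors' exact implementation.

```python
# Illustrative minimum-token segmentation via dynamic programming.
# Assumes every single character is in the vocabulary, so a valid
# segmentation always exists.
import math


def min_token_segmentation(text: str, vocab: set[str], max_len: int) -> list[str]:
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = fewest tokens covering text[:i]
    back = [0] * (n + 1)          # back[i] = length of the last token in an optimal path
    best[0] = 0
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            if text[i - k:i] in vocab and best[i - k] + 1 <= best[i]:
                # "<=" keeps the longest qualifying last token on ties
                best[i] = best[i - k] + 1
                back[i] = k
    tokens, i = [], n
    while i > 0:
        tokens.append(text[i - back[i]:i])
        i -= back[i]
    return tokens[::-1]


vocab = {"t", "o", "k", "e", "n", "s", "to", "ken", "token", "tokens"}
print(min_token_segmentation("tokens", vocab, max_len=6))  # ['tokens']
```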

Insights and Findings

Contrary to popular belief, the study found no direct correlation between reduced CTC and enhanced downstream performance. This finding challenges the core assumption behind the effectiveness of methods like BPE, suggesting that factors beyond mere token-count reduction play substantial roles in determining how well a tokenization strategy works. Specifically, the study highlighted:

  • The significance of pre-tokenization, with findings suggesting that how spaces and digits are handled can influence model performance more than the overall token count (a sketch of such variants follows this list).
  • Variability in the efficacy of vocabulary construction methods, with a top-down approach that uses BPE to initialize the vocabulary performing best among the tested strategies.
  • A nuanced understanding of segmentation, showing that the choice of segmentation method (including length vs. random tie-breaking strategies) does not significantly impact model performance when controlling for other variables.
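
To make the pre-tokenization point concrete, the sketch below contrasts two hypothetical rules for handling spaces and digits: attaching a leading space to the following word while keeping digit runs whole, versus splitting numbers into single digits. The regular expressions are illustrative only and are not the exact rules evaluated in the paper.

```python
# Illustrative pre-tokenization variants (not the paper's exact rules): how spaces
# and digits are grouped before vocabulary construction and segmentation.
import re


def pretokenize_space_prefix(text: str) -> list[str]:
    # Attach a leading space to the following word; keep digit runs whole.
    return re.findall(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s", text)


def pretokenize_digits_split(text: str) -> list[str]:
    # Same, but split numbers into individual digits.
    return re.findall(r" ?[A-Za-z]+| ?\d| ?[^\sA-Za-z\d]+|\s", text)


text = "Model v2 scored 87.5%"
print(pretokenize_space_prefix(text))  # ['Model', ' v', '2', ' scored', ' 87', '.', '5', '%']
print(pretokenize_digits_split(text))  # ['Model', ' v', '2', ' scored', ' 8', '7', '.', '5', '%']
```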

Implications and Future Directions

This research contributes to a re-evaluation of tokenization as understood in NLP. By breaking down tokenization into its component stages and rigorously testing each, the paper invites a more nuanced appreciation of what constitutes effective tokenization. The findings open several avenues for future research, particularly in exploring the morphology and semantics inherent in the tokenization process and its impact on model understanding and performance. Additionally, the introduction of PathPiece and the comprehensive dataset of trained models provide valuable resources for continued exploration in this domain.

Conclusion

The investigation into the effects of tokenization on downstream performance using PathPiece reveals a complex landscape where reducing the number of tokens does not necessarily equate to better model performance. This challenges previous assumptions about tokenization efficacy, emphasizing the importance of a multifaceted approach to tokenizer design. Through a meticulous examination of tokenization stages and their impact on language models, this study contributes to a deeper understanding of the fundamental processes underpinning NLP, paving the way for more informed and effective tokenizer developments in the future.
