
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Published 20 Dec 2021 in cs.CL and cs.LG | (2112.10508v1)

Abstract: What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most NLP models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural era, by showing how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated. We conclude that there is and likely will never be a silver bullet singular solution for all applications and that thinking seriously about tokenization remains important for many applications.

Citations (122)

Summary

  • The paper presents a detailed history of tokenization techniques, highlighting the transition from word-based methods to advanced subword modeling.
  • It explains how methods such as BPE and Unigram models optimize vocabulary size and boost NLP performance in morphologically diverse languages.
  • The study discusses practical implications and future prospects, including integrating linguistic and probabilistic approaches for improved model robustness.

A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Introduction to Tokenization in NLP

In NLP, tokenization is the process of splitting text into smaller units called tokens. Traditionally, these tokens were words, treated as discrete and atomic units closely aligned with linguistic constructs such as word forms. The advent of byte-pair encoding (BPE) and other subword tokenization methods has shifted standard practice, allowing models to handle open vocabularies effectively: processing at the subword level balances vocabulary size against model inference speed.

Fundamental Concepts: Tokens and Subwords

Historically, NLP models tokenized text along typographic boundaries, using pre-tokenizers that split on whitespace in languages such as English. This approach struggles with contractions, compounds, and other linguistic phenomena that deviate from simple space-separated word forms. More recent work advocates subword units, which offer a better compromise between character-level processing and traditional word-based methods. Such approaches improve robustness to rare and novel words, whether by using character information to enhance word embeddings or by building open-vocabulary models that dispense with a fixed-size vocabulary.
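The limits of whitespace pre-tokenization show up in a few lines of Python; the sentence and the regex pattern below are purely illustrative:

```python
import re

text = "Don't undercount NLP's tokenization-related subtleties."

# Naive whitespace pre-tokenization: clitics, possessives, hyphens,
# and punctuation all stay glued to the neighboring word.
whitespace_tokens = text.split()
# ["Don't", 'undercount', "NLP's", 'tokenization-related', 'subtleties.']

# A slightly smarter rule-based pre-tokenizer that separates punctuation,
# which in turn splits contractions in ways a linguist might dispute.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['Don', "'", 't', 'undercount', 'NLP', "'", 's', 'tokenization',
#  '-', 'related', 'subtleties', '.']
```

Neither result is obviously "correct", which is precisely why deterministic pre-tokenization rules accumulate special cases.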

Evolution of Tokenization Techniques

Initially, NLP models relied heavily on deterministic tokenizers, such as those rooted in finite-state transducers and morphological analyzers, which were developed to approximate linguistic units. The rise of subword tokenization methods like BPE and WordPiece marks a significant transition: BPE iteratively merges the most frequent pairs of adjacent symbols, yielding a compact vocabulary of a chosen size while keeping inference efficient. Figure 1 situates these algorithms within a broader taxonomy.

Figure 1: A taxonomy of segmentation and tokenization algorithms and research directions.
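As a concrete illustration, here is a minimal sketch of the BPE merge loop on a toy corpus in the style of the classic example from Sennrich et al.; the word counts and the number of merges are chosen for illustration, and real implementations add end-of-word markers and many practical details:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# Toy corpus: each word is a space-separated sequence of characters,
# mapped to its frequency in the training data.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(3):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(best, vocab)
```

After three iterations the learned merges are `('e','s')`, `('es','t')`, and `('l','o')`, producing subwords such as `est` that generalize across `newest` and `widest`; the merge list itself is what gets applied at inference time.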

In addition, techniques such as unigram language models have emerged, casting tokenization as a probabilistic problem: models can sample segmentations from a distribution over possible splits, which acts as a regularizer and enhances linguistic diversity and transfer capabilities in multilingual settings.
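A minimal sketch of unigram-style segmentation sampling, assuming a tiny hypothetical subword vocabulary with made-up probabilities; real implementations such as SentencePiece learn these probabilities with EM and use dynamic programming rather than brute-force enumeration:

```python
import random

# Hypothetical unigram vocabulary: subword -> probability (illustrative values).
probs = {"un": 0.08, "happy": 0.05, "unhappy": 0.02,
         "h": 0.01, "appy": 0.005, "u": 0.01, "n": 0.01}

def segmentations(word):
    """Enumerate all ways to split `word` into in-vocabulary subwords."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in probs:
            for rest in segmentations(word[i:]):
                results.append([prefix] + rest)
    return results

def score(seg):
    """Unigram probability of a segmentation: product of piece probabilities."""
    p = 1.0
    for piece in seg:
        p *= probs[piece]
    return p

def sample_segmentation(word, rng=random):
    """Sample a segmentation proportional to its unigram probability."""
    segs = segmentations(word)
    weights = [score(s) for s in segs]
    return rng.choices(segs, weights=weights, k=1)[0]
```

For `"unhappy"` this vocabulary admits five segmentations; the single-piece split `["unhappy"]` scores highest, but training occasionally sees alternatives like `["un", "happy"]`, which is the essence of subword regularization.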

Practical and Theoretical Implications

The theoretical implications of these tokenization advancements are profound, as they suggest that there is no singular optimal tokenization strategy for all NLP applications. Instead, tokenization must be carefully tailored to the task and domain, often requiring bespoke engineering efforts. Practically, these approaches have led to significant improvements in NLP tasks, including language modeling and machine translation, particularly for languages with rich morphological features or those lacking explicit word boundaries.

Future Prospects and Conclusion

Future developments in tokenization are likely to explore deeper integration of linguistic and statistical methods, potentially incorporating sub-character and visual tokenization, which use pixel-level representations for robustness against orthographic variation. Models such as CANINE and ByT5 demonstrate the feasibility of bypassing traditional tokenization altogether by operating directly on characters or bytes, although they still face challenges in computational efficiency and interpretability.
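Byte-level models side-step tokenization by treating the UTF-8 byte sequence itself as the input, as this short sketch shows (the vocabulary is then at most 256 symbols plus special tokens, so no word can be out-of-vocabulary):

```python
# Byte-level "tokenization": the model's input IDs are simply the
# UTF-8 bytes of the text, so any string in any script is representable.
text = "naïve"
byte_ids = list(text.encode("utf-8"))
# The 'ï' occupies two bytes in UTF-8, so 5 characters become 6 IDs.
```

The trade-off is longer sequences for the same text, which is one source of the computational-efficiency concerns noted above.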

In conclusion, while tokenization remains an unresolved and complex facet of NLP, continuous research and innovation promise to refine and optimize these processes further, potentially revisiting linguistically informed methods when combined with modern computational techniques. As the field progresses, it is crucial to maintain a flexible approach that considers linguistic, computational, and practical aspects of tokenization.
