MorphPiece : A Linguistic Tokenizer for Large Language Models (2307.07262v2)

Published 14 Jul 2023 in cs.CL

Abstract: Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for LLMs are based on statistical analysis of text corpora, without much consideration to the linguistic features. I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks, compared to the OpenAI GPT-2 model. Specifically, I evaluated MorphGPT on language modeling tasks, zero-shot performance on the GLUE Benchmark with various prompt templates, the Massive Text Embedding Benchmark (MTEB) for supervised and unsupervised performance, and lastly with another morphological tokenization scheme (FLOTA, Hoffmann et al., 2022) and find that the model trained on MorphPiece outperforms GPT-2 on most evaluations, at times with considerable margin, despite being trained for about half the training iterations.

References (46)
  1. Evaluating various tokenizers for Arabic text classification. CoRR, abs/2106.07540, 2021. URL https://arxiv.org/abs/2106.07540.
  2. AraBERT: Transformer-based model for Arabic language understanding. CoRR, abs/2003.00104, 2020. URL https://arxiv.org/abs/2003.00104.
  3. An evaluation of two vocabulary reduction methods for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pp. 97–110, Boston, MA, March 2018. Association for Machine Translation in the Americas. URL https://aclanthology.org/W18-1810.
  4. PromptSource: An integrated development environment and repository for natural language prompts. In Basile, V., Kozareva, Z., and Stajner, S. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  93–104, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-demo.9. URL https://aclanthology.org/2022.acl-demo.9.
  5. Meaningless yet meaningful: Morphology grounded subword-level NMT. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pp.  55–60, New Orleans, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-1207. URL https://aclanthology.org/W18-1207.
  6. MorphyNet: a large multilingual database of derivational and inflectional morphology. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp.  39–48, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.sigmorphon-1.5. URL https://aclanthology.org/2021.sigmorphon-1.5.
  7. The SIGMORPHON 2022 shared task on morpheme segmentation. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp.  103–116, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.sigmorphon-1.11. URL https://aclanthology.org/2022.sigmorphon-1.11.
  8. Enriching word vectors with subword information. CoRR, abs/1607.04606, 2016. URL http://arxiv.org/abs/1607.04606.
  9. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  4617–4624, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.414. URL https://aclanthology.org/2020.findings-emnlp.414.
  10. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  11. A joint model of orthography and morphological segmentation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  664–669, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1080. URL https://aclanthology.org/N16-1080.
  12. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.  536–541, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2085. URL https://aclanthology.org/N18-2085.
  13. Unsupervised morphology induction using morfessor. In FSMNLP, volume 4002 of Lecture Notes in Computer Science, pp.  300–301. Springer, 2005.
  14. How much does tokenization affect neural machine translation? CoRR, abs/1812.08621, 2018. URL http://arxiv.org/abs/1812.08621.
  15. Falcon, W. and The PyTorch Lightning team. PyTorch Lightning, March 2019. URL https://github.com/Lightning-AI/lightning.
  16. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  17. Morfessor EM+Prune: Improved subword segmentation with expectation maximization and pruning. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp.  3944–3953, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.486.
  18. DagoBERT: Generating derivational morphology with a pretrained language model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.
  19. Superbizarre is not superb: Improving BERT's interpretations of complex words with derivational morphology. CoRR, abs/2101.00403, 2021. URL https://arxiv.org/abs/2101.00403.
  20. An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022.
  21. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 02 2021. ISSN 1367-4803. doi: 10.1093/bioinformatics/btab083. URL https://doi.org/10.1093/bioinformatics/btab083.
  22. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  23. Kudo, T. Subword regularization: Improving neural network translation models with multiple subword candidates. CoRR, abs/1804.10959, 2018. URL http://arxiv.org/abs/1804.10959.
  24. Morpho challenge 2005-2010: Evaluations and results. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pp.  87–95, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-2211.
  25. Meal: Stable and active learning for few-shot prompting, 2023.
  26. Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training. Interdisciplinary sciences, computational life sciences, 15(1):32—43, March 2023. ISSN 1913-2751. doi: 10.1007/s12539-022-00537-9. URL https://doi.org/10.1007/s12539-022-00537-9.
  27. Morphological and language-agnostic word segmentation for NMT. CoRR, abs/1806.05482, 2018. URL http://arxiv.org/abs/1806.05482.
  28. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology.org/J93-2004.
  29. Using morphological knowledge in open-vocabulary neural language models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1435–1445, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1130. URL https://aclanthology.org/N18-1130.
  30. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013. URL http://arxiv.org/abs/1310.4546.
  31. MTEB: Massive text embedding benchmark. In Vlachos, A. and Augenstein, I. (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148.
  32. Morphological word segmentation on agglutinative languages for neural machine translation. CoRR, abs/2001.01589, 2020. URL http://arxiv.org/abs/2001.01589.
  33. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1525–1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.
  34. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp.  1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
  35. Deep contextualized word representations. CoRR, abs/1802.05365, 2018. URL http://arxiv.org/abs/1802.05365.
  36. V-measure: A conditional entropy-based external cluster evaluation measure. In Eisner, J. (ed.), Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.  410–420, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL https://aclanthology.org/D07-1043.
  37. The effectiveness of morphology-aware segmentation in low-resource neural machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp.  164–174, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-srw.22. URL https://aclanthology.org/2021.eacl-srw.22.
  38. Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5149–5152, 2012.
  39. AlephBERT: Language model pre-training and evaluation from sub-word to sentence level. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  46–56, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.4. URL https://aclanthology.org/2022.acl-long.4.
  40. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909, 2015. URL http://arxiv.org/abs/1508.07909.
  41. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp.  21–24, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. doi: 10.3115/v1/E14-2006. URL https://aclanthology.org/E14-2006.
  42. Super-convergence: Very fast training of residual networks using large learning rates. CoRR, abs/1708.07120, 2017. URL http://arxiv.org/abs/1708.07120.
  43. Impact of tokenization on language models: An analysis for Turkish. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 22(4), March 2023. ISSN 2375-4699. doi: 10.1145/3578707. URL https://doi.org/10.1145/3578707.
  44. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018. URL http://arxiv.org/abs/1804.07461.
  45. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019. URL http://arxiv.org/abs/1910.03771.
  46. Zhou, G. Morphological zero-shot neural machine translation. Master’s thesis, School of Informatics, University of Edinburgh, 2018.

Summary

  • The paper introduces MorphPiece, a hybrid tokenization scheme combining BPE with morpheme-based segmentation to provide linguistically aligned token splits.
  • It leverages a MorphTable built from MorphyNet, containing morphological segmentations of 346,340 English words, and combines the resulting affixes and stems with a custom BPE vocabulary to yield a final vocabulary of 50,006 tokens.
  • MorphGPT-Base, trained with MorphPiece, outperforms GPT-2, achieving nearly 10% higher LAMBADA accuracy and better zero-shot results despite roughly half the training steps.

MorphPiece: A Linguistic Tokenizer for LLMs

This paper introduces MorphPiece, a novel tokenization scheme for LLMs that integrates morphological segmentation with statistical methods to improve linguistic alignment. The author posits that current tokenizers, relying primarily on statistical analysis, neglect valuable linguistic features present in natural language. By incorporating morphological information, MorphPiece aims to create more natural and efficient subword tokenizations, leading to enhanced model performance across various NLP tasks.

MorphPiece Tokenization Scheme

The MorphPiece tokenization scheme (Figure 1) combines BPE with morpheme-based segmentation. The input text first undergoes normalization and pre-tokenization, following the standard BPE pipeline. Each pre-token is then looked up in MorphTable, a pre-computed table of morphological segmentations of English words. If a segmentation is found, the pre-token is replaced with its constituent morphemes; otherwise, it is split using BPE with a custom-trained vocabulary.

Figure 1: MorphPiece tokenization scheme integrates BPE pre-tokenization with a lookup in MorphTable for morpheme-based segmentation, falling back to BPE with a custom-trained vocabulary when no morphological segmentation is available.
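
The lookup-then-fallback flow can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: `morph_table` is a hypothetical dictionary mapping words to their morpheme sequences, `bpe_tokenize` stands in for the custom-trained BPE fallback, and pre-tokenization is reduced to a whitespace split.

```python
# Minimal sketch of the MorphPiece lookup-then-fallback flow described above.
# `morph_table` and `bpe_tokenize` are stand-ins for the paper's actual
# MorphTable and custom-trained BPE vocabulary.
from typing import Dict, List, Callable

def morphpiece_tokenize(text: str,
                        morph_table: Dict[str, List[str]],
                        bpe_tokenize: Callable[[str], List[str]]) -> List[str]:
    """Tokenize `text`: use the MorphTable segmentation when available,
    otherwise fall back to BPE with the custom-trained vocabulary."""
    tokens: List[str] = []
    # Pre-tokenization: a simple whitespace split stands in for the
    # normalization / pre-tokenization stage of the BPE pipeline.
    for pre_token in text.strip().split():
        word = pre_token.lower()
        if word in morph_table:
            tokens.extend(morph_table[word])        # morpheme path
        else:
            tokens.extend(bpe_tokenize(pre_token))  # statistical fallback
    return tokens

# Toy usage with a hand-written table entry and a trivial BPE stand-in.
table = {"paratrooper": ["para#", "troop", "#er"]}
print(morphpiece_tokenize("paratrooper lands", table, lambda w: [w]))
# -> ['para#', 'troop', '#er', 'lands']
```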

MorphTable Construction

MorphTable is constructed from MorphyNet, a database of derivational and inflectional morphology covering 15 languages. From this database, a lookup table of 346,340 English words segmented into morphemes was created. After filtering out entries occurring fewer than five times in the training corpus, the affix-and-stem vocabulary was trimmed to 18,304 tokens and the table to 134,943 entries. A rough sketch of this filtering step follows this paragraph.
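
The sketch below assumes the MorphyNet-derived table is a word-to-morphemes dictionary and that the five-occurrence threshold is applied per word against corpus counts; both details are assumptions, not quoted from the paper.

```python
# Sketch only: filter the MorphyNet-derived table against corpus frequencies
# and collect the resulting affix/stem vocabulary.
from collections import Counter
from typing import Dict, List, Set, Tuple

def build_morph_table(raw_table: Dict[str, List[str]],
                      word_counts: Counter,
                      min_count: int = 5) -> Tuple[Dict[str, List[str]], Set[str]]:
    """Keep only entries whose word occurs at least `min_count` times."""
    table = {word: morphs for word, morphs in raw_table.items()
             if word_counts[word] >= min_count}
    vocab = {m for morphs in table.values() for m in morphs}  # affixes + stems
    return table, vocab

# Toy usage with hand-written entries.
raw = {"paratrooper": ["para#", "troop", "#er"], "xylophonist": ["xylophone", "#ist"]}
counts = Counter({"paratrooper": 12, "xylophonist": 2})
table, vocab = build_morph_table(raw, counts)
print(len(table), sorted(vocab))   # -> 1 ['#er', 'para#', 'troop']
```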

Vocabulary Composition

The MorphPiece vocabulary comprises two components: the affixes and stems extracted from MorphTable, and a BPE vocabulary trained on OpenWebText. The BPE vocabulary is trained to a size of 32,000 tokens, targeting a final vocabulary comparable to GPT-2's 50,257 tokens. Words with segmentations available in MorphTable are excluded from the BPE training corpus. The final vocabulary contains 50,006 tokens.
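
A hedged sketch of this assembly step using the HuggingFace `tokenizers` library is shown below; the corpus filtering and the trainer settings are illustrative assumptions, not the paper's released code.

```python
# Sketch: train the fallback BPE vocabulary on OpenWebText-style text from
# which MorphTable words have been removed, then combine it with the
# affix/stem vocabulary.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def bpe_corpus(lines, morph_table):
    """Yield lines with MorphTable words stripped, so BPE only learns
    merges for words that lack a morphological segmentation."""
    for line in lines:
        kept = [w for w in line.split() if w.lower() not in morph_table]
        if kept:
            yield " ".join(kept)

def train_fallback_bpe(lines, morph_table, vocab_size=32_000) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(bpe_corpus(lines, morph_table), trainer=trainer)
    return tok

# Final MorphPiece vocabulary: affix/stem tokens (18,304) plus BPE tokens
# (32,000), minus any overlap, giving the 50,006 tokens reported in the paper.
```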

Tokenization Examples

The paper provides examples showing how MorphPiece splits words into linguistically meaningful affixes and stems. For instance, "paratrooper" is segmented as ('para#', 'troop', '#er') by MorphPiece, matching the word's morphological structure, whereas the BPE and WordPiece tokenizers split it into ('par', 'atro', 'oper') and ('para', '##tro', '##oper'), respectively. The position and presence or absence of the '#' symbol distinguish prefixes, suffixes, compound-word joiners, and stems. The paper argues that these splits align more closely with the linguistic parts of the word than those of purely statistical tokenizers.
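
The marker convention can be made concrete with a tiny helper; the classification rules below are inferred from the examples and from the detokenization section, not quoted from the paper.

```python
# Illustrative only: read off a token's role from the position of '#'.
def classify(token: str) -> str:
    if token == "#":
        return "hash"      # joiner between parts of a compound word
    if token.endswith("#"):
        return "prefix"    # e.g. 'para#'
    if token.startswith("#"):
        return "suffix"    # e.g. '#er'
    return "stem"          # e.g. 'troop'

print([classify(t) for t in ("para#", "troop", "#er")])
# -> ['prefix', 'stem', 'suffix']
```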

MorphGPT LLM

To validate the effectiveness of MorphPiece, the author trained a GPT-2 (Base) architecture with MorphPiece, named MorphGPT-Base, and compared it against the OpenAI GPT-2 model that uses BPE.

Training Details

MorphGPT-Base was trained for 200k steps on the OpenWebText corpus with a batch size of 512 and a one-cycle learning rate scheduler. Training was performed on NVIDIA A100 GPUs using HuggingFace's implementation of GPT-2 with PyTorch Lightning. For comparison, the author estimates that GPT-2 was trained for approximately 400k-500k steps.
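
A minimal sketch of this setup, assuming current HuggingFace Transformers and PyTorch Lightning APIs, is shown below; only the 200k-step budget and the one-cycle schedule come from the paper, while the learning rate, optimizer choice, and Trainer settings are assumptions.

```python
# Sketch of the MorphGPT-Base training loop: GPT-2 (Base) architecture from
# HuggingFace Transformers, PyTorch Lightning module, one-cycle LR schedule.
import torch
import pytorch_lightning as pl
from transformers import GPT2Config, GPT2LMHeadModel

class MorphGPTModule(pl.LightningModule):
    def __init__(self, vocab_size=50_006, max_lr=6e-4, total_steps=200_000):
        super().__init__()
        self.save_hyperparameters()
        self.model = GPT2LMHeadModel(GPT2Config(vocab_size=vocab_size))

    def training_step(self, batch, batch_idx):
        # Causal LM loss: labels are the input ids themselves (shifted internally).
        out = self.model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        self.log("train_loss", out.loss)
        return out.loss

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=self.hparams.max_lr)
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=self.hparams.max_lr, total_steps=self.hparams.total_steps)
        return {"optimizer": opt,
                "lr_scheduler": {"scheduler": sched, "interval": "step"}}

# Illustrative launch (device count and precision are assumptions):
# trainer = pl.Trainer(max_steps=200_000, accelerator="gpu", devices=8,
#                      precision="16-mixed")
# trainer.fit(MorphGPTModule(), train_dataloaders=train_loader)
```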

Evaluation Tasks

The performance of MorphGPT was evaluated on various NLP tasks, including perplexity on different datasets, the LAMBADA task, MTEB, and zero-shot prompt-based evaluations on GLUE.
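
For reference, the perplexity part of this evaluation can be sketched as follows. The snippet uses the public GPT-2 checkpoint and a single string as stand-ins, since the MorphGPT checkpoint name is not given here and real evaluations run over full held-out corpora.

```python
# Sketch: token-level perplexity of a causal LM on a piece of text.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy per token
    return math.exp(loss.item())

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(perplexity(model, tok, "Tokenization is a critical part of modern NLP pipelines."))
```

Note that token-level perplexity depends on the tokenizer's segmentation, so absolute values are tied to each model's own vocabulary.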

Results and Comparison

MorphGPT consistently demonstrates superior performance compared to GPT-2 across almost all evaluations, despite being trained for approximately half the number of steps. Specifically, MorphGPT achieved significantly better token-level perplexity scores, with performance comparable to GPT-2 (Large) after 200k steps. On the LAMBADA task, MorphGPT surpassed the accuracy of GPT-2 by almost 10% with only 50k steps, nearly reaching the accuracy of the GPT-2 Large model. In zero-shot GLUE evaluations, MorphGPT generally outperformed GPT-2, both in raw accuracy and the number of prompt templates where it showed superior performance.

The paper also presents a comparison with FLOTA, a tokenization improvement method that attempts to preserve the morphological structure of words during tokenization. MorphGPT outperformed FLOTA comprehensively on a classification task using a custom dataset of titles from arXiv, showing improvements of more than 35% over vanilla GPT-2, compared to about 6% for GPT-2+FLOTA.

Massive Text Embedding Benchmark (MTEB)

MorphGPT was evaluated on MTEB, which comprises 8 embedding task categories spanning a total of 58 datasets. MorphGPT outperforms GPT-2 across all 7 of the monolingual task categories.

Detokenization Process

The paper also addresses detokenization, i.e., converting the tokens produced by a MorphPiece-trained model back into coherent sentences. Tokens are first classified (Figure 2) as either 'morph' or 'bpe' based on their source; morph tokens are further annotated as prefix, suffix, stem, or hash (for compound words).

Figure 2: An example of detokenization illustrating how tokens are classified based on their source (MorphPiece or BPE) and how word boundaries are identified using the detokenization mechanism.

The detokenization mechanism (Figure 3) uses a reverse MorphTable to convert morpheme sequences back into English words, handling cases such as compound words and multiple affixes.

Figure 3: The detokenization mechanism illustrates the process of converting morphemes back into English words, with black lines indicating word continuation and red dashed lines indicating word boundaries.
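
The core of this mechanism can be approximated with a short sketch. The grouping rules and the reverse-table format below are assumptions based on the '#' convention described earlier; the actual procedure (Figure 3) handles more cases, including BPE token merging.

```python
# Hedged sketch of detokenization: group tokens into words using the '#'
# convention, then map morpheme sequences back to surface forms with a
# reverse MorphTable.
from typing import Dict, List, Tuple

def detokenize(tokens: List[str],
               reverse_table: Dict[Tuple[str, ...], str]) -> str:
    words: List[str] = []
    current: List[str] = []

    def flush():
        if current:
            key = tuple(current)
            # Look up the morpheme sequence; fall back to naive concatenation.
            words.append(reverse_table.get(key, "".join(t.strip("#") for t in current)))
            current.clear()

    for tok in tokens:
        if tok.endswith("#") or tok == "#":   # prefix or compound joiner: word continues
            current.append(tok)
        elif tok.startswith("#"):             # suffix: attach to the current word
            current.append(tok)
        else:                                 # stem or BPE token
            if current and not (current[-1].endswith("#") or current[-1] == "#"):
                flush()                       # previous word has ended
            current.append(tok)
    flush()
    return " ".join(words)

table = {("para#", "troop", "#er"): "paratrooper"}
print(detokenize(["para#", "troop", "#er", "lands"], table))
# -> 'paratrooper lands'
```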

Limitations

The paper acknowledges limitations, including incomplete coverage of lexical families in MorphyNet, the need to construct separate MorphTables and detokenization automata for each language, and a 17% increase in the number of tokens compared to BPE.

Conclusion

The author concludes that MorphPiece represents a linguistically motivated tokenization scheme that outperforms models trained on BPE across a wide variety of tasks. The paper suggests that incorporating linguistic inductive bias into tokenization can lead to a new generation of models that move away from purely statistical language representation.
