MorphPiece : A Linguistic Tokenizer for Large Language Models (2307.07262v2)

Published 14 Jul 2023 in cs.CL

Abstract: Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for LLMs are based on statistical analysis of text corpora, without much consideration to the linguistic features. I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks, compared to the OpenAI GPT-2 model. Specifically, I evaluated MorphGPT on language modeling tasks, zero-shot performance on the GLUE Benchmark with various prompt templates, the Massive Text Embedding Benchmark (MTEB) for supervised and unsupervised performance, and lastly with another morphological tokenization scheme (FLOTA, Hoffmann et al., 2022) and find that the model trained on MorphPiece outperforms GPT-2 on most evaluations, at times with considerable margin, despite being trained for about half the training iterations.

References (46)
  1. Evaluating various tokenizers for Arabic text classification. CoRR, abs/2106.07540, 2021. URL https://arxiv.org/abs/2106.07540.
  2. AraBERT: Transformer-based model for Arabic language understanding. CoRR, abs/2003.00104, 2020. URL https://arxiv.org/abs/2003.00104.
  3. An evaluation of two vocabulary reduction methods for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pp. 97–110, Boston, MA, March 2018. Association for Machine Translation in the Americas. URL https://aclanthology.org/W18-1810.
  4. PromptSource: An integrated development environment and repository for natural language prompts. In Basile, V., Kozareva, Z., and Stajner, S. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  93–104, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-demo.9. URL https://aclanthology.org/2022.acl-demo.9.
  5. Meaningless yet meaningful: Morphology grounded subword-level NMT. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pp.  55–60, New Orleans, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-1207. URL https://aclanthology.org/W18-1207.
  6. MorphyNet: a large multilingual database of derivational and inflectional morphology. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp.  39–48, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.sigmorphon-1.5. URL https://aclanthology.org/2021.sigmorphon-1.5.
  7. The SIGMORPHON 2022 shared task on morpheme segmentation. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp.  103–116, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.sigmorphon-1.11. URL https://aclanthology.org/2022.sigmorphon-1.11.
  8. Enriching word vectors with subword information. CoRR, abs/1607.04606, 2016. URL http://arxiv.org/abs/1607.04606.
  9. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  4617–4624, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.414. URL https://aclanthology.org/2020.findings-emnlp.414.
  10. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  11. A joint model of orthography and morphological segmentation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  664–669, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1080. URL https://aclanthology.org/N16-1080.
  12. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.  536–541, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2085. URL https://aclanthology.org/N18-2085.
  13. Unsupervised morphology induction using morfessor. In FSMNLP, volume 4002 of Lecture Notes in Computer Science, pp.  300–301. Springer, 2005.
  14. How much does tokenization affect neural machine translation? CoRR, abs/1812.08621, 2018. URL http://arxiv.org/abs/1812.08621.
  15. Falcon, W. and The PyTorch Lightning team. PyTorch Lightning, March 2019. URL https://github.com/Lightning-AI/lightning.
  16. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  17. Morfessor EM+Prune: Improved subword segmentation with expectation maximization and pruning. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp.  3944–3953, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.486.
  18. DagoBERT: Generating derivational morphology with a pretrained language model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.
  19. Superbizarre is not superb: Improving BERT's interpretations of complex words with derivational morphology. CoRR, abs/2101.00403, 2021. URL https://arxiv.org/abs/2101.00403.
  20. An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022.
  21. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 02 2021. ISSN 1367-4803. doi: 10.1093/bioinformatics/btab083. URL https://doi.org/10.1093/bioinformatics/btab083.
  22. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  23. Kudo, T. Subword regularization: Improving neural network translation models with multiple subword candidates. CoRR, abs/1804.10959, 2018. URL http://arxiv.org/abs/1804.10959.
  24. Morpho challenge 2005-2010: Evaluations and results. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pp.  87–95, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-2211.
  25. Meal: Stable and active learning for few-shot prompting, 2023.
  26. Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training. Interdisciplinary sciences, computational life sciences, 15(1):32—43, March 2023. ISSN 1913-2751. doi: 10.1007/s12539-022-00537-9. URL https://doi.org/10.1007/s12539-022-00537-9.
  27. Morphological and language-agnostic word segmentation for NMT. CoRR, abs/1806.05482, 2018. URL http://arxiv.org/abs/1806.05482.
  28. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology.org/J93-2004.
  29. Using morphological knowledge in open-vocabulary neural language models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1435–1445, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1130. URL https://aclanthology.org/N18-1130.
  30. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013. URL http://arxiv.org/abs/1310.4546.
  31. MTEB: Massive text embedding benchmark. In Vlachos, A. and Augenstein, I. (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148.
  32. Morphological word segmentation on agglutinative languages for neural machine translation. CoRR, abs/2001.01589, 2020. URL http://arxiv.org/abs/2001.01589.
  33. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1525–1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.
  34. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp.  1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
  35. Deep contextualized word representations. CoRR, abs/1802.05365, 2018. URL http://arxiv.org/abs/1802.05365.
  36. V-measure: A conditional entropy-based external cluster evaluation measure. In Eisner, J. (ed.), Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.  410–420, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL https://aclanthology.org/D07-1043.
  37. The effectiveness of morphology-aware segmentation in low-resource neural machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp.  164–174, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-srw.22. URL https://aclanthology.org/2021.eacl-srw.22.
  38. Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5149–5152, 2012.
  39. AlephBERT: Language model pre-training and evaluation from sub-word to sentence level. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  46–56, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.4. URL https://aclanthology.org/2022.acl-long.4.
  40. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909, 2015. URL http://arxiv.org/abs/1508.07909.
  41. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp.  21–24, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. doi: 10.3115/v1/E14-2006. URL https://aclanthology.org/E14-2006.
  42. Super-convergence: Very fast training of residual networks using large learning rates. CoRR, abs/1708.07120, 2017. URL http://arxiv.org/abs/1708.07120.
  43. Impact of tokenization on language models: An analysis for Turkish. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 22(4), March 2023. ISSN 2375-4699. doi: 10.1145/3578707. URL https://doi.org/10.1145/3578707.
  44. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018. URL http://arxiv.org/abs/1804.07461.
  45. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019. URL http://arxiv.org/abs/1910.03771.
  46. Zhou, G. Morphological zero-shot neural machine translation. Master’s thesis, School of Informatics, University of Edinburgh, 2018.

Summary

  • The paper introduces MorphPiece, a hybrid tokenization scheme combining BPE with morpheme-based segmentation to provide linguistically aligned token splits.
  • It leverages a MorphTable built from MorphyNet, containing morphological segmentations of 346,340 English words, and combines the resulting affixes and stems with a custom BPE vocabulary to yield a final vocabulary of 50,006 tokens.
  • MorphGPT-Base, trained with MorphPiece, outperforms GPT-2, achieving nearly 10% higher LAMBADA accuracy and better zero-shot results despite roughly half the training steps.

MorphPiece: A Linguistic Tokenizer for LLMs

This paper introduces MorphPiece, a novel tokenization scheme for LLMs that integrates morphological segmentation with statistical methods to improve linguistic alignment. The author posits that current tokenizers, relying primarily on statistical analysis, neglect valuable linguistic features present in natural language. By incorporating morphological information, MorphPiece aims to create more natural and efficient subword tokenizations, leading to enhanced model performance across various NLP tasks.

MorphPiece Tokenization Scheme

The MorphPiece tokenization scheme (Figure 1) combines BPE with morpheme-based segmentation. The input text first undergoes normalization and pre-tokenization, following the standard BPE pipeline. Each pre-token is then looked up in MorphTable, a pre-computed table of morphological segmentations of English words. If a segmentation is found, the pre-token is replaced with its constituent morphemes; otherwise, it is split using BPE with a custom-trained vocabulary.

Figure 1: MorphPiece tokenization scheme integrates BPE pre-tokenization with a lookup in MorphTable for morpheme-based segmentation, falling back to BPE with a custom-trained vocabulary when no morphological segmentation is available.
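
The lookup-then-fallback flow can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: `morph_table` is a hypothetical dictionary mapping words to their morpheme sequences, `bpe_tokenize` stands in for the custom-trained BPE fallback, and pre-tokenization is reduced to a whitespace split.

```python
# Minimal sketch of the MorphPiece lookup-then-fallback flow described above.
# `morph_table` and `bpe_tokenize` are stand-ins for the paper's actual
# MorphTable and custom-trained BPE vocabulary.
from typing import Dict, List, Callable

def morphpiece_tokenize(text: str,
                        morph_table: Dict[str, List[str]],
                        bpe_tokenize: Callable[[str], List[str]]) -> List[str]:
    """Tokenize `text`: use the MorphTable segmentation when available,
    otherwise fall back to BPE with the custom-trained vocabulary."""
    tokens: List[str] = []
    # Pre-tokenization: a simple whitespace split stands in for the
    # normalization / pre-tokenization stage of the BPE pipeline.
    for pre_token in text.strip().split():
        word = pre_token.lower()
        if word in morph_table:
            tokens.extend(morph_table[word])        # morpheme path
        else:
            tokens.extend(bpe_tokenize(pre_token))  # statistical fallback
    return tokens

# Toy usage with a hand-written table entry and a trivial BPE stand-in.
table = {"paratrooper": ["para#", "troop", "#er"]}
print(morphpiece_tokenize("paratrooper lands", table, lambda w: [w]))
# -> ['para#', 'troop', '#er', 'lands']
```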

MorphTable Construction

MorphTable is constructed from MorphyNet, a database of derivational and inflectional morphology covering 15 languages. From this database, a lookup table of 346,340 English words segmented into morphemes was created. After filtering out entries occurring fewer than five times in the training corpus, the affix-and-stem vocabulary was trimmed to 18,304 tokens and the table to 134,943 entries. A rough sketch of this filtering step follows this paragraph.
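
The sketch below assumes the MorphyNet-derived table is a word-to-morphemes dictionary and that the five-occurrence threshold is applied per word against corpus counts; both details are assumptions, not quoted from the paper.

```python
# Sketch only: filter the MorphyNet-derived table against corpus frequencies
# and collect the resulting affix/stem vocabulary.
from collections import Counter
from typing import Dict, List, Set, Tuple

def build_morph_table(raw_table: Dict[str, List[str]],
                      word_counts: Counter,
                      min_count: int = 5) -> Tuple[Dict[str, List[str]], Set[str]]:
    """Keep only entries whose word occurs at least `min_count` times."""
    table = {word: morphs for word, morphs in raw_table.items()
             if word_counts[word] >= min_count}
    vocab = {m for morphs in table.values() for m in morphs}  # affixes + stems
    return table, vocab

# Toy usage with hand-written entries.
raw = {"paratrooper": ["para#", "troop", "#er"], "xylophonist": ["xylophone", "#ist"]}
counts = Counter({"paratrooper": 12, "xylophonist": 2})
table, vocab = build_morph_table(raw, counts)
print(len(table), sorted(vocab))   # -> 1 ['#er', 'para#', 'troop']
```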

Vocabulary Composition

The MorphPiece vocabulary comprises two components: the affixes and stems extracted from MorphTable, and a BPE vocabulary trained on OpenWebText. The BPE vocabulary is trained to a size of 32,000 tokens, targeting a final vocabulary comparable to GPT-2's 50,257 tokens. Words with segmentations available in MorphTable are excluded from the BPE training corpus. The final vocabulary contains 50,006 tokens.
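
A hedged sketch of this assembly step using the HuggingFace `tokenizers` library is shown below; the corpus filtering and the trainer settings are illustrative assumptions, not the paper's released code.

```python
# Sketch: train the fallback BPE vocabulary on OpenWebText-style text from
# which MorphTable words have been removed, then combine it with the
# affix/stem vocabulary.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def bpe_corpus(lines, morph_table):
    """Yield lines with MorphTable words stripped, so BPE only learns
    merges for words that lack a morphological segmentation."""
    for line in lines:
        kept = [w for w in line.split() if w.lower() not in morph_table]
        if kept:
            yield " ".join(kept)

def train_fallback_bpe(lines, morph_table, vocab_size=32_000) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(bpe_corpus(lines, morph_table), trainer=trainer)
    return tok

# Final MorphPiece vocabulary: affix/stem tokens (18,304) plus BPE tokens
# (32,000), minus any overlap, giving the 50,006 tokens reported in the paper.
```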

Tokenization Examples

The paper provides examples showing how MorphPiece splits words into linguistically meaningful affixes and stems. For instance, "paratrooper" is segmented as ('para#', 'troop', '#er') by MorphPiece, matching the word's morphological structure, whereas the BPE and WordPiece tokenizers split it into ('par', 'atro', 'oper') and ('para', '##tro', '##oper'), respectively. The position and presence or absence of the '#' symbol distinguish prefixes, suffixes, compound-word joiners, and stems. The paper argues that these splits align more closely with the linguistic parts of the word than those of purely statistical tokenizers.
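
The marker convention can be made concrete with a tiny helper; the classification rules below are inferred from the examples and from the detokenization section, not quoted from the paper.

```python
# Illustrative only: read off a token's role from the position of '#'.
def classify(token: str) -> str:
    if token == "#":
        return "hash"      # joiner between parts of a compound word
    if token.endswith("#"):
        return "prefix"    # e.g. 'para#'
    if token.startswith("#"):
        return "suffix"    # e.g. '#er'
    return "stem"          # e.g. 'troop'

print([classify(t) for t in ("para#", "troop", "#er")])
# -> ['prefix', 'stem', 'suffix']
```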

MorphGPT LLM

To validate the effectiveness of MorphPiece, the author trained a GPT-2 (Base) architecture with MorphPiece, named MorphGPT-Base, and compared it against the OpenAI GPT-2 model that uses BPE.

Training Details

MorphGPT-Base was trained for 200k steps on the OpenWebText corpus with a batch size of 512 and a one-cycle learning rate scheduler. Training was performed on NVIDIA A100 GPUs using HuggingFace's implementation of GPT-2 with PyTorch Lightning. For comparison, the author estimates that GPT-2 was trained for approximately 400k-500k steps.
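
A minimal sketch of this setup, assuming current HuggingFace Transformers and PyTorch Lightning APIs, is shown below; only the 200k-step budget and the one-cycle schedule come from the paper, while the learning rate, optimizer choice, and Trainer settings are assumptions.

```python
# Sketch of the MorphGPT-Base training loop: GPT-2 (Base) architecture from
# HuggingFace Transformers, PyTorch Lightning module, one-cycle LR schedule.
import torch
import pytorch_lightning as pl
from transformers import GPT2Config, GPT2LMHeadModel

class MorphGPTModule(pl.LightningModule):
    def __init__(self, vocab_size=50_006, max_lr=6e-4, total_steps=200_000):
        super().__init__()
        self.save_hyperparameters()
        self.model = GPT2LMHeadModel(GPT2Config(vocab_size=vocab_size))

    def training_step(self, batch, batch_idx):
        # Causal LM loss: labels are the input ids themselves (shifted internally).
        out = self.model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        self.log("train_loss", out.loss)
        return out.loss

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=self.hparams.max_lr)
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=self.hparams.max_lr, total_steps=self.hparams.total_steps)
        return {"optimizer": opt,
                "lr_scheduler": {"scheduler": sched, "interval": "step"}}

# Illustrative launch (device count and precision are assumptions):
# trainer = pl.Trainer(max_steps=200_000, accelerator="gpu", devices=8,
#                      precision="16-mixed")
# trainer.fit(MorphGPTModule(), train_dataloaders=train_loader)
```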

Evaluation Tasks

The performance of MorphGPT was evaluated on various NLP tasks, including perplexity on different datasets, the LAMBADA task, MTEB, and zero-shot prompt-based evaluations on GLUE.
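
For reference, the perplexity part of this evaluation can be sketched as follows. The snippet uses the public GPT-2 checkpoint and a single string as stand-ins, since the MorphGPT checkpoint name is not given here and real evaluations run over full held-out corpora.

```python
# Sketch: token-level perplexity of a causal LM on a piece of text.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy per token
    return math.exp(loss.item())

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(perplexity(model, tok, "Tokenization is a critical part of modern NLP pipelines."))
```

Note that token-level perplexity depends on the tokenizer's segmentation, so absolute values are tied to each model's own vocabulary.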

Results and Comparison

MorphGPT consistently demonstrates superior performance compared to GPT-2 across almost all evaluations, despite being trained for approximately half the number of steps. Specifically, MorphGPT achieved significantly better token-level perplexity scores, with performance comparable to GPT-2 (Large) after 200k steps. On the LAMBADA task, MorphGPT surpassed the accuracy of GPT-2 by almost 10% with only 50k steps, nearly reaching the accuracy of the GPT-2 Large model. In zero-shot GLUE evaluations, MorphGPT generally outperformed GPT-2, both in raw accuracy and the number of prompt templates where it showed superior performance.

The paper also presents a comparison with FLOTA, a tokenization improvement method that attempts to preserve the morphological structure of words during tokenization. MorphGPT outperformed FLOTA comprehensively on a classification task using a custom dataset of titles from arXiv, showing improvements of more than 35% over vanilla GPT-2, compared to about 6% for GPT-2+FLOTA.

Massive Text Embedding Benchmark (MTEB)

MorphGPT was evaluated on MTEB, which comprises 8 embedding task categories spanning a total of 58 datasets. MorphGPT outperforms GPT-2 across all 7 of the monolingual task categories.

Detokenization Process

The paper also addresses detokenization, i.e., converting the tokens produced by a MorphPiece-trained model back into coherent sentences. Tokens are first classified (Figure 2) as either 'morph' or 'bpe' based on their source; morph tokens are further annotated as prefix, suffix, stem, or hash (for compound words).

Figure 2: An example of detokenization illustrating how tokens are classified based on their source (MorphPiece or BPE) and how word boundaries are identified using the detokenization mechanism.

The detokenization mechanism (Figure 3) uses a reverse MorphTable to convert morpheme sequences back into English words, handling cases such as compound words and multiple affixes.

Figure 3: The detokenization mechanism illustrates the process of converting morphemes back into English words, with black lines indicating word continuation and red dashed lines indicating word boundaries.
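
The core of this mechanism can be approximated with a short sketch. The grouping rules and the reverse-table format below are assumptions based on the '#' convention described earlier; the actual procedure (Figure 3) handles more cases, including BPE token merging.

```python
# Hedged sketch of detokenization: group tokens into words using the '#'
# convention, then map morpheme sequences back to surface forms with a
# reverse MorphTable.
from typing import Dict, List, Tuple

def detokenize(tokens: List[str],
               reverse_table: Dict[Tuple[str, ...], str]) -> str:
    words: List[str] = []
    current: List[str] = []

    def flush():
        if current:
            key = tuple(current)
            # Look up the morpheme sequence; fall back to naive concatenation.
            words.append(reverse_table.get(key, "".join(t.strip("#") for t in current)))
            current.clear()

    for tok in tokens:
        if tok.endswith("#") or tok == "#":   # prefix or compound joiner: word continues
            current.append(tok)
        elif tok.startswith("#"):             # suffix: attach to the current word
            current.append(tok)
        else:                                 # stem or BPE token
            if current and not (current[-1].endswith("#") or current[-1] == "#"):
                flush()                       # previous word has ended
            current.append(tok)
    flush()
    return " ".join(words)

table = {("para#", "troop", "#er"): "paratrooper"}
print(detokenize(["para#", "troop", "#er", "lands"], table))
# -> 'paratrooper lands'
```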

Limitations

The paper acknowledges limitations, including incomplete coverage of lexical families in MorphyNet, the need to construct separate MorphTables and detokenization automata for each language, and a 17% increase in the number of tokens compared to BPE.

Conclusion

The author concludes that MorphPiece represents a linguistically motivated tokenization scheme that outperforms models trained on BPE across a wide variety of tasks. The paper suggests that incorporating linguistic inductive bias into tokenization can lead to a new generation of models that move away from purely statistical language representation.
