- The paper presents Neo-Bench, a framework for systematically evaluating how neologisms affect LLM performance, particularly in machine translation.
- The paper employs a dataset of 2,505 neologisms spanning lexical, morphological, and semantic types to benchmark model behavior across several tasks.
- The paper finds that larger LLMs with more recent knowledge cut-offs handle neologisms better, underscoring the need to keep training data current.
Evaluating Robustness of LLMs with Neologisms: Insights from Neo-Bench
The paper introduces Neo-Bench, a framework designed to systematically evaluate how LLMs perform when confronted with neologisms: newly coined words and expressions that are largely absent from the models' training data. This addresses a key source of the temporal drift seen in LLMs, since language keeps evolving after a model's training corpus is frozen, and existing models struggle to process and adapt to terms coined after their knowledge cut-off.
Neologisms challenge LLMs primarily because they emerge after the models have been trained. The paper reports a significant drop in performance on tasks such as machine translation when sentences include neologisms: for example, translation quality can roughly halve when even a single neologism appears in the input. The authors attribute this to temporal misalignment, as linguistically novel terms are, by construction, absent from the static corpora these models were trained on.
Neo-Bench Benchmark
Neo-Bench comprises a diverse collection of 2,505 neologisms, sourced using multiple collection methods. The benchmark covers three distinct types of neologisms: lexical (entirely new words), morphological (new derivations or blends of existing roots), and semantic (existing words that acquire new meanings). It evaluates LLMs on a variety of tasks, including perplexity measurement, Cloze Question Answering, Definition Generation, and Machine Translation. The authors find that models with later knowledge cut-off dates show lower perplexity on neologisms and better downstream task performance, indicating that training on more recent data improves the handling of linguistic novelty.
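As a concrete illustration of the perplexity task, the sketch below scores a minimal sentence pair under a Hugging Face causal LM. The model name and example sentences are placeholders, not items from the Neo-Bench release; per the paper's findings, the sentence containing a post-cutoff coinage should receive the higher perplexity.

```python
# A minimal sketch of a perplexity probe for neologisms, assuming a
# Hugging Face causal LM. Model choice and sentences are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with a known cutoff works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Hypothetical minimal pair: same context, established term vs. coinage.
print(perplexity("The team ran a hackathon to prototype the feature."))
print(perplexity("The team ran a promptathon to prototype the feature."))
```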
Findings and Implications
Key findings from the Neo-Bench evaluation show that older LLMs, such as BART and T5, degrade significantly when handling neologisms relative to more recent models. Moreover, standard machine translation metrics do not reliably capture the quality of translations containing neologisms, revealing a marked discrepancy between automated evaluation and human judgment.
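The metric discrepancy is easy to see with a surface-overlap score such as BLEU: a translation that botches only the neologism still shares most n-grams with the reference. The sketch below, using invented example sentences rather than Neo-Bench data, shows sentence-level BLEU barely penalizing a literal mistranslation that a human rater would reject.

```python
# Why n-gram overlap metrics can miss neologism errors. The reference and
# hypotheses are invented for illustration (pip install sacrebleu).
import sacrebleu

reference = ["Ihre Doomscrolling-Gewohnheit raubte ihr den Schlaf."]

exact_hyp = "Ihre Doomscrolling-Gewohnheit raubte ihr den Schlaf."
# A word-for-word rendering of the neologism that a human would reject:
literal_hyp = "Ihre Untergangsscrollen-Gewohnheit raubte ihr den Schlaf."

print(sacrebleu.sentence_bleu(exact_hyp, reference).score)    # 100.0
print(sacrebleu.sentence_bleu(literal_hyp, reference).score)  # still high: only one token differs
```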
The paper also shows that larger models generally handle neologisms better, suggesting that increased model capacity can partially mitigate the challenges posed by language change. Difficulty further varies with neologism type: semantic neologisms often elicit erroneous literal translations, since the model defaults to a word's established sense, while morphological neologisms show the best segmentation and tokenization performance.
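The tokenization point can be probed directly. The sketch below inspects how a BPE tokenizer (GPT-2's, chosen here only as a common default) segments neologisms of different types; the example words are illustrative, and the exact splits depend on the vocabulary.

```python
# Inspect subword segmentation of neologisms; the example words and the
# GPT-2 tokenizer are illustrative assumptions, not taken from the paper.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

words = [
    "unfollowable",   # morphological: familiar stem plus affixes
    "doomscrolling",  # lexical blend of two existing words
    "rizz",           # lexical coinage with no transparent parts
]
for word in words:
    # A leading space matches how the word would appear mid-sentence in BPE.
    print(f"{word!r} -> {tok.tokenize(' ' + word)}")
```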
Future Directions
The insights from Neo-Bench open avenues for further LLM research, particularly methods for effectively incorporating recent language data into models to reduce temporal drift. Handling of new linguistic constructs could be improved by periodically retraining on continuously updated corpora or by introducing real-time adaptation mechanisms. Improving automatic evaluation metrics so they better assess translations involving neologisms also remains a crucial line of inquiry.
Conclusion
The paper provides a nuanced account of how neologisms challenge the robustness of LLMs and underscores the importance of adaptability in language modeling. Neo-Bench serves as a valuable tool for future research, with implications for improving the robustness and precision of LLMs in real-world applications where language evolves rapidly. The work lays critical groundwork for addressing temporal misalignment, emphasizing the need for continuous refinement of dataset curation and training methodology.