- The paper introduces a framework that evaluates language models by comparing their generated text against statistical tendencies of natural language, such as Zipf's and Heaps' laws.
- Experiments show that model architecture and decoding strategies, such as nucleus sampling, significantly influence adherence to these empirical linguistic trends.
- Findings reveal that while transformers and LSTMs capture some statistical properties effectively, beam search yields poorer alignment overall.
LLM Evaluation Beyond Perplexity
Introduction
The paper "LLM Evaluation Beyond Perplexity" (2106.00085) presents an alternative approach to evaluating how well LMs learn natural language by examining the alignment of generated texts with the statistical tendencies of human language. This paper dives deeply into various linguistic phenomena beyond traditional metrics such as perplexity, offering a framework for quantifying model adherence to empirical patterns. The investigation reveals that while LMs capture certain statistical aspects of language, this capability varies profoundly based on architectural and generational nuances.
Statistical Tendencies in LLMs
The authors aim to evaluate LMs by how closely the generated text aligns with known statistical laws and distributions of natural language, such as Zipf's law and Heaps' law, among other empirical tendencies.
Classical Laws
- Zipf's Law: The rank-frequency distribution of words in natural language typically follows Zipf's law, with frequency roughly inversely proportional to rank. The paper measures how closely model-generated text adheres to this law.
- Heaps' Law: Describes the sub-linear growth in the number of unique words (types) as document length (tokens) increases. The assessment involves fitting a non-homogeneous Poisson process to model the type-token relationship; a minimal empirical sketch of both laws follows Figure 1.
Figure 1: Average number of unique words vs. document length (the type-token relationship) in text sampled from LMs. Values from the models' test set are plotted for reference.
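As a concrete illustration of the two laws, the sketch below computes empirical rank-frequency pairs (Zipf) and the type-token growth curve (Heaps) from a tokenized sample; the whitespace tokenization and toy sentence are assumptions for illustration, not the paper's setup.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs sorted by descending frequency (the Zipf's-law view)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def type_token_curve(tokens):
    """Return the number of unique types seen after each token (the Heaps'-law view)."""
    seen, curve = set(), []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

# Toy, whitespace-tokenized text; real corpora show a heavy Zipfian tail and
# sub-linear type-token growth.
tokens = "the cat sat on the mat and the dog sat too".split()
print(rank_frequency(tokens)[:3])  # e.g. [(1, 3), (2, 2), (3, 1)]
print(type_token_curve(tokens))
```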
Other Tendencies
The authors extend their analysis to other text characteristics, such as the distributions of text length, stopwords, and symbols. These properties have no fixed theoretical laws and are instead assessed by comparing empirical distributions.
Statistical Distances
The paper employs several statistical distance measures to quantify the fit between LM-generated text and natural language distributions:
- Kolmogorov-Smirnov (KS) Metric: Measures the maximum difference between two empirical cumulative distribution functions (CDFs).
- Total Variation Distance (TVD): Measures the discrepancy between two discrete probability distributions (half the L1 distance between them).
These metrics support hypothesis testing, both for goodness of fit to known theoretical distributions and for comparing model-generated text against naturally occurring empirical distributions; a minimal sketch of both measures follows.
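The sketch below implements both distances, assuming SciPy's two-sample KS test and plain unigram count dictionaries for the TVD; the function names and toy inputs are illustrative, not the paper's code.

```python
from scipy import stats

def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples."""
    return stats.ks_2samp(sample_a, sample_b).statistic

def total_variation(p_counts, q_counts):
    """TVD between two unigram count dictionaries: half the L1 distance
    between the normalized distributions over the union vocabulary."""
    vocab = set(p_counts) | set(q_counts)
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    return 0.5 * sum(abs(p_counts.get(w, 0) / p_total - q_counts.get(w, 0) / q_total)
                     for w in vocab)

# Hypothetical usage: sentence-length samples and unigram counts from a model vs. held-out text.
print(ks_distance([12, 18, 25, 9, 30], [11, 20, 27, 8, 33]))
print(total_variation({"the": 5, "cat": 2}, {"the": 4, "dog": 3}))
```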
Experiments and Results
Rank-Frequency
Adherence of the empirical rank-frequency distribution to a Zipfian profile is measured with the KS metric. Results indicate that transformer and LSTM models align closely with natural language, while both deviate more noticeably from an idealized Zipfian distribution.
Figure 2: Rank-frequency distributions for different samples. All follow a remarkably similar trend.
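One rough diagnostic for how Zipfian a sample is: estimate the Zipf exponent with a least-squares fit in log-log space, as sketched below. This is a simplified stand-in for the paper's estimation procedure, and the toy token counts are purely illustrative.

```python
import numpy as np
from collections import Counter

def zipf_exponent(tokens):
    """Estimate s in frequency ~ rank**(-s) via least squares in log-log space
    (a rough diagnostic, not the paper's estimator)."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Toy counts; for natural corpora the estimated exponent is typically close to 1.
tokens = ("the " * 48 + "of " * 24 + "and " * 12 + "cat " * 6 + "sat " * 3).split()
print(f"estimated Zipf exponent: {zipf_exponent(tokens):.2f}")
```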
Type-Token Relationship
The type-token trend is examined using KS metrics for varying document lengths. Transformer architectures with nucleus sampling demonstrate closer alignment with natural language type-token distributions.
Figure 3: KS metrics (lower implies closer fit) with reference distributions for the type-token relationship as a function of document length.
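Because decoding strategy drives much of the variation reported here, below is a common, minimal implementation of nucleus (top-p) sampling over raw next-token logits; the cutoff value and toy logits are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def nucleus_sample(logits, p=0.95, rng=None):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose cumulative
    probability reaches p, renormalize, and sample from that set."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # token indices by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix covering mass >= p
    nucleus = order[:cutoff]
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())

# Hypothetical next-token logits over a 5-token vocabulary.
print(nucleus_sample(np.array([2.0, 1.0, 0.5, -1.0, -3.0]), p=0.9))
```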
Unigram, Length, Stopword, and Symbol Distributions
For unigram distributions, TVD indicates that most LMs capture the natural language distribution effectively. Interestingly, LSTMs showed notable fidelity to the empirical distributions of length, stopwords, and symbols. Beam search, however, typically exhibited poor alignment.
Figure 4: Boxplots showing the distribution of sample length and stopword and symbol percentages.
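A small sketch of how such surface statistics might be gathered per document before their empirical distributions are compared against the test set (e.g. with the KS metric); the stopword list and tokenization here are placeholder assumptions rather than the paper's exact setup.

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}  # illustrative list

def surface_statistics(tokens):
    """Per-document length, stopword percentage, and symbol (non-alphanumeric) percentage."""
    n = len(tokens)
    stop = sum(t.lower() in STOPWORDS for t in tokens)
    symb = sum(not t.isalnum() for t in tokens)
    return {"length": n, "stopword_pct": 100 * stop / n, "symbol_pct": 100 * symb / n}

# Hypothetical usage on one sampled document; in practice these values are collected
# over many samples and compared distribution-to-distribution.
print(surface_statistics("the cat sat on the mat .".split()))
```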
Discussion and Conclusion
The paper concludes that the proposed methodology provides richer insight into an LM's ability to capture the nuances of natural language than perplexity alone. The research shows that alignment with natural language distributions depends on both model architecture and generation strategy. A key takeaway is the advantage of techniques such as nucleus sampling, which improve statistical adherence across several metrics.
Future work could explore cross-linguistic analyses and the impact of dataset size on distribution alignment. Additionally, examining how these metrics correlate with other evaluation methodologies may refine our understanding of a model's overall linguistic competence. The work underscores the limited scope of perplexity and proposes a more nuanced evaluation framework for advancing LM assessment.