
Impact of Pretraining Term Frequencies on Few-Shot Reasoning (2202.07206v2)

Published 15 Feb 2022 in cs.CL and cs.LG

Abstract: Pretrained Language Models (LMs) have demonstrated ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% frequent terms in comparison to the bottom 10%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.

Authors (4)
  1. Yasaman Razeghi
  2. Robert L. Logan IV
  3. Matt Gardner
  4. Sameer Singh
Citations (140)

Summary

  • The paper demonstrates that models can be over 70% (absolute) more accurate on numerical reasoning instances whose terms fall in the top 10% of pretraining frequency than on those in the bottom 10%.
  • It uses term frequency counts from the Pile dataset to correlate occurrences with performance on arithmetic and unit conversion challenges.
  • Findings underline the need for training and evaluation protocols that account for pretraining data frequencies when assessing few-shot reasoning capabilities.

Overview: Impact of Pretraining Term Frequencies on Few-Shot Reasoning

In the paper "Impact of Pretraining Term Frequencies on Few-Shot Reasoning," the authors examine the extent to which the reasoning capabilities of pretrained language models (LMs) are influenced by term frequencies in their pretraining data. The research focuses on the performance of GPT-based models on numerical reasoning tasks, including arithmetic operations and unit conversion. The authors investigate whether these models rely on robust reasoning or merely mimic patterns from the pretraining data.

The paper centers on the correlation between the frequency of specific terms in the pretraining dataset and the models' performance on reasoning tasks involving those terms. Experiments reveal that models perform significantly better on instances with frequent terms than on those with infrequent ones; in some cases, models are over 70% (absolute) more accurate on instances whose terms fall in the top 10% of pretraining frequency than on those in the bottom 10%. This substantial performance gap suggests that frequent terms have a disproportionate influence on model output, calling into question the generalization capabilities of LMs beyond their pretraining data.
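One way to formalize the gap being measured (the notation here is ours, not the paper's): writing $\mathrm{acc}(S)$ for average accuracy over a set $S$ of test instances,

$$\Delta = \mathrm{acc}\big(S_{\text{top }10\%}\big) - \mathrm{acc}\big(S_{\text{bottom }10\%}\big),$$

where instances are ranked by the pretraining frequency of their terms; the headline result is that $\Delta$ exceeds 70 points (absolute) in some task settings.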

Methodology

The methodology involves calculating term frequencies in the Pile dataset, the pretraining corpus for the GPT-based models used. By counting occurrences of individual terms, and co-occurrences of term pairs within a fixed window, the authors analyze whether these frequencies correlate with model accuracy across various few-shot reasoning tasks. The tasks include standard arithmetic, operation inference, and time-unit conversion. The performance gap between high-frequency and low-frequency terms is quantified and serves as a metric to assess the influence of pretraining data on model output.
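A minimal sketch of this counting step, assuming the pretraining corpus is available as an iterable of tokenized documents (the window size, function name, and toy corpus below are illustrative, not the authors' exact implementation):

```python
from collections import Counter

WINDOW = 5  # co-occurrence window; illustrative, not necessarily the paper's setting

def count_frequencies(documents, vocabulary):
    """Count unit occurrences, and windowed co-occurrences, of terms of interest.

    `documents` is an iterable of token lists; `vocabulary` is the set of terms
    appearing in the test instances (e.g., the operands of arithmetic problems).
    """
    unit = Counter()  # term -> total occurrences in the corpus
    pair = Counter()  # (term_a, term_b) -> co-occurrences within WINDOW tokens
    for tokens in documents:
        for i, tok in enumerate(tokens):
            if tok not in vocabulary:
                continue
            unit[tok] += 1
            for other in tokens[i + 1 : i + 1 + WINDOW]:
                if other in vocabulary:
                    pair[tuple(sorted((tok, other)))] += 1
    return unit, pair

# Toy usage: how often "24" occurs, and co-occurs with "18", in two documents
docs = [["what", "is", "24", "times", "18"], ["24", "hours", "in", "a", "day"]]
unit, pair = count_frequencies(docs, vocabulary={"24", "18"})
print(unit["24"], pair[("18", "24")])  # -> 2 1
```

Each test instance can then be assigned the frequency of its terms (or term pairs), which is what gets correlated with accuracy.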

Key Findings

The results illustrate that the pretrained models exhibit a significant performance gap based on term frequency, indicating that their reasoning capabilities are closely linked to the data distribution in their pretraining corpus. Notably, even as the models’ size increased from GPT-Neo-1.3B to GPT-J-6B, the reliance on frequent terms persisted, albeit with improved overall accuracy in larger models. This suggests a persistent challenge in decoupling reasoning abilities from memorization of pretraining data patterns.

Implications and Future Directions

The implications of this research are critical for understanding the limitations of LLMs' few-shot reasoning capabilities. The demonstrated reliance on term frequency underscores the necessity for robust evaluation methods that account for pretraining data characteristics. The authors recommend incorporating measures of pretraining frequency into evaluation protocols to better gauge genuine reasoning abilities. Furthermore, these findings provoke broader questions about the training and deployment of LLMs in various applications, particularly where independence from pretraining data biases is paramount.
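As a concrete illustration of the kind of frequency-aware reporting the authors advocate, the sketch below (names and data are ours, not from the paper) buckets evaluation instances by pretraining frequency and reports top-vs-bottom decile accuracy alongside the resulting gap:

```python
import numpy as np

def frequency_stratified_report(freqs, correct, q=0.10):
    """Accuracy on the most- vs. least-frequent evaluation instances.

    `freqs[i]` is the pretraining frequency of the term(s) in instance i;
    `correct[i]` is 1 if the model answered instance i correctly, else 0.
    """
    freqs, correct = np.asarray(freqs), np.asarray(correct)
    lo_cut, hi_cut = np.quantile(freqs, [q, 1 - q])
    acc_bottom = float(correct[freqs <= lo_cut].mean())  # least frequent decile
    acc_top = float(correct[freqs >= hi_cut].mean())     # most frequent decile
    return {"acc_bottom": acc_bottom, "acc_top": acc_top, "gap": acc_top - acc_bottom}

# Toy usage: a large gap warns that accuracy may track pretraining exposure
report = frequency_stratified_report(
    freqs=[3, 10, 10_000, 250_000, 7, 90_000, 1, 500_000, 40, 120_000],
    correct=[0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
)
print(report)  # -> {'acc_bottom': 0.0, 'acc_top': 1.0, 'gap': 1.0}
```

Reporting per-bucket accuracy in this way, rather than a single aggregate number, makes frequency-driven performance visible at evaluation time.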

Future developments in AI could explore methodologies to mitigate the influence of term frequencies, such as adjusting training algorithms to promote generalized reasoning or altering evaluation frameworks. Additionally, expanding analyses to include other types of reasoning tasks, beyond numerical, could offer comprehensive insights into the reasoning strengths and weaknesses of LLMs.

In summary, this paper sheds light on an important aspect of LLM evaluation, encouraging the research community to consider pretraining data characteristics when interpreting model performance. The identified correlation between term frequency and model performance calls into question the extent of genuine inference capability, suggesting the need for refined training and assessment techniques in AI development.
