Emergent Mind

Abstract

The widespread adoption of LLMs makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures.

Figure: Shared and unique properties of humans and LLMs, highlighting the limitations of human-centric tests.

Overview

  • LLMs like GPT-3.5 and GPT-4 are predominantly trained to predict the next word in a text sequence.

  • LLM performance is shaped by probability: how often a task appears in training data, how likely the target output is, and, to a lesser degree, how likely the input is.

  • Empirical research on LLMs shows a strong correlation between task success and both task frequency in training data and output probability.

  • LLMs struggle with tasks that humans solve through physical interaction (such as keyboard-based ciphers) and are sensitive to the exact phrasing of their input.

  • The paper suggests that while advanced prompting and scaling can improve LLM performance, inherent biases linked to their training require careful consideration.

Understanding LLMs

Training Objectives and LLM Behavior

The widespread deployment of LLMs like GPT-3.5 and GPT-4 necessitates an understanding of their strengths and limitations. It is posited that to truly grasp the capabilities of LLMs, one must consider the problem these models have been trained to solve: predicting the next word in a sequence, using Internet text as the training corpus. Recognizing this training objective, which is the essence of their autoregressive nature, and the environment in which they operate makes it possible to anticipate when they will succeed or fail.
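
To make the training objective concrete, here is a minimal sketch of next-word prediction using the Hugging Face transformers library. GPT-2 stands in for GPT-3.5 and GPT-4, whose weights are not openly available, and the prompt is illustrative only.

```python
# Minimal sketch of next-word prediction with an open autoregressive model (GPT-2),
# standing in for GPT-3.5/GPT-4, which are not openly available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Distribution over the next token, given everything seen so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i)):>10s}  {p.item():.3f}")
```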

Factors Influencing LLM Performance

Research presents a "teleological" approach, prioritizing the goals and environment that shape LLMs. This perspective predicts that LLM accuracy is influenced by:

  • Task probability: LLMs perform better on tasks that appear frequently in their training data.
  • Output probability: even on deterministic tasks, where probability should not matter, accuracy is higher when the target output is a high-probability word sequence (a scoring sketch follows this list).
  • Input probability: the likelihood of the provided input also affects accuracy, though less strongly than output probability.
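
One way to operationalize "output probability" is to sum the token log-probabilities a causal language model assigns to a candidate target string; this is an assumption about methodology rather than a description of the paper's exact procedure. The sketch below does this with GPT-2 and two illustrative sentences, one natural and one shuffled.

```python
# Sketch: scoring "output probability" as the log-probability a causal LM assigns
# to a candidate target string. GPT-2 stands in for the evaluated models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the whole string."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for positions 1..n-1
    targets = ids[0, 1:]                                    # the tokens those positions should predict
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

# A high-probability word sequence should score far above a shuffled (low-probability) one.
print(sequence_logprob("I have a dream that one day this nation will rise up."))
print(sequence_logprob("dream a one that I nation day have this up will rise."))
```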

Empirical Validation

Evaluations of GPT-3.5 and GPT-4 across eleven distinct tasks reveal three key influences:

  1. LLM accuracy tracks task frequency: common tasks are handled far more reliably than rare ones.
  2. Even when a task is deterministic and does not depend on probability, the probability of the target output strongly shapes performance.
  3. Input probability also shapes behavior, but its effect is weaker than that of output probability.

What stands out is an asymmetry: models are more affected by the likelihood of what they generate (outputs) than by the likelihood of what they receive (inputs).
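
The abstract's cipher result (GPT-4 decoding accuracy of 51% for high-probability outputs versus 13% for low-probability ones) illustrates the second finding. The sketch below shows how such a probe could be constructed; the summary does not name the cipher, so rot13 is used here purely as an illustrative simple cipher, and the prompt wording is an assumption.

```python
# Sketch of a probe: ask a model to decode a simple cipher where the correct answer is
# either a high-probability or a low-probability word sequence. The cipher (rot13) and
# prompt wording are illustrative assumptions, not the paper's exact setup.
import codecs

high_prob_target = "I have a dream that one day this nation will rise up."
low_prob_target = "dream a one that I nation day have this up will rise."  # same words, shuffled

for target in (high_prob_target, low_prob_target):
    encoded = codecs.encode(target, "rot13")
    prompt = (
        "The following text has been encoded with rot13. "
        f"Decode it back into English:\n{encoded}"
    )
    # Send `prompt` to the LLM under test (e.g., via an API client) and compare its answer
    # to `target`; the claim is that accuracy drops sharply for the shuffled case.
    print(prompt)
```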

Beyond Probability: Other Characteristic Phenomena

  • Lack of Embodiment: LLMs may fail at tasks that humans solve easily through physical interaction, such as applying a keyboard-based cipher (a sketch of one such cipher follows this list).
  • Sensitivity to Wording: The exact phrasing, even for similar ideas, can elicit divergent LLM responses, revealing a heavy reliance on language patterns.
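
As a concrete illustration of the embodiment point, the sketch below assumes one plausible keyboard-based cipher: replace each letter with the key to its right on a QWERTY keyboard. The summary does not define the actual scheme, so this mapping is an assumption. A person can apply or invert such a cipher by glancing at a physical keyboard, whereas a text-only model must infer the layout from its training data.

```python
# One plausible keyboard-based cipher (an assumption; the summary does not define the scheme):
# replace each letter with the key immediately to its right on a QWERTY keyboard,
# wrapping around at the end of each row.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

SHIFT_RIGHT = {}
for row in QWERTY_ROWS:
    for i, ch in enumerate(row):
        SHIFT_RIGHT[ch] = row[(i + 1) % len(row)]

def keyboard_encode(text: str) -> str:
    """Encode lowercase letters via the shift-right mapping; leave other characters alone."""
    return "".join(SHIFT_RIGHT.get(ch, ch) for ch in text.lower())

print(keyboard_encode("hello world"))  # -> "jraap eptaf"
```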

Implications for LLM Application

The work advises caution when employing LLMs for rare tasks (where the task-frequency bias hurts accuracy) and in situations that require generating low-probability text. Advanced prompting strategies and scaling may improve performance, but these fundamental tendencies persist, underscoring the need for an approach informed by the objective LLMs were trained on.

Closing Thoughts

As LLMs continue to advance in capability, understanding their ingrained biases and operational quirks becomes more critical. This paper underscores the importance of evaluating LLMs in light of the problem they were trained to solve, treating them as a distinct type of system rather than as humans, in order to map their capabilities and boundaries accurately.
