Emergent Mind

Abstract

The widespread adoption of LLMs makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures.

Figure: Shared and unique properties of humans and LLMs, highlighting the limitations of human-centric tests.

Overview

  • LLMs like GPT-3.5 and GPT-4 are predominantly trained to predict the next word in a text sequence.

  • LLM performance is shaped by probability: how often a task appears in training data, how likely the target output is, and, to a lesser degree, how likely the input is.

  • Empirical research on LLMs shows a strong correlation between task success and both task frequency in training data and output probability.

  • LLMs struggle with tasks that humans solve through physical interaction (such as keyboard-based ciphers) and are sensitive to the exact phrasing of their input.

  • The paper suggests that while advanced prompting and scaling can improve LLM performance, inherent biases linked to their training require careful consideration.

Understanding LLMs

Training Objectives and LLM Behavior

The widespread deployment of LLMs like GPT-3.5 and GPT-4 necessitates an understanding of their strengths and limitations. It is posited that to truly grasp the capabilities of LLMs, one must consider the problem these models have been trained to solve: predicting the next word in a sequence, using Internet text as the training corpus. Recognizing this training objective, which is the essence of their autoregressive nature, and the environment in which they operate makes it possible to anticipate when they will succeed or fail.
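
To make the training objective concrete, here is a minimal sketch of next-word prediction using the Hugging Face transformers library. GPT-2 stands in for GPT-3.5 and GPT-4, whose weights are not openly available, and the prompt is illustrative only.

```python
# Minimal sketch of next-word prediction with an open autoregressive model (GPT-2),
# standing in for GPT-3.5/GPT-4, which are not openly available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Distribution over the next token, given everything seen so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i)):>10s}  {p.item():.3f}")
```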

Factors Influencing LLM Performance

Research presents a "teleological" approach, prioritizing the goals and environment that shape LLMs. This perspective predicts that LLM accuracy is influenced by:

  • Task probability: LLMs perform better on tasks that appear frequently in their training data.
  • Output probability: even on deterministic tasks, where probability should not matter, accuracy is higher when the target output is a high-probability word sequence (a scoring sketch follows this list).
  • Input probability: the likelihood of the provided input also affects accuracy, though less strongly than output probability.
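
One way to operationalize "output probability" is to sum the token log-probabilities a causal language model assigns to a candidate target string; this is an assumption about methodology rather than a description of the paper's exact procedure. The sketch below does this with GPT-2 and two illustrative sentences, one natural and one shuffled.

```python
# Sketch: scoring "output probability" as the log-probability a causal LM assigns
# to a candidate target string. GPT-2 stands in for the evaluated models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the whole string."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for positions 1..n-1
    targets = ids[0, 1:]                                    # the tokens those positions should predict
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

# A high-probability word sequence should score far above a shuffled (low-probability) one.
print(sequence_logprob("I have a dream that one day this nation will rise up."))
print(sequence_logprob("dream a one that I nation day have this up will rise."))
```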

Empirical Validation

Evaluations of GPT-3.5 and GPT-4 across eleven distinct tasks reveal three key influences:

  1. LLM accuracy tracks task frequency: common tasks are handled far more reliably than rare ones.
  2. Even when a task is deterministic and does not depend on probability, the probability of the target output strongly shapes performance.
  3. Input probability also shapes behavior, but its effect is weaker than that of output probability.

What stands out is an asymmetry: models are more affected by the likelihood of what they generate (outputs) than by the likelihood of what they receive (inputs).
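
The abstract's cipher result (GPT-4 decoding accuracy of 51% for high-probability outputs versus 13% for low-probability ones) illustrates the second finding. The sketch below shows how such a probe could be constructed; the summary does not name the cipher, so rot13 is used here purely as an illustrative simple cipher, and the prompt wording is an assumption.

```python
# Sketch of a probe: ask a model to decode a simple cipher where the correct answer is
# either a high-probability or a low-probability word sequence. The cipher (rot13) and
# prompt wording are illustrative assumptions, not the paper's exact setup.
import codecs

high_prob_target = "I have a dream that one day this nation will rise up."
low_prob_target = "dream a one that I nation day have this up will rise."  # same words, shuffled

for target in (high_prob_target, low_prob_target):
    encoded = codecs.encode(target, "rot13")
    prompt = (
        "The following text has been encoded with rot13. "
        f"Decode it back into English:\n{encoded}"
    )
    # Send `prompt` to the LLM under test (e.g., via an API client) and compare its answer
    # to `target`; the claim is that accuracy drops sharply for the shuffled case.
    print(prompt)
```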

Beyond Probability: Other Characteristic Phenomena

  • Lack of Embodiment: LLMs may fail at tasks that humans solve easily through physical interaction, such as applying a keyboard-based cipher (a sketch of one such cipher follows this list).
  • Sensitivity to Wording: The exact phrasing, even for similar ideas, can elicit divergent LLM responses, revealing a heavy reliance on language patterns.
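
As a concrete illustration of the embodiment point, the sketch below assumes one plausible keyboard-based cipher: replace each letter with the key to its right on a QWERTY keyboard. The summary does not define the actual scheme, so this mapping is an assumption. A person can apply or invert such a cipher by glancing at a physical keyboard, whereas a text-only model must infer the layout from its training data.

```python
# One plausible keyboard-based cipher (an assumption; the summary does not define the scheme):
# replace each letter with the key immediately to its right on a QWERTY keyboard,
# wrapping around at the end of each row.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

SHIFT_RIGHT = {}
for row in QWERTY_ROWS:
    for i, ch in enumerate(row):
        SHIFT_RIGHT[ch] = row[(i + 1) % len(row)]

def keyboard_encode(text: str) -> str:
    """Encode lowercase letters via the shift-right mapping; leave other characters alone."""
    return "".join(SHIFT_RIGHT.get(ch, ch) for ch in text.lower())

print(keyboard_encode("hello world"))  # -> "jraap eptaf"
```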

Implications for LLM Application

The work advises caution when employing LLMs for rare tasks (where the task-frequency bias hurts accuracy) and in situations that require generating low-probability text. Advanced prompting strategies and scaling may improve performance, but these fundamental tendencies persist, underscoring the need for an approach informed by the objective LLMs were trained on.

Closing Thoughts

As LLMs continue to advance in capability, understanding their ingrained biases and operational quirks becomes more critical. This paper underscores the importance of evaluating LLMs in light of the problem they were trained to solve, treating them as a distinct type of system rather than as humans, in order to map their capabilities and boundaries accurately.
