
The Linear Representation Hypothesis and the Geometry of Large Language Models

Published 7 Nov 2023 in cs.CL, cs.AI, cs.LG, and stat.ML | (2311.03658v2)

Abstract: Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.

Citations (85)

Summary

  • The paper introduces a formalization of the linear representation hypothesis by defining high-level concepts as linear directions in LLM representation spaces using a causal inner product.
  • Empirical validation with LLaMA-2 shows that concept directions in the embedding and unembedding spaces can be used both to predict (probe) and to steer model outputs.
  • The methodology bridges unembedding and embedding representations, offering actionable insights into model transparency, intervention, and control.

The Linear Representation Hypothesis and the Geometry of LLMs

Overview of the Linear Representation Hypothesis

The paper "The Linear Representation Hypothesis and the Geometry of LLMs" (2311.03658) introduces a formalization of the linear representation hypothesis within the context of LLMs. The hypothesis suggests that high-level concepts in LLMs can be represented as linear directions in the model's representation space. The paper systematically addresses ambiguities concerning what "linear representation" entails and how geometric notions like cosine similarity or projection can be meaningfully interpreted within these representation spaces. Figure 1

Figure 1: The geometry of linear representations can be understood in terms of a causal inner product that respects the semantic structure of concepts.

The authors propose two formalizations of linear representation—one in the output (word) representation space and one in the input (sentence) space—using the language of counterfactuals. They demonstrate connections between these formalizations and concepts like linear probing and model steering. Crucially, they introduce a causal inner product, which unifies representation notions and respects LLM structure.

Formalizations and Contributions

Unembedding Representations

The paper starts by formalizing unembedding representations, where high-level concepts are encoded as directions within the output word space. For a concept like "male" versus "female", the difference vectors from "king" to "queen" and from "man" to "woman" are examples of such a linear relationship.

The authors use causal reasoning to formalize these relationships, modeling each concept as a binary variable whose change defines specific counterfactual outputs. They show that LLMs encode these directions, in line with empirical observations of linear subspaces in word embeddings.
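As a minimal sketch of this construction (not the authors' released code), a concept direction in the unembedding space can be estimated by averaging the difference vectors of counterfactual word pairs. The model name, the single-token assumption, and the variable names below are illustrative.

```python
# Minimal sketch: estimate a concept direction from counterfactual pairs.
# Assumes each word maps to a single token; model/tokenizer names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM with an unembedding matrix works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

U = model.get_output_embeddings().weight.detach()  # unembedding matrix, shape (vocab, d)

def word_vec(word: str) -> torch.Tensor:
    ids = tok.encode(word, add_special_tokens=False)
    return U[ids[0]]  # single-token assumption, for illustration only

pairs = [("king", "queen"), ("man", "woman"), ("actor", "actress")]
diffs = torch.stack([word_vec(b) - word_vec(a) for a, b in pairs])
gamma_gender = diffs.mean(dim=0)  # estimated unembedding direction for the male->female concept

# If the linear representation hypothesis holds, the individual differences
# should all have a large positive projection onto this shared direction.
cos = torch.nn.functional.cosine_similarity(diffs, gamma_gender.unsqueeze(0))
print(cos)
```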

Embedding Representations

Similarly, embedding representations live in the input (sentence) space, where context representations can be altered to reflect concept changes. This formalization connects to intervention methods, such as changing the output language from French to English by vector addition in the representation space, demonstrating model steering capabilities.
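A minimal sketch of this kind of intervention (illustrative, not the paper's exact protocol), reusing `model`, `tok`, and `U` from the sketch above and assuming a precomputed embedding-space steering vector `lambda_C` (for instance obtained from an unembedding direction via the causal inner product discussed below): add a scaled concept vector to the final-layer context embedding and recompute next-token logits through the unembedding matrix.

```python
# Steering sketch: nudge next-token predictions toward a concept by adding a
# vector to the context embedding before the unembedding step.
# `model`, `tok`, `U` come from the previous sketch; `lambda_C` is an assumed,
# precomputed embedding-space concept vector.
import torch

@torch.no_grad()
def next_token_logits(prompt: str, concept_vec: torch.Tensor, alpha: float) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states[-1]  # (1, seq_len, d)
    # Note: depending on the model/library version, a final layer norm may sit
    # between this hidden state and the logits; this is a sketch, not exact.
    lam = hidden[0, -1] + alpha * concept_vec  # the intervention on the context embedding
    return lam @ U.T                           # recompute logits via the unembedding matrix

prompt = "The ruler of the country is the"
for alpha in (0.0, 4.0):  # alpha = 0 is the unsteered baseline
    logits = next_token_logits(prompt, lambda_C, alpha)
    top = torch.topk(logits, 5).indices.tolist()
    print(alpha, tok.convert_ids_to_tokens(top))
```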

Causal Inner Product

The paper's pivotal contribution is the introduction of a causal inner product: an inner product under which causally separable concepts are orthogonal. The authors show that this inner product provides a unified and coherent view of embedding and unembedding representations (Figure 2).

Figure 2: Projecting counterfactual pairs onto their corresponding concept direction shows a strong right skew, as expected if the linear representation hypothesis holds.
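In the notation used in Figures 3 and 4 ($\bar{\gamma}_W$ for the unembedding representation of a concept $W$, $\bar{\lambda}_W$ for its embedding counterpart), the core requirement and the estimate discussed in the estimation section below can be sketched as follows; this is a hedged reconstruction from the summary, with exact centering and normalization left to the paper.

```latex
% Causally separable concepts W and Z are required to be orthogonal:
\[
  \langle \bar{\gamma}_W, \bar{\gamma}_Z \rangle_{\mathrm{C}} = 0
  \quad \text{whenever } W \text{ and } Z \text{ are causally separable.}
\]
% One concrete estimate (the "whitening" choice described later) uses the
% covariance of the unembedding vectors over the vocabulary:
\[
  \langle \gamma, \gamma' \rangle_{\mathrm{C}} \;=\; \gamma^{\top}\,\mathrm{Cov}(\gamma)^{-1}\,\gamma' .
\]
% Under this inner product, the embedding (steering) representation is, up to
% scale, the image of the unembedding (probe) representation:
\[
  \bar{\lambda}_W \;\propto\; \mathrm{Cov}(\gamma)^{-1}\,\bar{\gamma}_W .
\]
```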

Empirical Validation and Experiments

Concept Subspace Representation

Experiments conducted with LLaMA-2 validate the proposed hypothesis, showing strong alignment of concept pairs in the unembedding space. Projections of counterfactual pairs onto their concept directions show the expected positive skew, supporting the linear representation hypothesis (Figure 3).

Figure 3: The subspace representation $\bar{\gamma}_W$ acts as a linear probe for $W$, illustrating the predictive value of the concept direction.
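Operationally, "acts as a linear probe" can be sketched as a single dot product between a context's final-layer embedding and the concept direction. The sketch below reuses objects from the earlier sketches; the prompts and the informal decision rule are illustrative.

```python
# Probe sketch: score contexts by their inner product with a concept direction.
# `model`, `tok`, and `gamma_gender` are as in the earlier sketches.
import torch

@torch.no_grad()
def context_embedding(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    return model(ids, output_hidden_states=True).hidden_states[-1][0, -1]

female_contexts = ["The queen said that she", "My sister told me that"]
male_contexts = ["The king said that he", "My brother told me that"]

for text in female_contexts + male_contexts:
    score = context_embedding(text) @ gamma_gender  # linear probe: one dot product
    print(f"{score.item():+.2f}  {text}")
# If gamma_gender is a good linear representation, female contexts should score
# consistently higher than male contexts.
```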

Measurement and Intervention

The paper demonstrates that concept representations act effectively as linear probes, providing predictive insights into LLM outputs. Additionally, embedding representations constructed from these subspaces can be used to alter model outputs in controlled experiments, validating the link between linear representations and interventions (Figure 4).

Figure 4: Adding $\alpha\bar{\lambda}_C$ to $\lambda$ changes the target concept $C$ without changing off-target concepts, affirming the role of causally separable vectors.

Practical Implications of the Inner Product

Estimation Approach

The authors propose a method for estimating the causal inner product from the unembedding matrix and explore its implications, emphasizing that the choice of inner product matters in practice because of its role in capturing semantic and causal relationships (Figure 5).

Figure 5: Causally separable concepts are approximately orthogonal under the estimated causal inner product, illustrating semantic coherence.
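A hedged sketch of one natural estimator consistent with this description (the inverse covariance of the unembedding vectors, i.e., whitening, as in the "ruler" discussion further down): use the inverse covariance as the inner-product matrix and check whether directions for causally separable concepts become approximately orthogonal. Here `gamma_language` is assumed to be built from language counterfactual pairs (e.g., English–French) the same way `gamma_gender` was, and the ridge term is a simplification for numerical stability.

```python
# Sketch of the estimated causal inner product: <x, y>_C = x^T Cov(U)^{-1} y,
# where Cov(U) is the covariance of the unembedding vectors over the vocabulary.
# `U` and `gamma_gender` are as in the earlier sketches; `gamma_language` is an
# assumed direction built analogously from language counterfactual pairs.
import torch

Uc = U - U.mean(dim=0, keepdim=True)
cov = (Uc.T @ Uc) / (U.shape[0] - 1)
cov = cov + 1e-4 * torch.eye(cov.shape[0])  # small ridge for stability (illustrative)
M = torch.linalg.inv(cov)                   # inner-product matrix

def causal_cosine(x: torch.Tensor, y: torch.Tensor) -> float:
    num = x @ M @ y
    return (num / ((x @ M @ x).sqrt() * (y @ M @ y).sqrt())).item()

# Causally separable concepts (e.g., gender vs. language) should be close to
# orthogonal under this inner product, even when they are not under the
# Euclidean one.
print("causal cos:", causal_cosine(gamma_gender, gamma_language))
print("euclid cos:", torch.nn.functional.cosine_similarity(
    gamma_gender, gamma_language, dim=0).item())
```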

Conclusion

The paper presents a thorough examination of the linear representation hypothesis, providing formalizations that integrate with existing notions in LLM interpretation and control. By establishing the role of a causal inner product, the research articulates a unified framework that may influence future work on model transparency and control, offering pathways to refine our understanding of LLMs while providing practical methodologies for probing and steering model outputs.


Explain it Like I'm 14

Overview

This paper studies a simple, powerful idea about LLMs like LLaMA-2: many high-level ideas or “concepts” (like male vs. female, English vs. French, past vs. present tense) are stored inside the model as straight-line directions in a mathematical space. If that’s true, we can understand and even control what the model does by doing basic geometry with vectors (arrows) in that space.

The authors make this idea precise, show how different ways of using these directions are connected, and introduce a better way to measure angles and lengths in this space so that the geometry matches how language actually works.

Key Questions

The paper focuses on two friendly questions:

  • What exactly does it mean for a language concept to be “linear,” i.e., a direction in the model’s representation space?
  • What is the right way to measure similarity and do projections in that space, so that the geometry reflects real language structure?

How They Studied It (Methods)

To keep things concrete, here are the main ideas they use.

Concepts and “counterfactual pairs”

  • Think of a concept as something you can switch without changing other things. For example, switching “king” to “queen” changes gender but not the idea of royalty; switching English to French changes language but not meaning.
  • A “counterfactual pair” is a pair of words or phrases that are the same except for one concept. Examples:
    • Gender: (king, queen), (man, woman)
    • Language: (king, roi), (woman, femme)
    • Case: (king, King)
  • These pairs act like before/after snapshots that isolate just one concept change.

Two spaces inside an LLM

LLMs have two relevant vector spaces:

  • Input/Context space (embedding space): vectors that represent the current sentence or context the model is reading.
  • Output/Word space (unembedding space): vectors that represent the possible next words.

The paper defines a “linear representation” of a concept in both spaces:

  • In the output (word) space: concept = one direction all relevant counterfactual pairs line up with. For example, arrows like “queen − king” and “woman − man” point roughly in the same direction: the gender direction.
  • In the input (context) space: concept = one direction that, when added to the context vector, raises the chance of that concept showing up (e.g., switching outputs toward “queen” over “king”) without affecting unrelated concepts.

These two views connect to two useful tools (a toy sketch follows the list below):

  • Measuring a concept (linear probing): use a simple dot product with a direction to predict the concept (e.g., “is this context French?”).
  • Steering a concept (intervention/editing): add a direction to push the output toward a concept (e.g., make the next word more likely to be French or female).
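To make the two tools concrete, here is a toy picture with made-up 2-D vectors; real models use thousands of dimensions, but the arithmetic is the same.

```python
# Toy picture with made-up 2-D vectors: one dot product reads a concept (probe),
# one vector addition pushes toward it (steer).
import numpy as np

gender_direction = np.array([1.0, 0.0])   # pretend axis 0 encodes male -> female
context = np.array([-0.8, 0.3])           # a context that currently "reads" male

probe_score = context @ gender_direction  # negative: the probe says "male"
steered = context + 1.6 * gender_direction
steered_score = steered @ gender_direction  # positive: now it reads "female"

print(probe_score, steered_score)         # roughly -0.8 and +0.8
```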

Choosing the right “ruler”: a causal inner product

  • To compare directions (similarity, angle, projection), you need a “ruler”—a way to measure lengths and angles called an inner product. The usual one is Euclidean, but the authors argue this may not match language structure.
  • They propose a “causal inner product” that makes causally independent concepts perpendicular. For example, language (English vs. French) should be perpendicular to gender (male vs. female), because you can vary them separately.
  • They show how to estimate a practical version of this inner product from the model’s word vectors using statistics (essentially whitening: use the inverse of the word-vector covariance matrix). This turns out to better reflect true separations between concepts.

Experiments with LLaMA-2

They test these ideas on LLaMA-2 (7B):

  • Build concept directions from many counterfactual pairs, including classic word analogies (like king:queen = man:woman) and language pairs (English–French, French–Spanish, etc.).
  • Show that these directions:
    • Align well across many pairs for the same concept (evidence of linearity).
    • Act as simple linear probes to detect concepts in context.
    • Serve as steering vectors that shift the model’s next-word probabilities toward the target concept without disturbing unrelated concepts.
  • Compare different inner products (Euclidean vs. their causal one) and show the causal one better matches meaningful separations between concepts.

What They Found (Results)

Here are the main takeaways:

  • Linear concept directions exist. Across many word pairs, the difference vectors roughly point in the same direction for the same concept. So “concepts as directions” is a good model.
  • Measurement and control are two sides of the same coin. The concept direction in the output space explains why simple linear probes work. The concept direction in the input space explains why adding a vector can steer the model’s behavior.
  • A better geometry matters. Using their causal inner product:
    • Concepts that can vary independently (like language vs. gender) become perpendicular, which makes sense.
    • The two kinds of concept directions (input and output) can be unified, so you can build steering vectors from word-space directions in a principled way.
    • Similarity heatmaps show clearer blocks and separations that match real linguistic structure, better than the default Euclidean view.
  • Practical steering works. Adding the “female” direction to the context increases the chance of “queen” over “king,” while leaving capitalization or language largely unchanged. Similarly for language directions, and so on.

Why It Matters (Implications)

This work gives a cleaner, more reliable way to understand and control LLMs:

  • Interpretability: You can discover concept directions from counterfactual word pairs, then use them to probe what the model “thinks” about a context.
  • Controllability: You can edit model behavior in predictable ways (e.g., nudge the next word toward French or toward feminine forms) without accidentally changing unrelated properties.
  • Better tools via better geometry: Picking the right inner product—the “causal” one—makes the math line up with how language concepts combine, leading to more meaningful similarity and projection operations.
  • Unifying ideas: The paper ties together three popular notions—subspace directions, linear probes, and steering vectors—into one consistent picture.

In simple terms: the paper shows that many language features inside LLMs behave like sliders on a soundboard. If you know the right direction for a concept, you can read it (probe) or adjust it (steer). And if you use the right “ruler” to measure directions, those sliders become cleanly separated, making interpretation and control safer and more dependable.
