On the Origins of Linear Representations in Large Language Models

(2403.03867)
Published Mar 6, 2024 in cs.CL, cs.LG, and stat.ML

Abstract

Recent works have argued that high-level semantic concepts are encoded "linearly" in the representation space of LLMs. In this work, we study the origins of such linear representations. To that end, we introduce a simple latent variable model to abstract and formalize the concept dynamics of the next token prediction. We use this formalism to show that the next token prediction objective (softmax with cross-entropy) and the implicit bias of gradient descent together promote the linear representation of concepts. Experiments show that linear representations emerge when learning from data matching the latent variable model, confirming that this simple structure already suffices to yield linear representations. We additionally confirm some predictions of the theory using the LLaMA-2 large language model, giving evidence that the simplified model yields generalizable insights.

Figure: Steering vectors for matching concepts in LLaMA-2 align nontrivially, in contrast to the near-orthogonal representations of unrelated concepts.

Overview

  • The paper introduces a theoretical framework to explain the emergence of linear representations in LLMs through a latent variable model.

  • It shows that linear representations in LLMs arise due to log-odds matching and the implicit bias of gradient descent in the optimization process.

  • The research finds that unrelated concepts are represented orthogonally in the models, aligning with empirical observations.

  • Empirical validation on simulated data and analysis on the LLaMA-2 model support the theoretical insights regarding linear and orthogonal representations.

On the Origins of Linear Representations in LLMs

In the landscape of interpretability research for language models, the encoding of high-level semantic concepts within model representations presents a fascinating area of study. A recurring observation in this domain is the linear nature of these representations. This post explores a paper that provides a theoretical framework for explaining the emergence of such linear representations in LLMs.

Latent Variable Model for LLMs

The paper introduces a latent variable model designed to abstract and analyze the concept dynamics inherent in next token prediction tasks—central to the functioning of LLMs. This model posits a latent space, represented as a set of binary variables, each embodying a distinct 'concept.' These latent concepts, ranging from grammatical structures to thematic elements, serve as the underlying drivers for the generation of tokens (words or characters) and context sentences.

Crucially, the model captures the relationship between context sentences, latent concepts, and next tokens through a formal structure. It assumes that each context sentence conveys partial information about the latent concepts, which, in turn, probabilistically determine the next token. The learning objective for LLMs, thus, focuses on accurately estimating these conditional probabilities.
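
To make the generative story concrete, here is a minimal sketch of such a latent variable model in Python. The number of concepts, the vocabulary size, the noise level, and the assumption that each concept contributes a fixed direction to the next-token logits are illustrative choices, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

n_concepts = 4    # binary latent concepts c in {0, 1}^4 (illustrative size)
vocab_size = 32   # toy vocabulary (illustrative size)

# Assumption for illustration: each concept contributes a fixed direction
# to the next-token logits.
concept_directions = rng.normal(size=(n_concepts, vocab_size))

def sample_concepts():
    """Draw a latent concept vector; each binary concept is an independent coin flip."""
    return rng.integers(0, 2, size=n_concepts)

def sample_context(concepts, noise=0.3):
    """A 'context' here is a noisy view of the latent concepts, standing in for a
    sentence that conveys only partial information about them."""
    flip = rng.random(n_concepts) < noise
    return np.where(flip, 1 - concepts, concepts)

def next_token_distribution(concepts):
    """Softmax over logits that depend linearly on the active concepts."""
    logits = concepts @ concept_directions
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

concepts = sample_concepts()
context = sample_context(concepts)
next_token = rng.choice(vocab_size, p=next_token_distribution(concepts))
print(concepts, context, next_token)
```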

Insights into Linear Representations

The paper rigorously shows that under this model, concepts are indeed linearly represented in the learned representation space. This phenomenon is discussed from two key perspectives:

  1. Log-Odds Matching: Mirroring findings from earlier research on word embeddings, the paper demonstrates that a condition known as 'log-odds matching' leads to linear structure. The condition requires the learned conditional probabilities to reproduce the true log-odds between candidate tokens, which in turn forces concept representations into a linear arrangement (a schematic statement of the condition follows this list).
  2. Implicit Bias of Gradient Descent: More significantly, the paper highlights the role of gradient descent's implicit bias in fostering linear representations. It shows that optimizing specific sub-tasks within the LLM objective with gradient descent naturally drives the model toward linearly encoding concepts in the representation space.
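
Schematically, and as an informal paraphrase rather than the paper's exact statement, log-odds matching asks the learned next-token distribution to reproduce the true log-odds between candidate tokens for the relevant contexts and token pairs.

```latex
% Informal paraphrase of the log-odds matching condition: the learned
% distribution \hat{p} matches the true log-odds between tokens y and y'
% for the relevant contexts x (not the paper's exact formal statement).
\[
  \log \frac{\hat{p}(y \mid x)}{\hat{p}(y' \mid x)}
  \;=\;
  \log \frac{p(y \mid x)}{p(y' \mid x)} .
\]
```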

The practical implications of these results are profound. They suggest that the observed linear structure of concept representations in LLMs is not an artifact of model architecture but arises due to the learning dynamics and the optimization process.

Orthogonal Representations of Concepts

An interesting extension of the discussion on linear representations is the exploration of concept orthogonality. The paper shows that unrelated concepts, those not sharing direct probabilistic dependencies, tend to be represented orthogonally in the unembedding space. This aligns with empirical observations that Euclidean geometry captures semantic structure in LLMs, even though the training objective does not single out any particular inner product.
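
A simple way to probe this claim on a trained model is to estimate one direction per concept in the unembedding space and measure pairwise cosine similarity. The sketch below uses a random matrix as a stand-in for a learned unembedding and takes a concept direction to be a difference of mean token unembeddings; the token index sets and this estimator are illustrative assumptions rather than the paper's construction.

```python
import numpy as np

def concept_direction(unembedding, pos_tokens, neg_tokens):
    """Illustrative concept direction: mean unembedding of tokens where the
    concept is 'on' minus the mean where it is 'off'."""
    return unembedding[pos_tokens].mean(axis=0) - unembedding[neg_tokens].mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-in for a learned unembedding matrix (vocab_size x dim).
rng = np.random.default_rng(1)
unembedding = rng.normal(size=(1000, 64))

# Hypothetical token index sets for two unrelated concepts.
dir_a = concept_direction(unembedding, pos_tokens=[1, 2, 3], neg_tokens=[4, 5, 6])
dir_b = concept_direction(unembedding, pos_tokens=[7, 8, 9], neg_tokens=[10, 11, 12])

# Under the paper's claim, unrelated concepts should give a cosine near zero.
print(cosine(dir_a, dir_b))
```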

Empirical Validation

The theoretical insights are further substantiated through experiments conducted on simulated data, confirming the emergence of linear and orthogonal representations in accordance with the predictions of the latent variable model. Additionally, analyses performed on the LLaMA-2 model reveal alignment between embedding and unembedding representations for matching concepts, lending further credence to the paper's theoretical contributions.
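
The LLaMA-2 analysis can be thought of as comparing embedding-space steering vectors with unembedding-space concept directions, one per concept, and checking that matching concepts align more strongly than mismatched ones. The sketch below illustrates that comparison with plain cosine similarity on synthetic stand-in vectors; the shapes, the similarity measure, and the synthetic data are assumptions for illustration, not an extraction of the paper's actual measurements.

```python
import numpy as np

def alignment_matrix(steering_vectors, unembedding_directions):
    """Cosine similarity between every (embedding-space) steering vector and
    every (unembedding-space) concept direction. Under the alignment claim,
    the diagonal (matching concepts) should dominate the off-diagonal entries."""
    S = steering_vectors / np.linalg.norm(steering_vectors, axis=1, keepdims=True)
    U = unembedding_directions / np.linalg.norm(unembedding_directions, axis=1, keepdims=True)
    return S @ U.T

# Synthetic stand-ins; in practice these would be extracted from a model such as LLaMA-2.
rng = np.random.default_rng(2)
steering = rng.normal(size=(5, 64))                         # one hypothetical steering vector per concept
unembed_dirs = steering + 0.5 * rng.normal(size=(5, 64))    # correlated with the steering vectors, for illustration

print(np.round(alignment_matrix(steering, unembed_dirs), 2))
```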

Concluding Remarks

This paper makes significant strides in demystifying the phenomenon of linearly encoded representations in LLMs. By leveraging a simple yet effective latent variable model, it provides a compelling theoretical basis for understanding how high-level semantic concepts are represented within these models. Moreover, the findings underscore the intricate interplay between model learning objectives, optimization dynamics, and the resultant geometrical structure of representations.

The implications of this research are far-reaching, opening avenues for further inquiries into the interpretability of LLMs and the optimization strategies that shape their learning process. It invites us to reevaluate our understanding of how abstract concepts are encoded and manipulated within the confines of large-scale machine learning models.
