
Abstract

How do sequence models represent their decision-making process? Prior work suggests that an Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.

Figure: Self-attention in the model, showing how each token attends to others in the sequence.

Overview

  • The paper investigates how sequence models represent what they know, using an Othello-playing transformer named OthelloGPT as a case study, and shows that it builds internal linear representations of game states and of its decision-making process.

  • Researchers used probing techniques to decode the game board state from OthelloGPT's internal activations, finding that a linear probe is highly accurate when the board is framed as 'mine' vs 'yours' tiles rather than as black vs white.

  • The study also demonstrates that simple vector arithmetic can effectively manipulate the model's understanding and behavior, providing greater interpretability and control over AI decision-making.

Understanding Linear Representations in OthelloGPT: A Peek into Sequence Models

Introduction

LLMs are powerful, but they often feel like black boxes. They make decisions in ways that aren't always clear, which can be frustrating if you're trying to understand or improve them. This paper dives into how these models internally represent decision-making processes and reveals something quite interesting about an Othello-playing model named OthelloGPT.

Demystifying OthelloGPT

The Basics of Othello

Othello is a two-player game played on an 8x8 grid. Players take turns placing discs, and a legal move must sandwich at least one of the opponent's discs in a straight line (horizontal, vertical, or diagonal) between the newly placed disc and another disc of your color; the sandwiched discs then flip to your color. The goal is to finish with the majority of discs showing your color. A minimal sketch of this rule appears just below.
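
The flip-and-capture rule is easy to state in code. Here is a minimal sketch in Python (a hypothetical helper, not code from the paper), where a board is an 8x8 grid of 0 (empty), +1 (the current player's discs), and -1 (the opponent's):

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]

def flips_for_move(board, row, col, player):
    """Return the opponent discs that would flip if `player` plays at (row, col)."""
    if board[row][col] != 0:
        return []  # square already occupied
    flipped = []
    for dr, dc in DIRECTIONS:
        run, r, c = [], row + dr, col + dc
        # Walk along a contiguous run of opponent discs...
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == -player:
            run.append((r, c))
            r, c = r + dr, c + dc
        # ...and keep it only if it ends on one of the player's own discs.
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(run)
    return flipped
```

A move is legal exactly when it flips at least one disc, which is the property OthelloGPT learns to predict.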

The OthelloGPT Model

OthelloGPT is an 8-layer transformer model trained to predict legal moves in Othello based on sequences of prior moves. It was trained autoregressively, meaning it learned to predict the next move given the sequence of all prior moves. Crucially, it had no prior knowledge of the game's rules—just the sequences of moves.
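
As a rough illustration of that objective (a sketch under assumptions, not the paper's training code), one step of next-move training in PyTorch might look like this, assuming a vocabulary of the 60 playable squares and a `model` that maps move-token sequences to next-move logits:

```python
import torch.nn.functional as F

VOCAB_SIZE = 60  # 64 squares minus the 4 pre-filled centre squares

def training_step(model, moves, optimizer):
    """moves: LongTensor of shape (batch, seq_len) holding move tokens."""
    inputs, targets = moves[:, :-1], moves[:, 1:]   # shift by one: predict the next move
    logits = model(inputs)                          # (batch, seq_len - 1, VOCAB_SIZE)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```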

Linear Representations: What's the Buzz?

Experiment Setup

The researchers behind this paper explored whether OthelloGPT encodes its understanding of the game board in a linear way. They trained probes, small classifier models that read properties off hidden states, on sequences of Othello games to decode the game board state directly from the model's internal activations.
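
As a concrete, hedged sketch of what such a probe looks like, here is a linear probe for one board square in PyTorch. The names `activations` and `labels` are stand-ins for the real cached data, and the 512-dimensional residual stream is an assumption about the model's width:

```python
import torch
import torch.nn.functional as F

d_model, n_classes = 512, 3  # residual width (assumed) and {empty, mine, yours}

# Stand-ins for real data: cached residual-stream activations at one layer,
# and the ground-truth state of a single board square at each position.
activations = torch.randn(10_000, d_model)        # replace with a real activation cache
labels = torch.randint(0, n_classes, (10_000,))   # replace with real tile states

probe = torch.nn.Linear(d_model, n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1_000):
    loss = F.cross_entropy(probe(activations), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probe accuracy: how often the tile's state can be read off linearly.
accuracy = (probe(activations).argmax(-1) == labels).float().mean().item()
```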

Key Findings

OthelloGPT does learn the board's abstract layout, and it encodes that layout linearly and relative to the current player:

  • Board State Encoding: Instead of tracking tiles as simply black or white, it represents them as "mine" versus "yours" (i.e., the current player's versus the opponent's); a minimal relabeling sketch follows this list.
  • High Accuracy: The paper reports exceptionally high probe accuracy, with nearly perfect results from layer 4 onwards.
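
Because colour alternates with whose turn it is, the "mine"/"yours" labels the probe targets are a simple relabeling of the black/white board by move parity. A minimal sketch, assuming boards stored as +1 (black), -1 (white), 0 (empty), with black moving first:

```python
def to_relative(board, ply):
    """Return the board with +1 = current player's discs, -1 = opponent's."""
    player = 1 if ply % 2 == 0 else -1  # black is "me" on even plies
    return [[cell * player for cell in row] for row in board]
```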

Take a look at these probing results for various methods:

| Method                           | Accuracy (Layer 7) |
|----------------------------------|--------------------|
| Randomized Baseline              | 34.5%              |
| Probabilistic Baseline           | 61.8%              |
| Linear (Black, White, Empty)     | 74.4%              |
| Non-Linear (Black, White, Empty) | 98.3%              |
| Linear (Mine, Yours, Empty)      | 99.5%              |

Linear (Mine, Yours, Empty) clearly outperforms the others.

Changing the Model's Mind

Intervention Technique

The researchers also showed they could manipulate OthelloGPT's understanding by performing vector arithmetic on its hidden states. By nudging the activations in specific directions—such as "mine" or "empty"—they could effectively "deceive" the model into believing the board was in a different state.
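
A minimal sketch of this kind of edit, with hypothetical names: `probe_dirs[tile][state]` stands for the probe's unit-normalized weight vector for a tile/state pair, `resid` for the residual-stream vector at the intervened layer and position, and `alpha` for a tunable scale, all assumptions of this sketch:

```python
def flip_tile_to_mine(resid, tile, probe_dirs, alpha=4.0):
    """Nudge a residual-stream vector so the probe reads `tile` as 'mine'."""
    # Subtract the direction encoding the current belief, add the desired one.
    return resid - alpha * probe_dirs[tile]["yours"] + alpha * probe_dirs[tile]["mine"]
```

Applied at a middle layer (for example via a PyTorch forward hook), the downstream layers treat the edited board state as real, and the predicted legal moves shift to match.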

Practical Success

This technique gave practical control over the model's predictions, letting the researchers change OthelloGPT's behavior more simply and interpretably than earlier interventions, which relied on gradient-based and more complex manipulation of the activations.

Linear Interpretations Unlock More Insights

Empty Tile Detection

Empty tiles turned out to be another linearly encoded feature. The researchers showed that specific attention heads in the first layer "broadcast" which moves have been played, and this tells the model which tiles are empty.
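
That division of labor makes sense: a tile is empty exactly when its move token has not yet appeared in the sequence (apart from the four pre-filled centre squares), so broadcasting the played moves is all the information needed. A toy illustration of the computation, not the model's actual circuit:

```python
CENTRE = {(3, 3), (3, 4), (4, 3), (4, 4)}
ALL_TILES = {(r, c) for r in range(8) for c in range(8)}

def empty_tiles(moves_played):
    """moves_played: the set of (row, col) squares seen so far in the game."""
    return ALL_TILES - CENTRE - set(moves_played)
```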

Flipped Tiles

The model also linearly encodes which tiles get flipped by each move, offering a handle on how the board representation is updated move by move.

Multiple Circuits Hypothesis

Interestingly, the researchers found that as the game approaches the end, OthelloGPT sometimes predicts legal moves before fully computing the board state. This suggests it might use simpler, quicker circuits when possible, especially in positions where the entire board doesn’t need to be meticulously calculated.

Key Takeaways and Implications

  1. Better Interpretability: These insights into linear representations make the model less of a black box.
  2. Improved Control: Simple vector arithmetic can steer the model’s decisions, making it a useful tool for controlled AI applications.
  3. Theoretical Implications: This work supports the hypothesis that LLMs often use linear representations internally, providing a base for further research.

Looking Forward

The study of linear representations in AI is still in its early days, but this paper shows it's a fruitful area. Future research could explore why these linear representations emerge and how they can be harnessed for more applications.

By understanding these inner workings, data scientists can better harness the power of LLMs, applying them more effectively and confidently in various fields. It's a fascinating peek under the hood of these sophisticated models and a step towards making AI more transparent and controllable.
