
Structured World Representations in Maze-Solving Transformers

(2312.02566)
Published Dec 5, 2023 in cs.LG and cs.AI

Abstract

Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that of solving mazes. In this work, we focus on the abstractions formed by these models and find evidence for the consistent emergence of structured internal representations of maze topology and valid paths. We demonstrate this by showing that the residual stream of only a single token can be linearly decoded to faithfully reconstruct the entire maze. We also find that the learned embeddings of individual tokens have spatial structure. Furthermore, we take steps towards deciphering the circuitry of path-following by identifying attention heads (dubbed "adjacency heads"), which are implicated in finding valid subsequent tokens.

Sweep results show that models can perform well without a linearly decodable maze representation, though the top-performing models do exhibit one.

Overview

  • The study examines how simple transformer AI models can learn to solve mazes, potentially offering insights into the strategies of more complex transformer models.

  • Two transformer models were trained on different types of mazes, using tokens to represent maze pathways and solving them as a sequence prediction problem.

  • Interpretability techniques unveiled that transformers can internalize maze layouts in their hidden layers with specific 'adjacency heads' playing a pivotal role.

  • 'Grokking'-like spikes in performance indicated a link between structured understanding of mazes and the generalization abilities of the transformers.

  • The research contributes to the broader goal of understanding if AI can abstract complex structures, informing the design of network architectures for various tasks.

Introduction to Transformers in Maze-Solving

Transformers, originally designed for language processing, are intriguing for their potential across a wide range of tasks, including maze-solving. A large body of interpretability research has dissected these networks to better understand their learning mechanisms, uncovering mechanistic components such as induction heads that support sequence completion. Understanding simple transformers, the starting point of this study, could offer clues about the learning strategies of their more complex counterparts.

Experimenting with Maze Transformers

The investigation trains small transformers to solve mazes presented as sequences of tokens. Two primary models were trained: a smaller one on "forkless" mazes and a larger one on complex mazes with multiple decision points. Mazes were generated with the randomized depth-first search (RDFS) algorithm and its variants, covering a range of connectivity challenges. The models' task, predicting the token sequence that solves the maze, can be seen as a form of offline reinforcement learning cast as text prediction.
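A minimal sketch of this setup, using an illustrative tokenization in which each maze is serialized as an adjacency list followed by an origin, target, and solution path; the token names and exact format here are assumptions, not necessarily the paper's scheme:

```python
import random

def generate_rdfs_maze(n):
    """Generate an n x n maze with randomized depth-first search.
    Returns a set of undirected edges (carved passages) between adjacent cells."""
    visited, edges = {(0, 0)}, set()
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        neighbors = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < n and 0 <= y + dy < n
                     and (x + dx, y + dy) not in visited]
        if neighbors:
            nxt = random.choice(neighbors)
            edges.add(frozenset(((x, y), nxt)))  # carve a passage to the chosen neighbor
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()  # dead end: backtrack
    return edges

def tokenize_maze(edges, origin, target, path):
    """Serialize a maze and its solution as a flat token sequence."""
    tokens = ["<ADJLIST_START>"]
    for edge in edges:
        a, b = sorted(edge)
        tokens += [f"({a[0]},{a[1]})", "<-->", f"({b[0]},{b[1]})", ";"]
    tokens += ["<ADJLIST_END>", "<ORIGIN>", f"({origin[0]},{origin[1]})",
               "<TARGET>", f"({target[0]},{target[1]})", "<PATH_START>"]
    tokens += [f"({c[0]},{c[1]})" for c in path]
    tokens.append("<PATH_END>")
    return tokens
```

Trained autoregressively on such sequences, the model sees the maze description and endpoints as a prompt and must predict the path tokens that follow.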

Insights from Interpretability Techniques

The study applied multiple interpretability methods to unpack the trained models. A single token's residual stream can be linearly decoded to reconstruct the maze's layout, implying an internalized representation within the transformer's hidden activations. Attention analysis revealed specific heads, dubbed "adjacency heads," implicated in identifying valid subsequent tokens along the maze's pathways. Together, these findings suggest that the residual stream at even a single token position can carry information about the entire structure of the maze.
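A minimal sketch of such a linear probe, assuming the residual-stream activations at a single token position have already been extracted into an array `acts` of shape `[n_mazes, d_model]` and the maze connectivity flattened into binary labels `edges` of shape `[n_mazes, n_possible_edges]`; these names and shapes are illustrative rather than the paper's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_connectivity_probe(acts, edges, train_frac=0.8):
    """Fit one linear probe per possible maze connection and report held-out accuracy."""
    n = int(train_frac * len(acts))
    probes, accs = [], []
    for j in range(edges.shape[1]):
        y_train, y_test = edges[:n, j], edges[n:, j]
        if y_train.min() == y_train.max():
            # degenerate column (connection always present or always absent in training data)
            probes.append(None)
            accs.append(float((y_test == y_train[0]).mean()))
            continue
        clf = LogisticRegression(max_iter=1000).fit(acts[:n], y_train)
        probes.append(clf)
        accs.append(clf.score(acts[n:], y_test))
    return probes, float(np.mean(accs))
```

Held-out accuracy near chance would indicate no linearly decodable maze representation at that token position; faithfully reconstructing the maze corresponds to accuracy well above that.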

Understanding Model Learning and Representation

Probing experiments showed that the models do learn structured representations of mazes. The study also observed 'grokking'-like spikes, where task performance abruptly improved, often coinciding with an improved ability to decode maze representations from the residual stream. This hints at a potential causal link between structured internal representations and the transformers' generalization capabilities.
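One way to examine this relationship is to track task accuracy and probe accuracy together across saved training checkpoints. The sketch below reuses the probe from the previous snippet and assumes hypothetical helpers `get_activations` and `solve_accuracy` for extracting activations and evaluating maze-solving performance; both are placeholders, not functions from the paper's codebase:

```python
def probe_vs_performance(checkpoints, mazes):
    """For each (step, model) checkpoint, record maze-solving accuracy and
    linear-probe decoding accuracy, so abrupt jumps in the two can be compared."""
    history = []
    for step, model in checkpoints:
        acts, edges = get_activations(model, mazes)      # assumed extraction helper
        _, probe_acc = fit_connectivity_probe(acts, edges)
        history.append({
            "step": step,
            "solve_acc": solve_accuracy(model, mazes),   # assumed evaluation helper
            "probe_acc": probe_acc,
        })
    return history
```

A grokking-like jump in solving accuracy that coincides with a jump in probe accuracy is the pattern described above.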

In essence, the study advances our understanding of how transformers can abstract complex environmental structures, and how those abstractions are reflected in their maze-solving performance. The results serve as a springboard for future work on whether such representations are a common thread across network architectures and tasks.
