
In-Context Language Learning: Architectures and Algorithms

arXiv:2401.12973
Published Jan 23, 2024 in cs.CL and cs.LG

Abstract

Large-scale neural language models exhibit a remarkable capacity for in-context learning (ICL): they can infer novel functions from datasets provided as input. Most of our current understanding of when and how ICL arises comes from LMs trained on extremely simple learning problems like linear regression and associative recall. There remains a significant gap between these model problems and the "real" ICL exhibited by LMs trained on large text corpora, which involves not just retrieval and function approximation but free-form generation of language and other structured outputs. In this paper, we study ICL through the lens of a new family of model problems we term in-context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in-context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models (including several RNNs, Transformers, and state-space model variants) on regular ICLL tasks, aiming to answer three questions: (1) Which model classes are empirically capable of ICLL? (2) What algorithmic solutions do successful models implement to perform ICLL? (3) What architectural changes can improve ICLL in less performant models? We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that their ability to do so relies on specialized "n-gram heads" (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on ICLL, but natural language modeling -- improving the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.

Overview

  • This paper explores in-context language learning (ICLL), focusing on the ability of language models to understand and generate unfamiliar formal languages.

  • ICLL model problems test neural networks' ability to classify and generate strings from an unfamiliar formal language, providing linguistically structured yet compositionally complex tasks.

  • The study evaluates different neural architectures, such as RNNs, Transformers, and state-space variants, in their ability to perform ICLL tasks.

  • Transformers outperformed other models on ICLL, with much of their advantage traced to specialized 'n-gram heads' (higher-order variants of induction heads).

  • The introduction of n-gram heads into both Transformers and other architectures improved performance, suggesting benefits from integrating traditional language modeling mechanisms.

Introduction

The advent of powerful neural language models has been accompanied by growing interest in in-context learning (ICL), where models adapt to new functions or distributions based on examples provided in their input. However, understanding and improving ICL in large-scale language models remains a complex challenge. To make the problem tractable, the authors home in on in-context language learning (ICLL), a subset of the broader ICL phenomenon in which models must reason compositionally about sequences from formal languages.

ICLL Model Problems

ICLL model problems provide a structured framework for probing neural networks' abilities to classify and generate strings belonging to an unfamiliar formal language. The task is defined as follows: a model is given strings sampled from a randomly generated language and must infer the underlying distribution well enough to recognize or generate further strings from it. This setup advances the study of ICL by presenting problems that are linguistically structured yet compositionally complex, closer in kind to the tasks faced by large-scale language models.
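
As a concrete illustration, the sketch below builds a toy ICLL problem instance: it draws a random finite automaton and samples a prompt of strings from it. The construction here (state count, alphabet, stopping rule, delimiter) is a simplifying assumption for illustration and does not reproduce the paper's exact sampling procedure.

```python
import random

def random_automaton(n_states=4, alphabet="abc", seed=0):
    """Toy random automaton: each state allows a random subset of symbols,
    each symbol leading to a random next state."""
    rng = random.Random(seed)
    return {
        s: {sym: rng.randrange(n_states)
            for sym in rng.sample(alphabet, rng.randint(1, len(alphabet)))}
        for s in range(n_states)
    }

def sample_string(transitions, rng, max_len=10, stop_prob=0.25):
    """Random walk over the automaton, emitting one symbol per transition."""
    state, out = 0, []
    while len(out) < max_len and rng.random() > stop_prob:
        sym = rng.choice(sorted(transitions[state]))
        out.append(sym)
        state = transitions[state][sym]
    return "".join(out)

# An ICLL prompt: several strings from one random language, joined by a delimiter.
# A model sees this prompt and must continue with strings from the same language.
transitions = random_automaton(seed=1)
rng = random.Random(2)
prompt = " | ".join(sample_string(transitions, rng) for _ in range(5))
print(prompt)
```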

Methodology

To measure how well different neural architectures handle ICLL, the authors systematically evaluated a range of sequence models, from RNNs and Transformers to recent state-space variants, on tasks derived from regular languages represented by probabilistic finite automata. The study pursued three objectives: determining which model classes can perform ICLL, uncovering the algorithmic solutions and circuits implemented by successful models, and exploring whether these insights can inform architectural improvements in less capable models.
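
One plausible way to score a model on such tasks, shown below as an illustration rather than the paper's exact metric, is to run the ground-truth automaton over a prefix and measure how much of the model's next-token probability mass falls on symbols the language actually allows. The `model_probs` interface is hypothetical, and the snippet continues the toy `transitions` automaton from the earlier sketch.

```python
def valid_next_symbols(transitions, prefix, start_state=0):
    """Run the automaton over `prefix` and return the symbols that may
    legally follow it (the support of the true next-token distribution)."""
    state = start_state
    for sym in prefix:
        if sym not in transitions[state]:
            return set()  # prefix has already left the language
        state = transitions[state][sym]
    return set(transitions[state])

def in_language_mass(model_probs, transitions, prefix):
    """Fraction of a model's next-token probability mass that the ground-truth
    automaton allows after `prefix`. `model_probs` maps symbol -> probability
    (a hypothetical interface standing in for any trained sequence model)."""
    valid = valid_next_symbols(transitions, prefix)
    return sum(p for sym, p in model_probs.items() if sym in valid)

# Example with a uniform "model" over the toy automaton's alphabet:
uniform = {sym: 1 / 3 for sym in "abc"}
print(in_language_mass(uniform, transitions, prefix="ab"))
```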

Results and Insights

The findings were multifaceted. Transformers demonstrated a clear advantage on ICLL tasks over their recurrent and convolutional counterparts. Much of this advantage was traced to specialized "n-gram heads" that compute next-token distributions conditioned on the preceding tokens, much as a classical n-gram model does. Through analysis of attention patterns, representational probing, and behavioral evaluation, these n-gram heads were identified as a cornerstone of effective ICLL in Transformers.
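
The computation such a head appears to implement can be written down directly: find earlier positions in the context whose preceding (n-1) tokens match the current suffix, and predict whatever followed them. The function below is a plain-Python rendering of that idea (for n = 2 it reduces to the standard induction-head pattern); it is an interpretation of the mechanism, not code from the paper.

```python
from collections import Counter

def ngram_head_prediction(context, n=2):
    """In-context n-gram statistic: match the current (n-1)-token suffix
    against earlier positions and average over the tokens that followed."""
    suffix = tuple(context[-(n - 1):]) if n > 1 else ()
    counts = Counter()
    for i in range(n - 1, len(context)):
        if tuple(context[i - (n - 1):i]) == suffix:
            counts[context[i]] += 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

# n=2 behaves like an induction head: copy what followed earlier occurrences
# of the current token; larger n conditions on longer suffixes.
print(ngram_head_prediction(list("abcabcab"), n=3))  # {'c': 1.0}
```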

Architectural Improvements

Drawing on these insights, the authors hard-wired n-gram heads into both Transformer and non-Transformer architectures. This augmentation not only boosted performance on synthetic ICLL tasks but also reduced perplexity in natural language modeling, by up to 1.14 points (6.7%) for 340M-parameter models on the SlimPajama dataset. The success of these insertions supports the idea that language models may benefit from explicit mechanisms reminiscent of more traditional language modeling algorithms.
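
As a rough sketch of what hard-wiring such a head might look like (assuming PyTorch, and not reproducing the paper's exact layer design or how its output is mixed back into the backbone), the module below attends uniformly from each position to earlier positions preceded by the same (n-1)-gram and averages their token embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardNGramHead(nn.Module):
    """Hard-wired n-gram head (illustrative): each position attends uniformly
    to earlier positions whose preceding (n-1) tokens match its own, and
    returns the average embedding of the tokens found there."""

    def __init__(self, vocab_size, d_model, n=2):
        super().__init__()
        assert n >= 2, "sketch assumes at least bigram context"
        self.n = n
        self.value = nn.Embedding(vocab_size, d_model)

    def forward(self, tokens):                      # tokens: (batch, seq) token ids
        B, T = tokens.shape
        k = self.n - 1
        # Each position's preceding (n-1) tokens; the left edge is padded with -1
        # so positions with incomplete context cannot spuriously match real tokens.
        padded = F.pad(tokens, (k, 0), value=-1)
        grams = padded.unfold(1, k, 1)[:, :T]       # (B, T, k)
        same = (grams.unsqueeze(2) == grams.unsqueeze(1)).all(-1)   # (B, T, T)
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool,
                                       device=tokens.device), diagonal=-1)
        attn = (same & causal).float()
        attn = attn / attn.sum(-1, keepdim=True).clamp(min=1.0)     # uniform over matches
        return attn @ self.value(tokens)            # (B, T, d_model)

# Example: a bigram head over random token ids.
head = HardNGramHead(vocab_size=10, d_model=16, n=2)
out = head(torch.randint(0, 10, (1, 12)))           # -> shape (1, 12, 16)
```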

Conclusion

The study of ICLL gives a clearer picture of how large-scale language models manage ICL. The effectiveness of n-gram heads indicates that part of this capability rests on mechanisms with roots in classical language modeling, both sharpening our understanding of what successful sequence models compute in context and pointing to concrete architectural improvements.
