
What Do Language Models Learn in Context? The Structured Task Hypothesis

(arXiv:2406.04216)
Published Jun 6, 2024 in cs.CL and cs.LG

Abstract

LLMs exhibit an intriguing ability to learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL). Understandably, a swath of research has been dedicated to uncovering the theories underpinning ICL. One popular hypothesis explains ICL by task selection. LLMs identify the task based on the demonstration and generalize it to the prompt. Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration. Finally, a third hypothesis argues that LLMs use the demonstration to select a composition of tasks learned during pre-training to perform ICL. In this paper, we empirically explore these three hypotheses that explain LLMs' ability to learn in context with a suite of experiments derived from common text classification tasks. We invalidate the first two hypotheses with counterexamples and provide evidence in support of the last hypothesis. Our results suggest an LLM could learn a novel task in context via composing tasks learned during pre-training.

Three hypotheses illustrated.

Overview

  • This paper investigates how LLMs manage to learn new tasks given in-context demonstrations by evaluating three hypotheses: task selection, meta-learning, and structured task composition.

  • Experimental results using LLaMA2 models of varying sizes on multiple datasets provide empirical evidence against the task selection and meta-learning hypotheses, while supporting the structured task composition hypothesis.

  • The study reveals that LLMs likely use function composition based on pre-learned primitive tasks to derive new tasks, suggesting a shift in understanding toward structured task composition for in-context learning.

In-Context Learning in LLMs: Task Composition Hypothesis

This essay examines the empirical and theoretical investigation presented by Jiaoda Li, Yifan Hou, Mrinmaya Sachan, and Ryan Cotterell regarding the in-context learning (ICL) capabilities of LLMs. The study titled "What Do Language Models Learn in Context? The Structured Task Hypothesis" explores and evaluates three hypotheses explaining how LLMs manage to learn novel tasks given in-context demonstrations. The authors systematically test the task selection, meta-learning, and structured task composition hypotheses using a series of rigorously designed experiments.

Investigated Hypotheses

  1. Task Selection Hypothesis: This hypothesis posits that LLMs recognize the task from the given demonstration and map it to a task already learned during pre-training.
  2. Meta-Learning Hypothesis: This approach suggests that LLMs learn generalizable algorithms during pre-training, which they then apply to solve the given in-context tasks.
  3. Task Composition Hypothesis: This hypothesis suggests that LLMs use the demonstration to compose a series of pre-learned tasks into a novel task not explicitly encountered during pre-training.

Experimental Design and Results

The authors implemented an experimental approach using text classification tasks on well-known datasets, including Customer Reviews (CR), Stanford Sentiment Treebank (SST-2), and AG News, among others. The experiments used the LLaMA2 model at three parameter scales (7B, 13B, and 70B). Multiple experimental settings were designed to validate or invalidate the competing hypotheses.
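
To make the setup concrete, below is a minimal sketch of how an in-context classification prompt can be assembled from demonstration pairs. The template, verbalizer, and example reviews are illustrative assumptions and not the paper's exact prompts.

```python
# Minimal sketch of assembling an in-context classification prompt.
# The template, verbalizer, and examples are illustrative, not the paper's setup.

def build_icl_prompt(demonstrations, query, verbalizer):
    """Concatenate labeled demonstrations followed by an unlabeled query."""
    lines = []
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {verbalizer[label]}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [
    ("A delightful, heartfelt film.", 1),
    ("Dull and far too long.", 0),
]
verbalizer = {0: "negative", 1: "positive"}
print(build_icl_prompt(demos, "Sharp writing and great pacing.", verbalizer))
```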

Testing the Task Selection Hypothesis:

  • Response-Altered ICL: The responses (labels) in the demonstration pairs were altered to form a novel task, a construction sketched below. If the task selection hypothesis holds, model performance should be close to random guessing, since the novel task is unobserved during training.
  • Results: The models performed response-altered ICL significantly better than random guessing, which invalidates the task selection hypothesis.
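
The following Python sketch illustrates the kind of response alteration described above: original labels are remapped to arbitrary novel tokens, producing a labeling function the model is unlikely to have encountered verbatim during pre-training. The replacement tokens and helper names are illustrative assumptions, not the paper's actual choices.

```python
# Sketch of a response-altered demonstration: original labels are replaced
# with arbitrary novel tokens, yielding a novel labeling task.
# The replacement tokens below are illustrative assumptions.

import random

def alter_responses(demonstrations, novel_tokens, seed=0):
    """Map each original label to a fixed, randomly chosen novel token."""
    rng = random.Random(seed)
    labels = sorted({label for _, label in demonstrations})
    mapping = dict(zip(labels, rng.sample(novel_tokens, len(labels))))
    return [(text, mapping[label]) for text, label in demonstrations], mapping

demos = [("Great acting.", "positive"), ("A boring mess.", "negative")]
altered, mapping = alter_responses(demos, ["foo", "bar", "baz"])
print(mapping)   # e.g. {'negative': 'foo', 'positive': 'baz'}
print(altered)
```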

Testing the Meta-Learning Hypothesis:

  • Prompt-Altered ICL: This test replaced the prompts in the demonstration with adversarially altered equivalents, under the assumption that a meta-learned algorithm would generalize irrespective of the specific prompts.
  • Comparative Baseline: The model was benchmarked against logistic regression applied to token embeddings (a baseline of this kind is sketched below).
  • Results: Prompt-altered ICL consistently performed poorly, often worse than logistic regression and clearly worse than response-altered ICL, providing evidence against the meta-learning hypothesis.
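
For intuition, here is a rough sketch of a logistic-regression-over-embeddings baseline of the kind mentioned above: fit a linear classifier on fixed embeddings of the demonstration pairs, then score the query. The encoder and library choices (sentence-transformers, scikit-learn) are assumptions for illustration and not necessarily the authors' exact setup.

```python
# Sketch of a logistic-regression baseline over sentence embeddings.
# Encoder and libraries are illustrative assumptions, not the paper's setup.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

demo_texts = ["Great acting.", "A boring mess.", "Loved every minute.", "Painfully slow."]
demo_labels = [1, 0, 1, 0]

# Fit the classifier on the demonstration embeddings, then score a new query.
clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(demo_texts), demo_labels)
print(clf.predict(encoder.encode(["Sharp writing and great pacing."])))
```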

Testing the Task Composition Hypothesis:

  • Composable Tasks: Support for this hypothesis comes from the models performing well on composed tasks, with performance degrading gradually as composition complexity increases.
  • Manual vs. Random Mapping: Hand-crafted natural mappings, such as synonyms and antonyms, were learned in context significantly better than arbitrary mappings.
  • Higher-Order Synonyms: As the order of synonym composition increased (see the sketch below), performance deteriorated yet stayed significantly better than with arbitrary mappings, supporting the structured task composition hypothesis.
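
As a toy illustration of higher-order composition, the sketch below chains word-level label mappings so that an n-th order mapping is the composition of n simpler ones. The specific synonym chains are hand-picked examples, not the paper's lists.

```python
# Toy illustration of composing label mappings: a higher-order mapping is
# built by chaining simpler ones (label -> synonym -> synonym-of-synonym).
# The synonym chains are illustrative examples only.

FIRST_ORDER = {"positive": "good", "negative": "bad"}
SECOND_ORDER = {"good": "fine", "bad": "poor"}

def compose(*mappings):
    """Compose label mappings left to right: later maps apply to earlier outputs."""
    def apply(label):
        for m in mappings:
            label = m.get(label, label)
        return label
    return apply

order2 = compose(FIRST_ORDER, SECOND_ORDER)
print(order2("positive"))  # fine
print(order2("negative"))  # poor
```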

Implications

The findings from these experiments have significant implications for understanding the mechanism underpinning in-context learning within LLMs. The empirical evidence against the task selection and meta-learning hypotheses indicates that LLMs do not rely solely on recognizing pre-learned tasks or on deploying generalized learning algorithms at inference time. Instead, the support for the task composition hypothesis suggests that LLMs use a form of function composition over primitive tasks learned during pre-training to derive and learn new tasks.

Future Directions

Given the insights from the structured task composition hypothesis, future research could focus on formalizing the theoretical underpinnings of this task composition mechanism within LLMs. Additionally, extending the scope to richer and more complex compositional tasks could illuminate further intricacies in LLMs' learning processes. Robust formalization will also aid in fine-tuning training regimes to optimize LLMs for more dynamic and sophisticated in-context learning tasks.

In conclusion, this paper provides a foundational understanding that shifts the paradigm from task recognition and meta-learning theories towards a more intricate structured task composition framework. This not only helps in better interpreting the present capabilities of LLMs but also opens avenues for enhancing future models.
