Dual Operating Modes of In-Context Learning (2402.18819v2)

Published 29 Feb 2024 in cs.LG

Abstract: In-context learning (ICL) exhibits dual operating modes: task learning, i.e., acquiring a new skill from in-context samples, and task retrieval, i.e., locating and activating a relevant pretrained skill. Recent theoretical work investigates various mathematical models to analyze ICL, but existing models explain only one operating mode at a time. We introduce a probabilistic model, with which one can explain the dual operating modes of ICL simultaneously. Focusing on in-context learning of linear functions, we extend existing models for pretraining data by introducing multiple task groups and task-dependent input distributions. We then analyze the behavior of the optimally pretrained model under the squared loss, i.e., the MMSE estimator of the label given in-context examples. Regarding pretraining task distribution as prior and in-context examples as the observation, we derive the closed-form expression of the task posterior distribution. With the closed-form expression, we obtain a quantitative understanding of the two operating modes of ICL. Furthermore, we shed light on an unexplained phenomenon observed in practice: under certain settings, the ICL risk initially increases and then decreases with more in-context examples. Our model offers a plausible explanation for this "early ascent" phenomenon: a limited number of in-context samples may lead to the retrieval of an incorrect skill, thereby increasing the risk, which will eventually diminish as task learning takes effect with more in-context samples. We also theoretically analyze ICL with biased labels, e.g., zero-shot ICL, where in-context examples are assigned random labels. Lastly, we validate our findings and predictions via experiments involving Transformers and LLMs.

Summary

The paper quantifies the dual modes of in-context learning by modeling pretraining data with a Gaussian mixture to rigorously analyze task learning versus task retrieval.
It explains the early ascent phenomenon, showing that a limited number of in-context examples can initially trigger higher risk before more data enables accurate task learning.
The study predicts bounded efficacy in biased-label ICL: performance first improves through task retrieval, then degrades once enough biased-label examples push the model into task learning.

Understanding the Dual Operating Modes of In-Context Learning Through Probabilistic Modelling

In-context learning (ICL) lets pretrained LLMs adapt to a task from a few examples in the prompt. Based on the provided in-context samples, a model can either acquire a new skill or locate and activate a relevant pretrained one. These two behaviors are the dual operating modes of ICL: task learning and task retrieval.

The Study on Dual Operating Modes of ICL

The paper explores the dynamics of these dual modes by proposing a probabilistic model for in-context learning of linear functions. Central to the approach is modeling the pretraining task distribution as a Gaussian mixture with multiple task groups and task-dependent input distributions, a choice intended to reflect the clustered nature of real-world data better than the single Gaussian distribution assumed in earlier work. Under this model, the optimally pretrained next-token predictor under squared loss is the MMSE estimator of the label, which amounts to Bayesian inference: the pretraining task distribution acts as the prior and the in-context examples as the observation.
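To make the data model concrete, here is a minimal sketch of how one pretraining sequence could be drawn from a Gaussian mixture over tasks. It is a scalar simplification, not the paper's exact construction: the function names, the `(input_center, w_center)` layout of each component, and all numeric values are assumptions chosen for illustration.

```python
import random

def sample_task(components, mix_weights, task_std, rng):
    """Pick a task group, then draw a task near that group's center.

    Each component is a hypothetical (input_center, w_center) pair:
    the group's typical input location and typical linear coefficient.
    """
    idx = rng.choices(range(len(components)), weights=mix_weights, k=1)[0]
    input_center, w_center = components[idx]
    input_mean = rng.gauss(input_center, task_std)  # task-dependent input distribution
    w = rng.gauss(w_center, task_std)               # task-specific linear function
    return idx, input_mean, w

def sample_sequence(input_mean, w, k, input_std, noise_std, rng):
    """Draw k (x, y) pairs with y = w * x + Gaussian noise."""
    xs = [rng.gauss(input_mean, input_std) for _ in range(k)]
    ys = [w * x + rng.gauss(0.0, noise_std) for x in xs]
    return list(zip(xs, ys))

rng = random.Random(0)
components = [(-2.0, -1.0), (2.0, 1.0)]  # two illustrative task groups
idx, input_mean, w = sample_task(components, [0.5, 0.5], task_std=0.1, rng=rng)
sequence = sample_sequence(input_mean, w, k=4, input_std=1.0, noise_std=0.1, rng=rng)
print(idx, sequence)
```

A small `task_std` makes the tasks cluster tightly around the group centers, which is the regime where retrieving a pretrained skill is most informative.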

Key Insights and Contributions

Quantitative Understanding of Dual Modes

By modeling the pretraining data explicitly and analyzing the optimally pretrained model under squared loss, the paper gives a quantitative account of the task learning and task retrieval modes in ICL. The closed-form task posterior shows that in-context examples act on the prior in two ways: Component Shifting, which moves the center of each mixture component, and Component Re-weighting, which changes the mixture weights of the components.
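The two effects can be seen in a standard conjugate-Gaussian calculation. The block below is a scalar sketch under simplifying assumptions that are not the paper's full setting: a single coefficient w, inputs whose distribution does not depend on the task, and a shared variance for all components. The symbols are chosen here for illustration.

```latex
% Assumed toy setting: prior over a scalar task w, and k in-context examples
%   w \sim \sum_{m} \pi_m \, \mathcal{N}(\mu_m, \sigma_w^2), \qquad
%   y_i = w x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2).
\begin{align}
  p(w \mid x_{1:k}, y_{1:k})
    &= \sum_{m} \tilde{\pi}_m \, \mathcal{N}\!\left(w;\ \tilde{\mu}_m,\ \tilde{\sigma}^2\right), \\
  \tilde{\sigma}^2
    &= \left( \frac{1}{\sigma_w^2} + \frac{1}{\sigma^2} \sum_{i=1}^{k} x_i^2 \right)^{-1}, \\
  \tilde{\mu}_m
    &= \tilde{\sigma}^2 \left( \frac{\mu_m}{\sigma_w^2} + \frac{1}{\sigma^2} \sum_{i=1}^{k} x_i y_i \right)
    && \text{(component shifting)}, \\
  \tilde{\pi}_m
    &\propto \pi_m \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{k} \left(y_i - \tilde{\mu}_m x_i\right)^2
       - \frac{\left(\tilde{\mu}_m - \mu_m\right)^2}{2\sigma_w^2} \right)
    && \text{(component re-weighting)}, \\
  \hat{y}(x_{\mathrm{query}})
    &= x_{\mathrm{query}} \sum_{m} \tilde{\pi}_m \, \tilde{\mu}_m
    && \text{(MMSE prediction)}.
\end{align}
```

With no examples the posterior equals the prior. With few examples the weights concentrate on the best-matching component while its center barely moves, which corresponds to retrieval. With many examples every shifted center approaches the least-squares fit, which corresponds to learning.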

Explaining the Early Ascent Phenomenon

The paper sheds light on the "early ascent" phenomenon observed with LLMs, in which ICL risk first rises and then falls as the number of in-context samples grows. Its explanation is that a small number of in-context samples can lead the model to retrieve an incorrect skill, which raises the risk. As more in-context examples are included, task learning takes over and the risk diminishes.
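The rise-then-fall shape can be reproduced in a toy Bayesian calculation. The sketch below is an illustration under strong assumptions, not the paper's construction: the task is a single scalar coefficient, inputs are fixed at 1.0, labels are noise-free, and the paper's task-dependent input distributions are omitted. All names and constants are hypothetical.

```python
import math

def mmse_weight(means, weights, task_std, noise_std, xs, ys):
    """Posterior-mean coefficient under a Gaussian-mixture prior (scalar case)."""
    prior_prec = 1.0 / task_std ** 2
    noise_prec = 1.0 / noise_std ** 2
    post_prec = prior_prec + noise_prec * sum(x * x for x in xs)
    xy = sum(x * y for x, y in zip(xs, ys))
    # Component shifting: each center moves toward the data.
    post_means = [(prior_prec * mu + noise_prec * xy) / post_prec for mu in means]
    # Component re-weighting: log-weights up to a shared constant.
    log_w = []
    for pi, mu, pm in zip(weights, means, post_means):
        fit = sum((y - pm * x) ** 2 for x, y in zip(xs, ys))
        log_w.append(math.log(pi) - 0.5 * noise_prec * fit
                     - 0.5 * prior_prec * (pm - mu) ** 2)
    top = max(log_w)
    unnorm = [math.exp(v - top) for v in log_w]
    total = sum(unnorm)
    return sum(u / total * pm for u, pm in zip(unnorm, post_means))

MEANS, WEIGHTS = [-1.0, 1.0], [0.5, 0.5]  # two pretrained skills (assumed)
TASK_STD, NOISE_STD = 0.1, 1.0            # tight clusters, noisy labels (assumed)
TRUE_W = 0.2                              # target lies between the two skills

def squared_error(k):
    """Squared error of the estimated coefficient after k in-context examples."""
    xs = [1.0] * k
    ys = [TRUE_W * x for x in xs]
    w_hat = mmse_weight(MEANS, WEIGHTS, TASK_STD, NOISE_STD, xs, ys)
    return (w_hat - TRUE_W) ** 2

for k in (0, 5, 5000):
    print(k, squared_error(k))
```

In this setup the error with five examples is larger than with none, because the posterior snaps to the nearest pretrained skill at +1, which is farther from the target than the prior mean of 0. With thousands of examples the shifted centers converge to the target and the error drops below its starting value.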

Predicted Bounded Efficacy of Biased-Label ICL

The analysis also predicts a "bounded efficacy" phenomenon for ICL with biased labels, for example zero-shot ICL, where in-context examples are assigned random labels. Such prompts are initially effective because they trigger task retrieval, but performance is predicted to degrade once the number of in-context examples passes a certain threshold and the task learning mode becomes dominant.
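A toy calculation shows why the benefit is bounded. The sketch below rests on assumptions that differ from the paper's setting: a scalar task, inputs fixed at 1.0, and labels that are systematically scaled (coefficient 0.6 instead of the desired pretrained skill at 1.0) rather than random. Names and constants are hypothetical.

```python
import math

def mmse_weight(means, weights, task_std, noise_std, xs, ys):
    """Posterior-mean coefficient under a Gaussian-mixture prior (scalar case)."""
    prior_prec = 1.0 / task_std ** 2
    noise_prec = 1.0 / noise_std ** 2
    post_prec = prior_prec + noise_prec * sum(x * x for x in xs)
    xy = sum(x * y for x, y in zip(xs, ys))
    # Shifted component centers.
    post_means = [(prior_prec * mu + noise_prec * xy) / post_prec for mu in means]
    # Re-weighted mixture weights (log-space, shared constant dropped).
    log_w = []
    for pi, mu, pm in zip(weights, means, post_means):
        fit = sum((y - pm * x) ** 2 for x, y in zip(xs, ys))
        log_w.append(math.log(pi) - 0.5 * noise_prec * fit
                     - 0.5 * prior_prec * (pm - mu) ** 2)
    top = max(log_w)
    unnorm = [math.exp(v - top) for v in log_w]
    total = sum(unnorm)
    return sum(u / total * pm for u, pm in zip(unnorm, post_means))

MEANS, WEIGHTS = [-1.0, 1.0], [0.5, 0.5]  # two pretrained skills (assumed)
TASK_STD, NOISE_STD = 0.1, 1.0
TARGET_W = 1.0   # the pretrained skill the prompt is meant to retrieve
LABEL_W = 0.6    # biased labels actually shown in context

def error_to_target(k):
    """Squared distance between the estimate and the desired pretrained skill."""
    xs = [1.0] * k
    ys = [LABEL_W * x for x in xs]
    w_hat = mmse_weight(MEANS, WEIGHTS, TASK_STD, NOISE_STD, xs, ys)
    return (w_hat - TARGET_W) ** 2

for k in (0, 10, 1000):
    print(k, error_to_target(k))
```

Here a handful of biased examples is enough to select the intended skill, so the error relative to that skill falls sharply. With many more examples the estimate follows the biased labels instead, and the error rises again.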

Practical Implications and Future Directions

This research provides a robust foundation for understanding and predicting the behavior of ICL under various settings. By explaining existing phenomena and predicting new ones, it not only enriches our theoretical understanding but also guides practical applications of ICL in leveraging LLMs. Future research could explore extending these insights to non-linear models and considering more complex in-context example distributions, further bridging the gap between theoretical models and real-world applications.
