Recent work claims that LLMs display emergent abilities: abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is twofold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test, and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test, and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen, seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.
The paper challenges the concept of emergent abilities in LLMs, suggesting these perceived abilities are artifacts of the metrics used.
It argues that nonlinear or discontinuous metrics distort the true, continuous performance improvements of LLMs, creating a false impression of sudden advancement.
Through experiments and a meta-analysis, the study demonstrates that switching to linear or continuous metrics reveals smooth, predictable improvements.
The findings advocate for standardized, transparent metric selection in AI research to more accurately capture the gradual improvements in LLMs.
Research on LLMs has centered on the notion of emergent abilities — sudden and unpredictable enhancements in performance on specific tasks as model scale increases. These abilities have been deemed unpredictable and sharp, fundamentally altering our perceptions of how LLMs advance with scale. However, recent work proposes a different viewpoint that challenges the narrative of intrinsic emergent abilities within LLMs. This paper suggests that what has been interpreted as emergent abilities may, in fact, be an artifact of the metrics researchers choose to measure performance.
The central thesis of this paper is that the perceived emergent abilities in LLMs are not inherent properties of the model's sophistication or scale but are instead a byproduct of the application of nonlinear or discontinuous metrics by researchers. This argument is supported by a detailed mathematical model demonstrating how smooth and predictable improvements in LLM performance can be misconstrued as sudden emergent abilities through specific metric choices. These metrics, when applied, deform the true continuous performance improvements into seemingly unpredictable leaps in abilities, creating a mirage of emergence.
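The mechanism can be illustrated with a minimal numerical sketch. Note this is an illustrative toy, not the paper's exact model: the functional form of `per_token_accuracy` and its constants are hypothetical. If per-token accuracy improves smoothly with parameter count, a nonlinear metric such as exact match over a multi-token answer (requiring every token to be correct) still appears to jump abruptly from "absent" to "present":

```python
import numpy as np

def per_token_accuracy(n_params, c=1e3, alpha=0.4):
    """Hypothetical per-token success probability that improves
    smoothly with parameter count (illustrative power-law-style form)."""
    return np.exp(-c * n_params ** -alpha)

seq_len = 5  # score the task by exact match over a 5-token answer
scales = np.logspace(6, 12, 7)  # 1M to 1T parameters

for n in scales:
    p = per_token_accuracy(n)
    token_acc = p            # linear metric: fraction of tokens correct
    exact_match = p ** seq_len  # nonlinear metric: all tokens must be correct
    print(f"N={n:9.0e}  token_acc={token_acc:.3f}  exact_match={exact_match:.3f}")
```

Under this toy model, token-level accuracy climbs gradually across scales, while exact-match accuracy stays near zero until large scales and then rises steeply: the same underlying smooth improvement, read through a nonlinear metric, looks like emergence.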
Three main points elucidate this thesis: (1) nonlinear or discontinuous metrics can transform smooth, continuous improvements in underlying model performance into apparent discontinuities; (2) linear or continuous metrics applied to the same fixed model outputs instead reveal smooth, predictable changes with scale; and (3) a simple mathematical model suffices to show how such metric choices manufacture seemingly emergent behavior.
To substantiate the alternative explanation, the study undertakes three complementary analyses: (1) making, testing, and confirming three predictions about the effect of metric choice on tasks with claimed emergent abilities, using the InstructGPT/GPT-3 model family; (2) making, testing, and confirming two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) showing how to choose metrics that produce never-before-seen, seemingly emergent abilities in multiple vision tasks across diverse deep networks.
The implications of this research are profound, offering a pivotal reevaluation of what constitutes emergent abilities in LLMs. Theoretically, it challenges the existing narrative by highlighting the impact of measurement choices on the interpretation of model capabilities. Practically, it advises caution in declaring emergent phenomena without meticulously considering the influence of evaluation metrics. Furthermore, the study paves the way for more standardized approaches to performance measurement in AI research, advocating transparency in metric selection and suggesting a shift toward continuous, linear metrics to accurately capture the gradual improvements in LLMs.
Speculating on future developments, this research invites a wider discourse on the methodologies employed to assess and report the capabilities of AI models. It encourages the exploration of alternative frameworks that can more accurately reflect the incremental nature of advancements in model performance. Moreover, such insights call for collaborative efforts to standardize metrics and methodologies across AI research, ensuring that discoveries are genuinely reflective of advancements rather than artifacts of analysis.
In conclusion, this paper presents a compelling case that purported emergent abilities in LLMs are highly dependent on the metrics employed, challenging the community to reassess the foundational understanding of how LLMs evolve with scale.