On the Limitations of Compute Thresholds as a Governance Strategy

(arXiv:2407.05694)
Published Jul 8, 2024 in cs.AI, cs.CL, cs.ET, and cs.LG

Abstract

At face value, this essay is about understanding a fairly esoteric governance tool called compute thresholds. However, in order to grapple with whether these thresholds will achieve anything, we must first understand how they came to be. To do so, we need to engage with a decades-old debate at the heart of computer science progress, namely, is bigger always better? Does a certain inflection point of compute result in changes to the risk profile of a model? Hence, this essay may be of interest not only to policymakers and the wider public but also to computer scientists interested in understanding the role of compute in unlocking breakthroughs. This discussion is timely given the wide adoption of compute thresholds in both the White House Executive Orders on AI Safety (EO) and the EU AI Act to identify more risky systems. A key conclusion of this essay is that compute thresholds, as currently implemented, are shortsighted and likely to fail to mitigate risk. The relationship between compute and risk is highly uncertain and rapidly changing. Relying upon compute thresholds overestimates our ability to predict what abilities emerge at different scales. This essay ends with recommendations for a better way forward.

[Figure: Challenges in setting domain-specific compute thresholds, including risks, overhead, and the potential for gaming.]

Overview

  • The paper investigates the viability of compute thresholds as a governance strategy for AI, examining whether computational power alone can predict model risk and exploring the limitations of this approach.

  • Empirical evidence presented shows that the relationship between compute and model performance is highly variable and influenced by factors such as data quality, optimization techniques, and architectural innovations.

  • Recommendations include using dynamic thresholds, developing composite risk indices, standardizing FLOP measurement, and clearly articulating specific risks to improve AI governance.


Introduction

The paper "On the Limitations of Compute Thresholds as a Governance Strategy," authored by Sara Hooker, investigates the emerging governance tool known as compute thresholds. This governance approach has garnered attention in the regulation of AI, particularly in the context of generative AI technologies. Compute thresholds, which measure the computational power used to train models, have been proposed as markers to delineate models with higher potential harm. This method is encoded in several policies, including the White House Executive Orders on AI Safety and the EU AI Act.

The paper poses several questions aimed at evaluating the viability of compute thresholds as a governance metric:

  1. Is compute, as measured by floating-point operations (FLOP), a meaningful metric to estimate model risk?
  2. Are hard-coded thresholds an effective tool to mitigate this risk?
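
To make the first question concrete, here is a minimal sketch (in Python) of how training FLOP is commonly estimated and compared against the statutory thresholds. The 6·N·D rule of thumb (roughly 6 FLOP per parameter per training token) is a widely used approximation rather than the paper's own method, and the parameter and token counts below are hypothetical:

```python
def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOP via the common 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens

EO_THRESHOLD = 1e26  # White House EO reporting threshold (general-purpose models)
EU_THRESHOLD = 1e25  # EU AI Act presumption of systemic risk

# Hypothetical frontier model: 400B parameters trained on 15T tokens.
flop = estimate_training_flop(400e9, 15e12)
print(f"Estimated training FLOP: {flop:.2e}")              # ~3.60e+25
print(f"Over EU AI Act threshold? {flop > EU_THRESHOLD}")  # True
print(f"Over EO threshold?        {flop > EO_THRESHOLD}")  # False
```

The same model can thus sit on different sides of the two statutory lines, and, as the findings below argue, neither line tracks risk reliably.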

Critical Findings

The Evolving Relationship Between Compute and Risk

The paper challenges the assumption that increased compute directly correlates with increased risk. Hooker presents empirical evidence showing that the relationship between compute and model performance is highly uncertain and rapidly evolving. Several factors contribute to this complexity:

  • Data Quality: Enhanced data curation techniques such as de-duplication and data pruning can reduce the compute needed to reach a given level of performance (a minimal de-duplication sketch follows this list).
  • Optimization Breakthroughs: Techniques such as model distillation, retrieval-augmented generation, and preference training can improve performance without proportional increases in compute.
  • Architecture Innovations: Architectural shifts, such as the moves to convolutional neural networks (CNNs) and transformers, show that significant performance improvements can arrive independently of compute scaling.
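
As a concrete illustration of the data-quality point, below is a minimal exact-match de-duplication sketch. Production pipelines typically add near-duplicate detection (e.g., MinHash), and the toy corpus here is invented:

```python
import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["The cat sat.", "the cat sat.  ", "A different document."]
print(deduplicate(corpus))  # ['The cat sat.', 'A different document.']
```

Fewer, higher-quality training tokens mean fewer training FLOP for the same downstream quality, which is one reason compute alone is a noisy proxy for capability.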

Challenges in Measuring FLOP

The paper outlines several practical challenges in using FLOP as a reliable metric:

  • Inference-Time Compute: Many techniques for enhancing performance, such as chain-of-thought reasoning and tool use, spend substantial compute at inference time, which training FLOP does not capture.
  • Model Lifecycle Variability: Compute is spent unevenly across pre-training, fine-tuning, and model distillation, making it difficult to standardize what counts toward a threshold.
  • Architectural Variations: Models such as Mixtures of Experts (MoEs) and classical ensembles introduce further ambiguity into FLOP accounting (contrasted in the sketch below).
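
To see why architectural variations complicate FLOP accounting, the sketch below contrasts per-token training compute for a dense model and a Mixture-of-Experts model with the same total parameter count. The counts and the 20% active fraction are illustrative assumptions, not figures from the paper:

```python
def dense_flop_per_token(n_params: float) -> float:
    # Standard approximation: ~6 FLOP per parameter per training token.
    return 6.0 * n_params

def moe_flop_per_token(total_params: float, active_fraction: float) -> float:
    # Only the routed experts' parameters are exercised for each token.
    return 6.0 * total_params * active_fraction

dense = dense_flop_per_token(100e9)      # 100B-parameter dense model
moe = moe_flop_per_token(100e9, 0.2)     # 100B total parameters, 20% active

print(f"Dense: {dense:.1e} FLOP/token")  # 6.0e+11
print(f"MoE:   {moe:.1e} FLOP/token")    # 1.2e+11
# A rule that counts total parameters treats both models identically,
# even though the MoE spends ~5x less compute per token.
```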

Predictive Limitations of Scaling Laws

The paper scrutinizes the predictive limitations of existing scaling laws, which traditionally attempt to forecast model performance based on compute. Evidence is provided showing that these laws often fail to generalize to downstream performance, highlighting the unpredictability of emergent properties. This unpredictability challenges the rationale for hard-coded compute thresholds set at 10^25 or 10^26 FLOP.
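
A small, self-contained illustration of this point, using synthetic data rather than results from the paper: fitting a power law to upstream loss yields a smooth curve that extrapolates cleanly, yet such a fit is structurally unable to predict a discontinuous jump in a downstream capability:

```python
import numpy as np

# Synthetic small-scale runs: loss(C) = a * C**(-b) by construction.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOP
loss = 50.0 * compute ** -0.05                # upstream loss

# Linear fit in log-log space: log(loss) = log(a) - b * log(C).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

for c in (1e25, 1e26):
    print(f"Extrapolated loss at {c:.0e} FLOP: {a * c ** -b:.3f}")
# The curve is smooth by construction; it cannot tell us at which scale a
# specific downstream ability (or risk) will suddenly appear.
```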

Recommendations

Hooker makes several recommendations to address these limitations:

  1. Dynamic Thresholds: Proposes thresholds that adjust each year, pegged to a percentile of the compute distribution of newly released models, so that the trigger keeps pace with the rapidly evolving compute-performance relationship (sketched after this list).
  2. Composite Risk Indices: Advocates for the creation of risk indices composed of multiple measures beyond compute, such as benchmarks for specific risks (e.g., cybersecurity, bio-security).
  3. Standardization of FLOP Measurement: Urges the development of clear standards for measuring FLOP, covering quantization levels, hardware-specific variations, and architectural nuances.
  4. Specificity in Risk Articulation: Recommends governments clearly communicate the specific risks they aim to mitigate, thus enabling more focused and effective policy frameworks.
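
As a rough sketch of how the first recommendation could work in practice, the snippet below recomputes a threshold as a percentile of the training-compute distribution of models released in a given year. The release data and the 95th-percentile choice are illustrative assumptions:

```python
import numpy as np

def dynamic_threshold(training_flop_this_year: list[float],
                      percentile: float = 95.0) -> float:
    """FLOP value at the given percentile of this year's model releases."""
    return float(np.percentile(training_flop_this_year, percentile))

releases_2024 = [3e22, 8e22, 2e23, 5e23, 1e24, 4e24, 9e24, 2e25]  # hypothetical
print(f"95th-percentile threshold: {dynamic_threshold(releases_2024):.2e} FLOP")
# Recomputed annually, the trigger tracks the moving frontier rather than
# freezing a constant like 1e25 or 1e26 FLOP into law.
```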

Implications and Future Directions

The introduction of compute thresholds reflects an increasing urgency to pre-emptively manage the risks presented by powerful AI models. However, the findings in this paper highlight the need for a more nuanced and adaptive approach.

Pragmatically, this research invites further refinement of governance strategies that are empirically grounded and adequately flexible to handle the dynamic nature of AI advancements. Theoretically, it underscores the inadequacy of purely computational metrics in capturing the complex risk profiles of modern AI systems.

Future developments in AI governance should consider composite and adaptive metrics that better capture the multifaceted dimensions of risk. This approach would help ensure that policies remain relevant and effective in mitigating both current and future AI risks.

Conclusion

Sara Hooker's paper provides a thorough critique of compute thresholds as a governance strategy, demonstrating the complexities and limitations of using static compute metrics to estimate and mitigate AI risks. While compute remains a crucial facet of AI development, a shift towards more dynamic, multifaceted approaches is essential to achieve effective and meaningful governance.
