- The paper challenges the assumption that higher compute directly correlates with increased AI risk, arguing that advances in data quality and model design have weakened the link between raw compute and capability.
- It uncovers key challenges in using FLOP as a metric, including measurement difficulties across different training stages and model architectures.
- The study recommends dynamic thresholds and composite risk indices to foster more adaptive and effective AI governance policies.
On the Limitations of Compute Thresholds as a Governance Strategy
Introduction
The paper "On the Limitations of Compute Thresholds as a Governance Strategy," authored by Sara Hooker, investigates the emerging governance tool known as compute thresholds. This governance approach has garnered attention in the regulation of AI, particularly in the context of generative AI technologies. Compute thresholds, which measure the computational power used to train models, have been proposed as markers to delineate models with higher potential harm. This method is encoded in several policies, including the White House Executive Orders on AI Safety and the EU AI Act.
The paper poses several questions aimed at evaluating the viability of compute thresholds as a governance metric:
- Is compute, as measured by floating-point operations (FLOP), a meaningful metric to estimate model risk? (A sketch of how training FLOP is commonly estimated follows this list.)
- Are hard-coded thresholds an effective tool to mitigate this risk?
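For context on the first question, training compute is often estimated with the "6ND" rule of thumb from the scaling-law literature: roughly 6 FLOP per parameter per training token. The sketch below uses hypothetical numbers and is a back-of-the-envelope approximation, not the paper's methodology; real accounting varies with architecture, precision, and training setup.

```python
# Rough training-compute estimate: FLOP ≈ 6 * N * D, where N is the
# parameter count and D is the number of training tokens
# (~2 FLOP/param/token for the forward pass, ~4 for the backward pass).

def estimate_training_flop(num_parameters: float, num_tokens: float) -> float:
    return 6 * num_parameters * num_tokens

# Hypothetical example: a 70B-parameter model trained on 2T tokens.
flop = estimate_training_flop(70e9, 2e12)
print(f"~{flop:.1e} FLOP")  # ~8.4e+23 FLOP -- under a 10^25 cutoff
```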
Critical Findings
The Evolving Relationship Between Compute and Risk
The paper challenges the assumption that increased compute directly correlates with increased risk. Hooker presents empirical evidence showing that the relationship between compute and model performance is highly uncertain and rapidly evolving. Several factors contribute to this complexity:
- Data Quality: Enhanced data curation techniques such as de-duplication and data pruning can reduce reliance on compute (a minimal de-duplication sketch follows this list).
- Optimization Breakthroughs: Techniques such as model distillation, retrieval-augmented generation, and preference training can improve performance without proportional increases in compute.
- Architecture Innovations: Shifts in model architecture, such as the adoption of convolutional neural networks (CNNs) and later transformers, demonstrate that significant performance improvements can occur independently of compute scaling.
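To illustrate the data-quality point, exact de-duplication can shrink a training corpus, and therefore the compute needed to traverse it, without sacrificing quality. This is a minimal hash-based sketch with toy data; production pipelines typically add near-duplicate detection (e.g., MinHash) on top.

```python
import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each exact document."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["the cat sat", "a dog ran", "the cat sat"]
print(deduplicate(corpus))  # ['the cat sat', 'a dog ran']
```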
Challenges in Measuring FLOP
The paper outlines several practical challenges in using FLOP as a reliable metric:
- Inference-Time Compute: Many approaches to enhancing performance, such as chain-of-thought reasoning and tool use, incur substantial compute at inference time that training FLOP does not capture.
- Model Lifecycle Variability: The compute spent across different stages—pre-training, fine-tuning, and model distillation—is uneven and difficult to standardize.
- Architectural Variations: Architectures such as Mixture-of-Experts (MoE) models, and techniques such as classical ensembling, further complicate FLOP accounting (see the sketch after this list).
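To see why MoE architectures complicate FLOP accounting, consider per-token compute under two bookkeeping conventions. The figures below are invented for illustration, not drawn from the paper or any real model.

```python
# Only a subset of an MoE model's parameters is active per token, so
# "total-parameter" and "active-parameter" FLOP estimates diverge.

def forward_flops_per_token(active_params: float) -> float:
    # ~2 FLOP per active parameter per token for a forward pass.
    return 2 * active_params

total_params = 8 * 100e9   # hypothetical: 8 experts, 100B params each
active_params = 2 * 100e9  # router activates 2 experts per token

print(f"counting all parameters:    {forward_flops_per_token(total_params):.1e}")
print(f"counting active parameters: {forward_flops_per_token(active_params):.1e}")
# A 4x gap -- which figure should a regulator compare to a threshold?
```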
Predictive Limitations of Scaling Laws
The paper scrutinizes the predictive limitations of existing scaling laws, which traditionally attempt to forecast model performance based on compute. Evidence is provided showing that these laws often fail to generalize to downstream performance, highlighting the unpredictability of emergent properties. This unpredictability challenges the rationale for hard-coded compute thresholds set at 10^25 or 10^26 FLOP.
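The gap can be seen in a toy scaling-law fit. Upstream loss often follows a smooth power law in compute that extrapolates cleanly in log-log space, yet that smooth curve says nothing about whether a specific downstream capability emerges at a given scale. The numbers below are synthetic, not results from the paper.

```python
import numpy as np

compute = np.array([1e20, 1e21, 1e22, 1e23])  # training FLOP
loss = 5.0 * compute ** -0.05                 # synthetic power law L(C) = a * C^(-b)

# Fit log(loss) = log(a) - b * log(C) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"fitted: L(C) ≈ {a:.2f} * C^(-{b:.3f})")

# Extrapolating the loss curve to a threshold-scale run is easy;
# predicting emergent downstream abilities from it is not.
print(f"predicted loss at 1e25 FLOP: {a * 1e25 ** -b:.3f}")
```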
Recommendations
Hooker makes several recommendations to address these limitations:
- Dynamic Thresholds: Proposes dynamic thresholds that adjust based on a percentile of the distribution of model properties released each year, to account for the rapid evolution in compute-performance relationships (a sketch follows this list).
- Composite Risk Indices: Advocates for risk indices composed of multiple measures beyond compute, such as benchmarks for specific risks (e.g., cybersecurity, biosecurity).
- Standardization of FLOP Measurement: Urges the development of clear standards for measuring FLOP, covering quantization levels, hardware-specific variations, and architectural nuances.
- Specificity in Risk Articulation: Recommends governments clearly communicate the specific risks they aim to mitigate, thus enabling more focused and effective policy frameworks.
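As one way to make the dynamic-threshold recommendation concrete, the sketch below sets each year's cutoff at a percentile of the compute used by that year's model releases, so the threshold tracks the moving frontier. The percentile choice and all figures are hypothetical, not values proposed in the paper.

```python
import numpy as np

# Hypothetical training-FLOP values for models released in one year.
releases = np.array([3e22, 8e22, 2e23, 5e23, 1e24, 4e24, 2e25])

def dynamic_threshold(release_flop: np.ndarray, percentile: float = 95.0) -> float:
    """Set the year's threshold at the given percentile of released models."""
    return float(np.percentile(release_flop, percentile))

threshold = dynamic_threshold(releases)
flagged = releases[releases >= threshold]
print(f"this year's threshold: {threshold:.2e} FLOP; flagged models: {flagged}")
```

A composite risk index could be built in the same spirit: normalize several measures (compute alongside benchmark scores for cybersecurity or biosecurity risks) and combine them, rather than relying on compute alone.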
Implications and Future Directions
The introduction of compute thresholds reflects an increasing urgency to pre-emptively manage the risks presented by powerful AI models. However, the findings in this paper highlight the need for a more nuanced and adaptive approach.
Pragmatically, this research invites further refinement of governance strategies that are empirically grounded and adequately flexible to handle the dynamic nature of AI advancements. Theoretically, it underscores the inadequacy of purely computational metrics in capturing the complex risk profiles of modern AI systems.
Future developments in AI governance should consider composite and adaptive metrics that better capture the multifaceted dimensions of risk. This approach would help ensure that policies remain relevant and effective in mitigating both current and future AI risks.
Conclusion
Sara Hooker's paper provides a thorough critique of compute thresholds as a governance strategy, demonstrating the complexities and limitations of using static compute metrics to estimate and mitigate AI risks. While compute remains a crucial facet of AI development, a shift towards more dynamic, multifaceted approaches is essential to achieve effective and meaningful governance.