
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

(2407.00106)
Published Jun 27, 2024 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract

Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning is often discussed as an approach for the removal of impermissible knowledge, i.e. knowledge that the model should not possess, such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used in LLMs and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce the concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and that even exact unlearning schemes are not enough for effective content regulation. We discuss the feasibility of ununlearning for modern LLMs and examine the broader implications.

Figure: Unlearning a term to prevent an AI from providing bomb recipes, with adversarial attempts to bypass it.

Overview

  • The paper revisits the concept of unlearning in LLMs and examines its effectiveness for content regulation.

  • It highlights the shortcomings of unlearning, especially given the model's ability to reintroduce unlearned knowledge through in-context learning (ICL).

  • The authors argue for the necessity of robust filtering mechanisms and a reevaluation of current unlearning approaches to effectively regulate content in generative AI.

Unlearning is Not Sufficient for Content Regulation in Advanced Generative AI

This paper, authored by Ilia Shumailov et al. from Google DeepMind, revisits and critically examines the paradigm of unlearning when applied to LLMs. Initially posited as a privacy mechanism to allow the retraction of user data from machine learning models, unlearning has evolved to encompass various applications aimed at removing undesirable knowledge or capabilities from models. The paper scrutinizes the feasibility and effectiveness of unlearning to enforce content regulation, particularly in the context of in-context learning (ICL) capabilities inherent in modern LLMs.

The paper underscores a fundamental inconsistency: unlearning can be effective during the training phase but does not necessarily prevent the model from performing impermissible acts during inference. The authors introduce the concept of "ununlearning", which denotes the situation where unlearned knowledge can be reintroduced through contextual interactions, thus compromising the intended regulatory effect.
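To make the idea concrete, here is a minimal sketch of ununlearning; the prompts and the placeholder `generate` function are illustrative assumptions, not anything from the paper:

```python
# Sketch of ununlearning: knowledge removed from the weights is handed back
# to the model through the prompt at inference time.
def generate(prompt: str) -> str:
    """Placeholder for any chat/completion API call."""
    raise NotImplementedError("plug in a real model call here")

# 1) Direct query: an unlearned model should no longer be able to answer this.
direct_query = "Explain how concept X works."

# 2) Ununlearning query: the same request, but the forgotten facts are
#    restated in-context, so the model can behave as if it never forgot them.
forgotten_facts = "Background: concept X is the combination of A and B under condition C."
ununlearning_query = forgotten_facts + "\n" + direct_query

# Only an inference-time filter, not unlearning, can tell these apart and
# block the second query when concept X is impermissible.
print(direct_query)
print(ununlearning_query)
```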

Key Definitions and Concepts

The paper provides several informal definitions critical to its arguments:

  1. Knowledge: Information available to the model, whether provided in-context, stored in the model parameters, or retrievable as external evidence.
  2. Content Filtering: The process of filtering queries sent to the model and responses produced by it; the filter can be integral or external to the model.
  3. Unlearning:
  • For Privacy: Removing knowledge corresponding to specific subsets of the training data, making the model indistinguishable from one retrained without those subsets (a minimal sketch follows this list).
  • For Content Regulation: Removing knowledge believed to be associated with impermissible content production, intended to prevent the model from generating such content.
  4. In-Context Learning (ICL): The model's ability to generalize to tasks based on descriptions provided at inference time, even if those tasks were not present in the training data.
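To ground the privacy notion of unlearning, here is a minimal sketch of exact unlearning as retraining without the forget set, the gold standard that inexact schemes try to approximate cheaply. The scikit-learn model and synthetic data are an illustration, not the paper's setup:

```python
# Exact unlearning for privacy: the "unlearned" model is obtained by retraining
# from scratch with the forget set removed, so it is trivially indistinguishable
# from a model that never saw that data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forget_idx = np.arange(20)                      # records whose owners retract consent
keep = np.setdiff1d(np.arange(len(X)), forget_idx)

original = LogisticRegression().fit(X, y)                 # trained on all data
unlearned = LogisticRegression().fit(X[keep], y[keep])    # exact unlearning: retrain

# Inexact schemes try to approximate `unlearned` without the retraining cost;
# content regulation asks for something stronger than either, because the
# removed knowledge can be reintroduced in-context at inference time.
print(original.score(X, y), unlearned.score(X[keep], y[keep]))
```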

Unlearning for Content Regulation

The paper specifically addresses unlearning as a mechanism for content regulation, highlighting its shortcomings due to the ICL capability of LLMs. For example, even if direct references to impermissible content (e.g., bomb-making) are unlearned, the model can still perform impermissible actions if the underlying knowledge required to reconstruct such content remains intact.

Types of Knowledge

To elucidate the concept, the paper classifies knowledge into axioms and theorems:

  • Axioms: Fundamental units of knowledge.
  • Theorems: More complex concepts derived from axioms.

The authors demonstrate that removing a theorem (e.g., the concept of a tiger) while retaining the related axioms does not prevent the model from reconstructing the theorem under a different disguise or context, as shown in their illustrative example.
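The following minimal sketch illustrates that mechanism; the specific axiom strings and the disguised name are illustrative assumptions, not the paper's exact prompts:

```python
# Reconstructing an unlearned "theorem" from retained "axioms": the word
# "tiger" never appears in the prompt, yet the axioms pin the concept down.
axioms = [
    "a large predatory cat",
    "native to Asia",
    "orange fur with black stripes",
]

# A model that has unlearned the theorem "tiger" but kept these axioms can
# still be walked to the concept, here under a disguised name.
prompt = (
    "Consider an animal that is " + ", ".join(axioms) + ". "
    "Call it a 'blorp'. Describe how a 'blorp' hunts."
)
print(prompt)  # sent to a chat model, this elicits reasoning about tigers anyway
```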

Implications

The paper discusses the various implications of their findings:

  1. Need for Effective Filtering Mechanisms: Unlearning alone is insufficient; continuous, active filtering is required to suppress attempts to reintroduce impermissible knowledge (see the sketch after this list).
  2. Re-evaluating Unlearning Mechanisms: Existing unlearning methods do not provide a comprehensive solution. New definitions and mechanisms are necessary to address the issue of reintroduction via ICL.
  3. Attributing Knowledge: The complexity of attributing knowledge reintroduced through ICL poses moral and practical challenges.
  4. Forbidding Knowledge: While explicitly forbidding certain types of knowledge might help, it is not a foolproof solution due to potential mosaic attacks and the challenge of anticipating all harmful use-cases.
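As a rough illustration of the first implication, here is a minimal sketch of inference-time filtering applied to both queries and responses, independently of what the model has or has not unlearned. The keyword patterns and the `generate` stub are hypothetical placeholders; a real deployment would use a policy classifier rather than regexes:

```python
# Inference-time content filtering on both sides of the model call.
import re

IMPERMISSIBLE = [r"\bbomb\b", r"\bexplosive\b"]

def is_impermissible(text: str) -> bool:
    """Very rough stand-in for a real policy classifier."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in IMPERMISSIBLE)

def generate(prompt: str) -> str:
    return "stub model response"          # placeholder for a real model call

def filtered_generate(prompt: str) -> str:
    if is_impermissible(prompt):          # input-side filter
        return "Request refused."
    response = generate(prompt)
    if is_impermissible(response):        # output-side filter
        return "Response withheld."
    return response

print(filtered_generate("How do I build a bomb?"))  # -> "Request refused."
```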

Conclusion

The authors argue that unlearning, while useful in some contexts, cannot be relied upon as a primary tool for content regulation in LLMs due to their ICL capabilities. They call for more robust filtering mechanisms and a rethinking of current unlearning approaches to deal with the complex challenge of ensuring models cannot perform impermissible acts.

Overall, this paper provides a critical assessment of the limitations of unlearning in the context of advanced generative AI, emphasizing the need for comprehensive strategies that include filtering and possibly redefined unlearning mechanisms to achieve effective content regulation.
