Who's Harry Potter? Approximate Unlearning in LLMs

(2310.02238)
Published Oct 3, 2023 in cs.CL and cs.AI

Abstract

LLMs are trained on massive internet corpora that often contain copyrighted content. This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers. In this paper, we propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch. We evaluate our technique on the task of unlearning the Harry Potter books from the Llama2-7b model (a generative language model recently open-sourced by Meta). While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model's ability to generate or recall Harry Potter-related content, while its performance on common benchmarks (such as WinoGrande, HellaSwag, ARC, BoolQ and PIQA) remains almost unaffected. We make our fine-tuned model publicly available on HuggingFace for community evaluation. To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models. Our technique consists of three main components: First, we use a reinforced model that is further trained on the target data to identify the tokens that are most related to the unlearning target, by comparing its logits with those of a baseline model. Second, we replace idiosyncratic expressions in the target data with generic counterparts, and leverage the model's own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data. Third, we finetune the model on these alternative labels, which effectively erases the original text from the model's memory whenever it is prompted with its context.

Overview

  • The paper tackles the challenge of removing sensitive data from trained LLMs without full retraining, proposing a more efficient 'approximate unlearning' method.

  • The new approach involves three stages: isolating the data to be forgotten, modifying it with generic alternatives and label predictions, and fine-tuning the LLM with these revisions.

  • The technique maintains the LLM's general linguistic capabilities while reducing its recall of the specific unlearned content, as shown by benchmarks and targeted prompts.

  • Results show the model's performance on general tasks remained intact, while its ability to regurgitate the removed data significantly decreased.

  • The paper suggests future work to refine the method, improve its generalizability, and calls for the AI community to engage in deeper research and adversarial testing.

Introduction

The field of AI continually grapples with the ethical, legal, and technological repercussions of the data used to train LLMs. One particular challenge is the development of approaches to selectively excise sensitive or problematic data from a fully trained model, a process termed "unlearning". The conventional strategy of full retraining is computationally exorbitant, prompting the search for more efficient alternatives. The work under discussion innovates in this space by proposing a technique for approximate unlearning in LLMs without exhaustive retraining.

Approach and Implementation

Delineating the method, the paper introduces a three-part technique applied to the Llama2-7b model. The first step further trains a reinforced model on the target data and compares its logits with those of the baseline to isolate the tokens most strongly linked with the content to be forgotten. The second step modifies the unlearning target by replacing idiosyncratic expressions with generic counterparts and uses the model's own predictions to generate alternative labels for every token, approximating the next-token predictions of a model never exposed to the target data. Finally, the model is fine-tuned on these alternative labels to induce forgetting. This approach eschews the retrain-from-scratch paradigm, instead fine-tuning with a focus on targeted data removal and achieving material results at a tiny fraction of the original training cost (roughly 1 GPU-hour versus over 184K GPU-hours of pretraining).
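
A minimal sketch of the label-construction step is given below, assuming a baseline Llama2-7b and a "reinforced" copy that has been further fine-tuned on the target text. The model identifiers, the coefficient alpha, and the generic_labels helper are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: derive alternative next-token labels by contrasting a reinforced
# model (further trained on the unlearning target) with the baseline model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASELINE_ID = "meta-llama/Llama-2-7b-hf"      # baseline model (assumed id)
REINFORCED_ID = "path/to/reinforced-model"    # placeholder for the reinforced copy

tokenizer = AutoTokenizer.from_pretrained(BASELINE_ID)
baseline = AutoModelForCausalLM.from_pretrained(BASELINE_ID, torch_dtype=torch.float16)
reinforced = AutoModelForCausalLM.from_pretrained(REINFORCED_ID, torch_dtype=torch.float16)

@torch.no_grad()
def generic_labels(text: str, alpha: float = 5.0) -> torch.Tensor:
    """Per-token alternative labels approximating a model never trained on `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    v_base = baseline(ids).logits        # shape: [1, seq_len, vocab]
    v_reinf = reinforced(ids).logits
    # Tokens whose logits the reinforced model boosts are the ones most tied to
    # the unlearning target; subtracting that boost pushes the combined
    # distribution toward generic continuations.
    v_generic = v_base - alpha * torch.relu(v_reinf - v_base)
    return v_generic.argmax(dim=-1)      # alternative next-token labels
```

In the paper, these alternative labels are combined with the dictionary-style replacement of idiosyncratic expressions before serving as cross-entropy targets for the final fine-tuning pass.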

Evaluation and Outcomes

The evaluation methodology centers on assessing the model's retention of general linguistic ability against its loss of the unlearned information. Retention is validated through established benchmarks such as WinoGrande and HellaSwag, while forgetting is gauged using specifically crafted prompts designed to elicit information related to the unlearned content. The results indicate that the model's overall performance on general tasks is maintained, paralleled by a marked reduction in its ability to recall specifics from the expunged data. This balance substantiates the effectiveness of the proposed approach, though the authors acknowledge room for further refinement, particularly regarding the method's generalizability.
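
For the forgetting side, the following is a rough sketch of the kind of targeted-prompt probe described above, using the Hugging Face transformers library. The model identifier, the prompts, and the keyword list are illustrative assumptions rather than the authors' evaluation harness.

```python
# Sketch: probe the unlearned model with Harry Potter-specific prompts and
# check whether greedy completions still surface franchise-specific terms.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Llama2-7b-WhoIsHarryPotter"  # authors' released model (assumed id)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompts = [
    "Harry Potter's two best friends are",
    "When Harry went back to class, he saw that his best friends,",
]
leak_terms = ["Ron", "Hermione", "Hogwarts", "Voldemort", "Quidditch"]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    completion = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    leaked = [t for t in leak_terms if t in completion]
    print(f"{prompt!r} -> {completion!r} | leaked terms: {leaked}")
```

Benchmark retention (WinoGrande, HellaSwag, and similar) would be measured separately with a standard evaluation harness; the probe above only checks recall of the unlearned content.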

Conclusions and Future Work

In sum, this paper presents an innovative step forward in the dynamic adaptation of LLMs post-training—fine-tuning them to conform to legal requirements, ethical norms, or particularized data handling needs. The proposed approximate unlearning method holds promise, especially for copyrighted content, yet the research identifies potential limitations when dealing with different types of data like non-fiction or textbooks. The concluding section invites the AI community to undertake deeper explorations and adversarial testing, offering the fine-tuned model as an open challenge on Hugging Face. The goal is to develop a robust unlearning process, further optimizing the balance between retaining core capabilities and eradicating specific, undesired knowledge from LLMs. The authors express hope for the method to be a stepping stone toward more responsible AI stewardship.
