
Towards Safer Large Language Models through Machine Unlearning

(2402.10058)
Published Feb 15, 2024 in cs.CL

Abstract

The rapid advancement of LLMs has demonstrated their vast potential across various domains, attributed to their extensive pretraining knowledge and exceptional generalizability. However, LLMs can still produce harmful content when confronted with problematic prompts. To address this problem, existing work has applied gradient ascent based approaches to prevent LLMs from producing harmful output. While these methods can be effective, they frequently degrade the model's utility in responding to normal prompts. To address this gap, we introduce Selective Knowledge negation Unlearning (SKU), a novel unlearning framework for LLMs designed to eliminate harmful knowledge while preserving utility on normal prompts. Specifically, SKU consists of two stages: a harmful knowledge acquisition stage and a knowledge negation stage. The first stage aims to identify and acquire harmful knowledge within the model, whereas the second is dedicated to removing this knowledge. SKU selectively isolates and removes harmful knowledge in model parameters, ensuring the model's performance remains robust on normal prompts. Our experiments conducted across various LLM architectures demonstrate that SKU identifies a good balance point between removing harmful information and preserving utility.

Overview

  • The paper introduces Selective Knowledge negation Unlearning (SKU) aimed at removing harmful knowledge from LLMs without compromising their utility.

  • SKU's methodology involves two stages: acquiring harmful knowledge through a guided process, then negating that knowledge via model weight interpolation.

  • Experimental results show SKU effectively reduces harmful outputs in LLMs across various architectures, outperforming traditional methods like Fine-Tuning and Gradient Ascent in balancing unlearning with utility preservation.

  • The research suggests SKU's potential for wider application in AI safety, highlighting its role in advancing efforts to mitigate harmful content generation in LLMs while maintaining their performance.

Towards Balancing Unlearning Harmful Knowledge and Preserving Utility in LLMs with SKU

Introduction

LLMs have transformed numerous aspects of AI applications thanks to their extensive pretraining on vast textual data. Their ability to generalize and adapt to various tasks post-training is unparalleled. However, a significant challenge for LLMs is their potential to generate harmful or inappropriate content in response to certain prompts. Traditional approaches to mitigate this issue, including reinforcement learning from human feedback (RLHF), while effective, come with substantial computational cost and potential misalignment problems. This paper introduces an innovative framework called Selective Knowledge negation Unlearning (SKU), focusing on efficiently removing harmful knowledge in LLMs while preserving their utility on normal prompts.

Methodology

SKU operates in two primary stages: harmful knowledge acquisition and knowledge negation. In the first stage, harmful knowledge is intentionally learned from a corpus using a guided distortion module to recognize direct harmful responses, a random disassociation module to acquire diversified harmful content, and a preservation divergence module to ensure the learned harmful knowledge does not overlap significantly with normal content. The knowledge negation stage then employs task vector negation, inspired by recent work on model weight interpolation, to erase this harmful knowledge from the model. The paper details each module's role in tuning the learning and unlearning processes so that harmful output generation is diminished without degrading performance on benign inputs.
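To make the negation stage concrete, below is a minimal sketch of task vector negation in PyTorch. It assumes you already hold the original model and a copy fine-tuned on harmful data (the output of stage one); the function name and the scaling coefficient `lam` are illustrative placeholders, not the paper's actual API.

```python
import copy
import torch

def negate_harmful_knowledge(base_model, harmful_model, lam=1.0):
    """Sketch of task-vector negation.

    The task vector is the parameter-wise difference between the model
    fine-tuned on harmful data (stage one) and the original model.
    Subtracting a scaled copy of that vector from the original weights
    negates the acquired harmful knowledge while leaving the remaining
    parameters untouched.
    """
    base_params = dict(base_model.named_parameters())
    harmful_params = dict(harmful_model.named_parameters())

    unlearned = copy.deepcopy(base_model)
    with torch.no_grad():
        for name, param in unlearned.named_parameters():
            # Task vector for this parameter tensor.
            task_vector = harmful_params[name] - base_params[name]
            # Negate: move the original weights away from the harmful direction.
            param.copy_(base_params[name] - lam * task_vector)
    return unlearned
```

The scaling coefficient trades off how aggressively the harmful direction is removed against how far the weights drift from the original model; SKU's contribution lies in making the stage-one task vector isolate harmful knowledge well enough that this subtraction barely affects normal behavior.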

Experiments

The paper's experimental analysis employs a variety of LLM architectures and benchmarks SKU against several baselines, including Fine-Tuning (FT), Gradient Ascent (GA), and Task Vector approaches, along the dimensions of unlearning efficacy and utility. The results show that SKU drastically reduces harmful response rates without significantly increasing perplexity on normal prompts, generally outperforming the baselines in balancing unlearning and utility preservation.
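For intuition on the utility-preservation metric, here is a hedged sketch of how perplexity on a normal prompt can be measured with a Hugging Face causal LM; the checkpoint name and prompt are placeholders and do not reflect the paper's exact evaluation setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_perplexity(model, tokenizer, text):
    """Perplexity of a text under a causal LM: exp of the mean token negative log-likelihood."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the shifted cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Placeholder checkpoint for illustration only.
name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

print(prompt_perplexity(model, tokenizer, "Explain how photosynthesis works."))
```

Comparing this perplexity before and after unlearning, alongside the harmful response rate on problematic prompts, is the kind of two-sided check the paper uses to argue that SKU removes harmful knowledge without eroding utility.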

Discussion

Importantly, SKU's two-staged approach represents a methodologically distinctive path toward safer LLMs, focusing first on isolating harmful knowledge before selectively negating it. This process not only reduces the likelihood of generating harmful content but does so while retaining the model's original utility. The paper highlights the importance of each module within SKU, demonstrating through ablation studies how removing any component affects the overall efficacy and balance between unlearning and utility maintenance.

Future Directions

The paper speculates on future research directions, suggesting further exploration of SKU's applicability to other Right To Be Forgotten (RTBF) scenarios and its potential for broader implementations in LLM safety measures. Additionally, while SKU presents a substantial advancement in mitigating the generation of harmful content by LLMs, the quest for perfecting this balance continues, with the paper encouraging subsequent research to refine and enhance the approach.

Conclusion

This paper successfully introduces SKU, a novel framework for unlearning harmful knowledge in LLMs while maintaining utility. Through rigorous experimentation and detailed analysis, it demonstrates SKU's ability to significantly mitigate the risks associated with harmful content generation—a pivotal step towards safer, more reliable AI systems. By effectively addressing this critical challenge, SKU contributes meaningfully to the ongoing discourse on enhancing LLM safety and generalizability, marking a significant point of reference for future research in the field.
