Machine Unlearning Fails to Remove Data Poisoning Attacks

(arXiv:2406.17216)
Published Jun 25, 2024 in cs.LG, cs.AI, cs.CR, and cs.CY

Abstract

We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of training on poisoned data. We experimentally demonstrate that, while existing unlearning methods have been demonstrated to be effective in a number of evaluation settings (e.g., alleviating membership inference attacks), they fail to remove the effects of data poisoning, across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly-introduced Gaussian poisoning attack) and models (image classifiers and LLMs); even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned datapoints without having to retrain, our work suggests that these methods are not yet "ready for prime time", and currently provide limited benefit over retraining.

Figure: Empirical tradeoff curves and analytical Gaussian tradeoff curves, before and after unlearning poison.

Overview

  • The paper comprehensively reevaluates the efficacy of current machine unlearning algorithms in mitigating the impact of various data poisoning attacks, including indiscriminate, targeted, and Gaussian attacks.

  • An extensive experimental analysis covers seven unlearning algorithms across standard language and vision classification tasks. The study introduces the Gaussian Unlearning Score (GUS) to measure how well unlearning removes the effects of Gaussian data poisoning.

  • Two primary hypotheses were identified for the failure of unlearning methods: large model shifts induced by poison samples and orthogonal model shifts that gradient-based unlearning updates cannot correct. The paper emphasizes the need for new, more reliable unlearning techniques or provable guarantees.

Machine Unlearning Fails to Remove Data Poisoning Attacks

The paper "Machine Unlearning Fails to Remove Data Poisoning Attacks" by Martin Pawelczyk, Jimmy Z. Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, and Seth Neel presents a comprehensive reevaluation of the efficacy of current machine unlearning algorithms in handling data poisoning attacks. The authors perform extensive experimental analyses to gauge whether these unlearning methods can effectively mitigate the impact of several types of data poisoning attacks, including indiscriminate, targeted, and a newly introduced Gaussian poisoning attack.

Key Findings

  1. Broad Failure Across Methods and Metrics:

    • The study demonstrates that state-of-the-art unlearning algorithms generally fail to remove the effects of data poisoning under various settings. Even when granted a generous compute budget (relative to retraining), none of the methods could reliably mitigate the adverse impacts of the poisoning attacks.
  2. Implementation and Evaluation:

    • Seven unlearning algorithms were examined across standard language and vision classification tasks: Gradient Descent (GD), Noisy Gradient Descent (NGD), Gradient Ascent (GA), Exact Unlearning of the last k layers (EUk), Catastrophic Forgetting of the last k layers (CFk), SCRUB, and NegGrad+.
    • Evaluation involved metrics tailored to each poisoning type. For Gaussian data poisoning, a new measure called the Gaussian Unlearning Score (GUS) was introduced, which captures the correlation between the injected noise and the model's gradients (a simplified version of such a score is sketched after this list).
  3. Diverse Challenges in Unlearning:

    • Different types of data poisoning attacks present unique challenges:
      • Targeted Data Poisoning: The success of unlearning algorithms varied, with many failing to revert the effects on specified target samples.
      • Indiscriminate Data Poisoning: Methods like GD showed some improvement in model performance, yet failed to provide substantial benefits over retraining.
      • Gaussian Data Poisoning: Techniques such as NGD on ResNet-18 showed a decrease in the Gaussian Unlearning Score after unlearning, but not to the extent achieved by retraining, highlighting a gap in efficacy.
    • The success of unlearning algorithms was highly dependent on the underlying task, with some methods showing partial success in text classification but failing in image classification, and vice versa.
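
As a concrete illustration of the GUS-style evaluation, here is a minimal sketch under our own assumptions rather than the paper's exact formula: it takes a PyTorch classifier, the poisoned inputs, their labels, and the Gaussian noise vectors that were injected, and scores the average cosine similarity between each noise vector and the model's input gradient at the corresponding poisoned point. The function name and signature are hypothetical.

    # Sketch of a Gaussian-Unlearning-Score-style metric (illustrative; not the
    # paper's exact definition). Intuition: if the poison's influence has been
    # removed, the model's input gradients at the poisoned points should be
    # uncorrelated with the injected noise, driving the score toward zero.
    import torch
    import torch.nn.functional as F

    def gaussian_unlearning_score(model, poisoned_inputs, labels, noise):
        """poisoned_inputs = clean inputs + noise; noise = the injected Gaussian vectors."""
        model.eval()
        x = poisoned_inputs.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), labels)
        (input_grads,) = torch.autograd.grad(loss, x)

        g = input_grads.flatten(1)               # (n, d) per-sample input gradients
        z = noise.flatten(1)                     # (n, d) injected noise vectors
        corr = F.cosine_similarity(g, z, dim=1)  # per-sample correlation
        return corr.mean().item()

Comparing this score for the unlearned model against a model retrained from scratch without the poisons (whose score should be near zero) mirrors the gap described in the Gaussian data poisoning bullet above.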

Hypotheses and Failures

The authors attribute the failure of the unlearning methods to two hypothesized mechanisms:

  1. Large Model Shift Induced by Poisons:

    • The authors hypothesize that poison samples induce a larger model shift than random clean samples. This increased shift necessitates more update steps for effective unlearning, which the tested algorithms could not achieve within the practical computational budget.
    • Experiments using logistic regression on ResNet-18 features support this hypothesis, showing large ℓ1-norm distances between models trained with and without the poisoned data.
  2. Orthogonal Model Shifts:

    • Poison samples shift the model in a subspace orthogonal to that spanned by clean training samples. Gradient-based unlearning updates using only clean samples fail to correct shifts within this orthogonal subspace.
    • Linear regression experiments demonstrated that the desired update direction for mitigating poisoning is orthogonal to the gradient descent updates computed on clean data (a simplified version of this check is sketched below).
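
The following toy reproduction of this reasoning is our own construction under simplifying assumptions (least-squares linear regression, synthetic Gaussian features, and a crude label-flipping poison), not the paper's experiment. It measures the ℓ1 norm of the poison-induced parameter shift, then splits that shift into the part lying inside the span of the clean inputs, the only subspace that gradient updates on clean data can move the model within under squared loss, and the part orthogonal to it.

    # Toy check of both hypotheses on linear regression (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, n_poison = 200, 500, 20     # d > n, so a nontrivial orthogonal complement exists

    X_clean = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y_clean = X_clean @ w_true + 0.1 * rng.normal(size=n)

    # Synthetic poisons: large-magnitude inputs with adversarially flipped targets.
    X_poison = 5.0 * rng.normal(size=(n_poison, d))
    y_poison = -(X_poison @ w_true)

    def min_norm_fit(X, y):
        # Minimum-norm least-squares solution (stand-in for training to convergence).
        return np.linalg.lstsq(X, y, rcond=None)[0]

    w_clean = min_norm_fit(X_clean, y_clean)
    w_poisoned = min_norm_fit(np.vstack([X_clean, X_poison]),
                              np.concatenate([y_clean, y_poison]))

    shift = w_clean - w_poisoned      # direction unlearning must move the model
    print("l1 norm of poison-induced shift:", np.abs(shift).sum())

    # Under squared loss, gradients on clean data are linear combinations of the
    # clean rows, so gradient-based unlearning only moves the model inside
    # span(X_clean). Split the needed shift into reachable and unreachable parts.
    Q, _ = np.linalg.qr(X_clean.T)    # orthonormal basis for the span of clean inputs
    reachable = Q @ (Q.T @ shift)     # component clean-gradient updates could correct
    unreachable = shift - reachable   # component orthogonal to every clean gradient
    print("fraction of shift reachable via clean gradients:",
          np.linalg.norm(reachable) / np.linalg.norm(shift))
    print("fraction orthogonal (cannot be corrected):",
          np.linalg.norm(unreachable) / np.linalg.norm(shift))

In constructions like this, a sizeable fraction of the shift typically falls outside the clean span, which is the geometric picture behind the second hypothesis.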

Implications and Future Directions

The findings suggest that heuristic methods for machine unlearning may convey a false sense of security. The results advocate for more comprehensive evaluations involving diverse attack vectors and stress the need for either provable guarantees or thorough empirical validation of unlearning algorithms. In particular, the study underscores that:

  • Current heuristic unlearning methods are not sufficiently reliable for deployment in real-world scenarios.
  • Future research should prioritize developing new unlearning techniques that can effectively handle the varied effects of data poisoning without the prohibitive costs associated with complete retraining.

The results also point to practical recommendations for improving existing methods: aligning unlearning updates more closely with the specific directions induced by poisons, leveraging additional structural information about the model, and combining multiple unlearning strategies may yield more robust solutions. The paper sets a valuable benchmark and guidepost for future unlearning research aiming for more dependable outcomes.
