Intervention Lens: From Representation Surgery to String Counterfactuals (2402.11355v4)
Abstract: Interventions targeting the representation space of large language models (LLMs) have emerged as an effective means of influencing model behavior. Such methods are employed, for example, to eliminate or alter the encoding of demographic information such as gender within the model's representations, thereby creating a counterfactual representation. However, because the intervention operates within the representation space, it is challenging to understand precisely which aspects of the text it modifies. In this paper, we present a method for converting representation counterfactuals into string counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation-space intervention and to interpret the features used to encode a specific concept. Moreover, the resulting counterfactuals can be used to mitigate bias in classification through data augmentation.
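To make the first stage of this pipeline concrete, below is a minimal sketch of a linear concept-erasure intervention in the spirit of the nullspace-projection methods the paper builds on (e.g., INLP/LEACE). All names, shapes, and the probe direction are illustrative assumptions, not the authors' implementation; the paper's contribution, mapping the edited representation back into a string counterfactual, would require a separate text-inversion model and is only noted in a comment.

```python
# Illustrative sketch of a linear concept-erasure intervention (INLP/LEACE-style).
# Everything here is a stand-in: the probe direction, dimensions, and data are
# hypothetical, not the paper's actual pipeline.
import numpy as np

def erasure_projection(concept_directions: np.ndarray) -> np.ndarray:
    """Build a projection matrix that removes the span of the given
    concept directions from any representation.

    concept_directions: (k, d) matrix whose rows span the concept subspace,
    e.g., directions a linear probe finds predictive of gender.
    """
    # Orthonormal basis of the concept subspace via reduced QR decomposition.
    q, _ = np.linalg.qr(concept_directions.T)  # shape (d, k)
    d = concept_directions.shape[1]
    # P = I - Q Q^T projects onto the orthogonal complement of the subspace.
    return np.eye(d) - q @ q.T

# Hypothetical usage: h stands in for a d-dimensional LM representation of a text.
rng = np.random.default_rng(0)
h = rng.normal(size=768)        # stand-in hidden representation
w = rng.normal(size=(1, 768))   # stand-in probe direction for the concept
P = erasure_projection(w)
h_cf = P @ h                    # counterfactual representation

# The concept direction is nulled out in the edited representation.
assert np.allclose(w @ h_cf, 0.0)

# Second stage (not sketched): invert h_cf back into text with an inversion
# model to obtain the string counterfactual analyzed in the paper.
```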
Authors: Matan Avitan, Ryan Cotterell, Yoav Goldberg, Shauli Ravfogel