From Understanding to Utilization: A Survey on Explainability for Large Language Models (2401.12874v2)
Abstract: Explainability for LLMs is a critical yet challenging aspect of natural language processing. As LLMs become integral to diverse applications, their "black-box" nature raises significant concerns about transparency and ethical use. This survey underscores the imperative for increased explainability in LLMs, covering both research on explainability itself and the methodologies and tasks that make use of such an understanding of these models. Our focus is primarily on pre-trained Transformer-based LLMs, such as the LLaMA family, which pose distinctive interpretability challenges due to their scale and complexity. We classify existing methods into local and global analyses according to their explanatory objectives. On the utilization side, we explore compelling approaches that apply explainability to model editing, controlled generation, and model enhancement. Additionally, we examine representative evaluation metrics and datasets, elucidating their advantages and limitations. Our goal is to reconcile theoretical and empirical understanding with practical implementation, proposing promising avenues for explanatory techniques and their applications in the era of LLMs.
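To make the "local analysis" category concrete, below is a minimal sketch of one common local explanation technique, gradient-based input saliency (gradient × input) for a single next-token prediction. This is an illustrative example, not the survey's own method; the `gpt2` checkpoint, the example sentence, and the use of the gradient-embedding product as the saliency score are all assumptions made here for demonstration.

```python
# Minimal sketch of a local, gradient-based input-saliency attribution for a
# causal LM. Assumptions: the "gpt2" checkpoint, the example prompt, and the
# gradient x input scoring rule are illustrative choices, not the survey's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM checkpoint could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The capital of France is"  # illustrative prompt
inputs = tokenizer(text, return_tensors="pt")

# Embed the tokens manually (and detach to get a leaf tensor) so gradients
# can be taken with respect to the input embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

outputs = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"])
next_token_logits = outputs.logits[0, -1]
predicted_id = next_token_logits.argmax()

# Gradient of the predicted token's logit with respect to each input embedding.
next_token_logits[predicted_id].backward()

# Saliency per input token: norm of (gradient x embedding).
saliency = (embeddings.grad[0] * embeddings[0]).norm(dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, saliency):
    print(f"{tok:>12s}  {score.item():.4f}")
```

Global analyses, by contrast, probe model components (e.g., attention heads or feed-forward "key-value" memories) rather than attributing a single prediction to its input tokens.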