Evaluating Webcam-based Gaze Data as an Alternative for Human Rationale Annotations (2402.19133v1)
Abstract: Rationales in the form of manually annotated input spans usually serve as ground truth when evaluating explainability methods in NLP. They are, however, time-consuming to collect and often biased by the annotation process. In this paper, we investigate whether human gaze, in the form of webcam-based eye-tracking recordings, is a valid alternative when evaluating importance scores. We evaluate the additional information provided by gaze data, such as total reading times, gaze entropy, and decoding accuracy with respect to human rationale annotations. We compare WebQAmGaze, a multilingual dataset for information-seeking QA, with attention- and explainability-based importance scores for 4 different multilingual Transformer-based LLMs (mBERT, distil-mBERT, XLMR, and XLMR-L) and 3 languages (English, Spanish, and German). Our pipeline can easily be applied to other tasks and languages. Our findings suggest that gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty, and that it yields a ranking of explainability methods comparable to the one derived from human rationales.
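As a rough illustration of the kind of comparison the abstract describes, the sketch below correlates per-token model importance scores with per-token gaze reading times (via a simplified Spearman rank correlation without tie handling) and computes a Shannon-entropy-style gaze entropy over fixation mass. All values and function names are hypothetical; this is not the paper's actual pipeline, only a minimal sketch of the underlying measures.

```python
import math

def gaze_entropy(fixation_counts):
    """Shannon entropy (bits) of the distribution of gaze over tokens.
    Higher entropy = gaze spread more evenly across the passage."""
    total = sum(fixation_counts)
    probs = [c / total for c in fixation_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def spearman(xs, ys):
    """Simplified Spearman rank correlation (no tie correction)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-token values for one short passage.
reading_times = [120, 340, 90, 410, 60]          # webcam gaze: total reading time (ms)
importance = [0.10, 0.35, 0.05, 0.40, 0.10]      # e.g. attention-flow importance scores

print(round(spearman(reading_times, importance), 3))
print(round(gaze_entropy(reading_times), 3))
```

Running this per passage and averaging the correlations gives one way to rank explainability methods against gaze, analogous to ranking them against annotated rationale spans.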
- Samira Abnar and Willem Zuidema. 2020. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online. Association for Computational Linguistics.
- Identifying and measuring annotator bias based on annotators’ demographic characteristics. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 184–190, Online. Association for Computational Linguistics.
- XAI for transformers: Better explanations through conservative propagation. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 435–451. PMLR.
- Towards better understanding of gradient-based attribution methods for deep neural networks. In ICLR (Poster). OpenReview.net.
- On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856.
- A diagnostic study of explainability techniques for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3256–3274, Online. Association for Computational Linguistics.
- On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140.
- How to explain individual classification decisions. Journal of Machine Learning Research, 11(61):1803–1831.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 302–312, Brussels, Belgium. Association for Computational Linguistics.
- Eye gaze and self-attention: How humans and transformers attend words in sentences. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 75–87, Dublin, Ireland. Association for Computational Linguistics.
- Stephanie Brandl and Nora Hollenstein. 2022. Every word counts: A multilingual analysis of individual human alignment with model attention. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 72–77, Online only. Association for Computational Linguistics.
- e-SNLI: Natural language inference with natural language explanations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9539–9549. Curran Associates, Inc.
- Human attention in visual question answering: Do humans and deep networks look at the same regions? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 932–937, Austin, Texas. Association for Computational Linguistics.
- ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, Online. Association for Computational Linguistics.
- Gaze entropy reflects surgical task load. Surgical Endoscopy, 30:5034–5043.
- Oliver Eberle. 2022. Explainable structured machine learning. Ph.D. thesis, Technische Universität Berlin.
- Do transformer models show similar attention patterns to task-specific human gaze? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4295–4309, Dublin, Ireland. Association for Computational Linguistics.
- Onur Ferhat and Fernando Vilariño. 2016. Low cost eye tracking: The current panorama. Computational Intelligence and Neuroscience, 2016.
- Is home-based webcam eye-tracking with older adults living with and without Alzheimer's disease feasible? In Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility, pages 1–3.
- Victor Petrén Bach Hansen and Anders Søgaard. 2021. Guideline bias in Wizard-of-Oz dialogues. In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, pages 8–14, Online. Association for Computational Linguistics.
- Quantus: An explainable AI toolkit for responsible evaluation of neural network explanations and beyond. Journal of Machine Learning Research, 24(34):1–11.
- Nora Hollenstein and Lisa Beinborn. 2021. Relative importance in sentence processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 141–150, Online. Association for Computational Linguistics.
- Multilingual language models predict human reading behavior. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 106–123, Online. Association for Computational Linguistics.
- Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458.
- Webcam-based eye tracking to detect mind wandering and comprehension errors. Behavior Research Methods, pages 1–17.
- Looking deep in the eyes: Investigating interpretation methods for neural models on reading tasks using human eye-movement behaviour. Information Processing & Management, 60(2):103195.
- Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.
- Improving sentence compression by learning to predict gaze. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1528–1533, San Diego, California. Association for Computational Linguistics.
- Gaze movement’s entropy analysis to detect workload levels. In Proceedings of the International Conference on Trends in Computational and Cognitive Engineering (TCCE 2020), pages 147–154. Springer.
- Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.
- A cross-lingual comparison of human and model relative word importance. In Proceedings of the 2022 CLASP Conference on (Dis)embodiment, pages 11–23, Gothenburg, Sweden. Association for Computational Linguistics.
- Nils Murrugarra-Llerena and Adriana Kovashka. 2017. Learning attributes from human gaze. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 510–519. IEEE.
- SearchGazer: Webcam eye tracking for remote studies of web search. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, pages 17–26.
- Don’t blame the annotator: Bias already starts in the annotation instructions. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1779–1789, Dubrovnik, Croatia. Association for Computational Linguistics.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
- WebQAmGaze: A multilingual webcam eye-tracking-while-reading dataset. arXiv preprint arXiv:2303.17876.
- Avi Rosenfeld. 2021. Better metrics for evaluating explainable artificial intelligence. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’21, page 45–50, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.
- Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215.
- Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3):247–278.
- Philipp Schmidt and Felix Bießmann. 2019. Quantifying interpretability and trust in machine learning systems. CoRR, abs/1901.08558.
- Kilian Semmelmann and Sarah Weigelt. 2018. Online webcam-based eye tracking in cognitive science: A first look. Behavior Research Methods, 50(2):451–465.
- Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.
- Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 3145–3153.
- Interpreting attention models with human visual attention in machine reading comprehension. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 12–25, Online. Association for Computational Linguistics.
- William R. Swartout and Johanna D. Moore. 1993. Explanation in Second Generation Expert Systems, page 543–585. Springer-Verlag, Berlin, Heidelberg.
- Being right for whose right reasons? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1033–1054, Toronto, Canada. Association for Computational Linguistics.
- Annotation for annotation: Toward eliciting implicit linguistic knowledge through annotation (project note). In Proceedings of the 9th Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, pages 79–84.
- Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
- AllenNLP interpret: A framework for explaining predictions of NLP models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 7–12, Hong Kong, China. Association for Computational Linguistics.
- Eye-tracking metrics predict perceived workload in robotic surgical skills training. Human Factors, 62(8):1365–1386.
- Zhengxuan Wu and Desmond C. Ong. 2020. On explaining your explanations of BERT: An empirical study with sequence classification. arXiv preprint.
- TurkerGaze: Crowdsourcing saliency with webcam-based eye tracking.
- Using “annotator rationales” to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267.
- Towards a unified evaluation of explanation methods without ground truth. CoRR, abs/1911.09017.
- Yingyi Zhang and Chengzhi Zhang. 2019. Using human attention to extract keyphrase from microblog post. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5867–5872, Florence, Italy. Association for Computational Linguistics.
- Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5).
Authors: Stephanie Brandl, Oliver Eberle, Tiago Ribeiro, Anders Søgaard, Nora Hollenstein