
Evaluating Webcam-based Gaze Data as an Alternative for Human Rationale Annotations (2402.19133v1)

Published 29 Feb 2024 in cs.CL

Abstract: Rationales in the form of manually annotated input spans usually serve as ground truth when evaluating explainability methods in NLP. They are, however, time-consuming and often biased by the annotation process. In this paper, we debate whether human gaze, in the form of webcam-based eye-tracking recordings, poses a valid alternative when evaluating importance scores. We evaluate the additional information provided by gaze data, such as total reading times, gaze entropy, and decoding accuracy with respect to human rationale annotations. We compare WebQAmGaze, a multilingual dataset for information-seeking QA, with attention and explainability-based importance scores for 4 different multilingual Transformer-based LLMs (mBERT, distil-mBERT, XLMR, and XLMR-L) and 3 languages (English, Spanish, and German). Our pipeline can easily be applied to other tasks and languages. Our findings suggest that gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty and further show a comparable ranking of explainability methods to that of human rationales.

Authors (5)
  1. Stephanie Brandl
  2. Oliver Eberle
  3. Tiago Ribeiro
  4. Anders Søgaard
  5. Nora Hollenstein

Summary

  • The paper investigates whether webcam-based eye-tracking can serve as a viable alternative to manual rationale annotations in NLP explainability.
  • It uses the multilingual WebQAmGaze dataset to compare gaze metrics with traditional attention scores across English, Spanish, and German.
  • Findings reveal that despite hardware limitations, gaze data provide robust linguistic insights and consistent ranking of model explanations.

Assessing Webcam-based Eye-tracking as a Viable Alternative for Annotating Rationales in NLP Explainability

Introduction

In NLP and Explainable AI (XAI), annotated rationales have been a cornerstone for evaluating the effectiveness and reliability of models. Manually annotating these rationales, however, is not only time-consuming but also subject to biases arising from the annotation process. This has led researchers to explore alternative ways of capturing human reasoning. One such method, the focus of this paper, uses webcam-based eye-tracking recordings to infer the importance scores typically derived from manual annotations. The authors examine WebQAmGaze, a multilingual dataset for information-seeking QA, and test how well it parallels traditional rationale annotations through a comprehensive comparison with attention- and explainability-based importance scores across multiple Transformer-based LLMs and languages.

Data and Methodology

The study builds on the WebQAmGaze dataset, which comprises webcam-based eye-tracking recordings collected while participants answer questions in English, Spanish, and German. It assesses whether gaze data, summarized through metrics such as total reading time, gaze entropy, and decoding accuracy, can serve as a reliable proxy for manually annotated rationales.
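To make the gaze metrics concrete, here is a minimal sketch of two of them: total reading time (fixation durations aggregated per token) and gaze entropy (Shannon entropy of the fixation distribution over tokens, where low entropy indicates focused reading and high entropy indicates dispersed attention). The function names and data shapes are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def total_reading_time(fixations):
    """Sum fixation durations (ms) per token index.

    `fixations` is assumed to be a list of (token_index, duration_ms) pairs.
    """
    times = {}
    for token_idx, duration_ms in fixations:
        times[token_idx] = times.get(token_idx, 0) + duration_ms
    return times

def gaze_entropy(fixated_tokens):
    """Shannon entropy (bits) of the fixation distribution over tokens.

    Low entropy = reading concentrated on few tokens; high = dispersed.
    """
    counts = Counter(fixated_tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: two fixations on token 0, one on token 1.
reading_times = total_reading_time([(0, 200), (1, 150), (0, 100)])
entropy = gaze_entropy(["the", "the", "answer", "answer"])
```

Per-token reading times can then be normalized into an importance distribution that is directly comparable with model-based importance scores.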

Results

The analysis yielded several noteworthy findings:

  • Gaze Data as a Linguistic Insight Tool: The gaze data provided valuable insights into linguistic processing, potentially serving as an indicator of task difficulty.
  • Comparable Rankings: The rankings of explainability methods derived from gaze data closely mirrored those obtained from human-annotated rationales across the languages and models tested.
  • Effectiveness Across Languages: Decoding accuracy varied across languages, with particularly promising results for German. This suggests that the efficacy of gaze data as an alternative to manually annotated rationales might be language-dependent.
  • Webcam-based Eye-tracking Feasibility: Despite varying data quality primarily due to hardware constraints (e.g., the use of glasses affecting tracking accuracy), webcam-based eye-tracking emerged as a cost-effective method that could, with certain limitations, replicate the insights provided by lab-quality eye-tracking and manual annotations.
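The "comparable rankings" finding can be sketched as a rank-correlation check: if gaze-derived importance scores and human rationales order the tokens of a passage similarly, their Spearman correlation is high. The following pure-Python sketch (tie-aware ranking plus Pearson correlation of the ranks) illustrates the idea; it is an assumption about the general technique, not the paper's exact evaluation code.

```python
import math

def rankdata(values):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical per-token scores: gaze-based vs. attention-based importance.
gaze_scores = [0.4, 0.1, 0.3, 0.2]
model_scores = [0.5, 0.05, 0.35, 0.1]
agreement = spearman(gaze_scores, model_scores)
```

Computing such a correlation per explainability method, then comparing the resulting method rankings against those obtained from human rationales, is one straightforward way to operationalize "comparable rankings".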

Practical and Theoretical Implications

This paper conveys important implications for both the practical application of eye-tracking in XAI and the theoretical understanding of human reasoning processes in NLP tasks. Practically, leveraging webcam-based eye-tracking can significantly reduce the resources required for rationale annotation, making large-scale studies more feasible and possibly enriching datasets with additional annotations that capture a different dimension of human cognition. Theoretically, the findings support the hypothesis that human gaze patterns, indicative of cognitive engagement and information processing, can serve as a meaningful proxy for identifying relevant text spans that explain model decisions.

Future Directions

While this paper lays a solid foundation, further exploration is warranted. Future research should expand the variety of tasks and languages examined, integrate models beyond the Transformer architecture, and address the data-quality limitations of webcam-based eye-tracking. Additionally, combining gaze data with other psychophysiological signals could further enrich our understanding of the cognitive processes underpinning task performance and model explainability.

Conclusion

In summary, this paper advocates a shift toward low-cost, webcam-based eye-tracking as a viable supplementary, if not alternative, method for annotating rationales when evaluating NLP explainability. While it acknowledges the limitations of the current methodology, particularly regarding data quality and language diversity, it underscores the potential of gaze data to offer valuable insights into human cognition and reasoning. Going forward, bridging the gap between human cognitive processes and AI explainability remains a pivotal task, and gaze data can contribute significantly to it.
