Don't Blame the Data, Blame the Model: Understanding Noise and Bias When Learning from Subjective Annotations (2403.04085v1)

Published 6 Mar 2024 in cs.CL and cs.CY

Abstract: Researchers have raised awareness about the harms of aggregating labels, especially in subjective tasks that naturally contain disagreement among human annotators. In this work, we show that models trained only on aggregated labels exhibit low confidence on high-disagreement data instances. While previous studies treat such instances as mislabeled, we argue that high-disagreement text instances are hard to learn because conventional aggregated models underperform at extracting useful signals from subjective tasks. Inspired by recent studies demonstrating the effectiveness of learning from raw annotations, we investigate classification using Multiple Ground Truth (Multi-GT) approaches. Our experiments show improved confidence on high-disagreement instances.
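The abstract gestures at the mechanics without spelling them out, so below is a minimal sketch of one common Multi-GT variant: a shared encoder with one classification head per annotator, each head supervised by that annotator's raw label rather than the majority vote. Everything here is an illustrative assumption rather than the paper's implementation: the linear stand-in for a text encoder, the head-per-annotator design, the tensor sizes, and the use of label entropy as the per-instance disagreement score.

```python
# Minimal sketch (not the paper's implementation) contrasting majority-vote
# supervision with a Multi-GT model that keeps one head per annotator.
# Sizes, the encoder stand-in, and the entropy-based disagreement measure
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ANNOTATORS, N_CLASSES, DIM = 5, 2, 768  # assumed dataset/model sizes


def disagreement(raw_labels: torch.Tensor) -> torch.Tensor:
    """Entropy of the empirical label distribution per instance.

    raw_labels: (batch, n_annotators) integer labels from individual annotators.
    """
    counts = F.one_hot(raw_labels, N_CLASSES).float().sum(dim=1)  # (batch, classes)
    p = counts / counts.sum(dim=-1, keepdim=True)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)


class MultiGTClassifier(nn.Module):
    """Shared encoder with one classification head per annotator."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(DIM, 256)  # stand-in for a text encoder
        self.heads = nn.ModuleList(
            nn.Linear(256, N_CLASSES) for _ in range(N_ANNOTATORS)
        )

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        # (batch, annotators, classes): one set of logits per annotator head
        return torch.stack([head(h) for head in self.heads], dim=1)


def multi_gt_loss(logits, raw_labels):
    # Supervise each head with its own annotator's raw label,
    # instead of collapsing the annotations to a majority vote.
    return F.cross_entropy(logits.reshape(-1, N_CLASSES), raw_labels.reshape(-1))


def predictive_confidence(logits):
    # Average the per-head softmax distributions, then take the max probability.
    return F.softmax(logits, dim=-1).mean(dim=1).max(dim=-1).values


# Toy usage with random features and annotations
x = torch.randn(8, DIM)
raw = torch.randint(0, N_CLASSES, (8, N_ANNOTATORS))
model = MultiGTClassifier()
loss = multi_gt_loss(model(x), raw)
print(disagreement(raw), predictive_confidence(model(x)))
```

At inference time, averaging the per-head softmax distributions yields a single prediction whose confidence can be compared, instance by instance, against the disagreement score, which is the kind of analysis the abstract describes for high-disagreement data.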
