The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

Published 4 Nov 2022 in cs.CL and cs.LG | (2211.02570v1)

Abstract: Human variation in labeling is often considered noise. Annotation projects for ML aim at minimizing human label variation, with the assumption to maximize data quality and in turn optimize and maximize machine learning metrics. However, this conventional practice assumes that there exists a ground truth, and neglects that there exists genuine human variation in labeling due to disagreement, subjectivity in annotation or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling and evaluation. However, few works consider all of these dimensions jointly; and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward. As datasets are becoming increasingly available, we hope that this synthesized view on the 'problem' will lead to an open discussion on possible strategies to devise fundamentally new directions.

Authors (1)

Barbara Plank

Citations (79)

Summary

The paper argues that aggregating human labels into a single ground truth can obscure inherent subjectivity and annotation ambiguity.
It surveys approaches that use un-aggregated, annotator-level data in model training, and identifies gaps in existing work.
The author advocates revising evaluation practices by incorporating soft metrics, such as cross-entropy and Kullback-Leibler divergence, that reflect label variability.

Human Label Variation in Machine Learning: An Overview

Introduction

Human label variation (HLV) presents a critical challenge in machine learning (ML), particularly in natural language processing (NLP) and computer vision (CV). The paper "The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation" (2211.02570) addresses an often-overlooked aspect of ML: the assumption of a single ground truth. The conventional ML pipeline, comprising data collection, modeling, and evaluation, typically aggregates human labels into a gold standard and thereby neglects genuine variation in human annotation. Because this choice affects every stage of the pipeline, the paper calls for reevaluating how human label data is viewed and used.

Data and Human Label Variation

The foundation of any ML system is its data, which must be both valid and reliable. However, annotators often disagree, even on seemingly objective tasks. The paper defines HLV as plausible variation in annotation and prefers this term over "disagreement", which implies that the differing views are incompatible. HLV instead covers subjectivity, genuine disagreement, and ambiguity, each of which can yield multiple plausible annotations for the same item.

To address this, the paper advocates collecting and releasing un-aggregated, annotator-level data, together with metadata such as annotator background and documentation of the annotation process. The author notes that datasets with multiple annotations per item are increasingly available, and that these resources can support a more nuanced understanding of model behavior and limitations.
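As an illustration of what annotator-level data could look like, here is a minimal Python sketch. The class names, field names, and metadata keys are hypothetical and are not a format prescribed by the paper; the point is only that each judgment is stored separately, with the annotator identity preserved.

```python
from dataclasses import dataclass, field


@dataclass
class Annotator:
    # Hypothetical schema: the metadata keys are illustrative only.
    annotator_id: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Annotation:
    # One row per (item, annotator) judgment; nothing is aggregated.
    item_id: str
    annotator_id: str
    label: str


def labels_by_item(annotations):
    """Group the individual judgments by item, keeping every label."""
    grouped = {}
    for a in annotations:
        grouped.setdefault(a.item_id, []).append(a.label)
    return grouped


annotators = [
    Annotator("a1", {"background": "linguistics student"}),
    Annotator("a2", {"background": "crowdworker"}),
]
annotations = [
    Annotation("item_1", "a1", "pos"),
    Annotation("item_1", "a2", "neg"),
    Annotation("item_2", "a1", "pos"),
]
grouped = labels_by_item(annotations)
```

Keeping one row per judgment means an aggregated label can always be derived later, whereas the reverse is not possible once only the aggregate is released.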

Figure 1: NLP Resource papers per publication year, counting publicly-available datasets released with human label variation.

Modeling and Human Label Variation

For modeling, the paper divides existing methods into those that resolve HLV and those that embrace it. Traditional methods resolve it: aggregation (e.g., majority voting) consolidates the labels for an item into a single label, and filtering discards low-agreement items. Both discard information about how annotators actually labeled the data.
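The two resolving strategies can be sketched in a few lines of Python. This is a minimal illustration rather than any specific method from the paper; the toy data, the tie-breaking rule, and the 0.8 agreement threshold are assumptions chosen for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def agreement(labels):
    """Fraction of annotators who chose the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def filter_low_agreement(items, threshold=0.8):
    """Keep only items whose majority share reaches the (assumed) threshold."""
    return {k: v for k, v in items.items() if agreement(v) >= threshold}


# Toy annotator-level labels (illustrative data, not from the paper).
annotations = {
    "item_1": ["pos", "pos", "pos", "pos", "pos"],
    "item_2": ["pos", "neg", "pos", "neg", "neutral"],
}

gold = {k: majority_vote(v) for k, v in annotations.items()}
kept = filter_low_agreement(annotations)
```

Here both items receive the aggregated label "pos", although item_2 is a two-way tie plus a third opinion, and filtering removes item_2 from the dataset altogether. In either case the split among annotators is no longer visible downstream.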

Emerging approaches instead embrace HLV, either by learning directly from un-aggregated labels or by enriching models with information about human label variation. Examples include repeated labeling and soft-label multi-task learning. Although such methods have shown promise in areas like CV, their adoption in NLP is still at an early stage, and the paper calls for comprehensive evaluations and for study of the task-specific properties that may determine which method is suitable.
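One simple way to learn from un-aggregated labels is to turn the votes for each item into a probability distribution (a soft label) and train against it with cross-entropy. The sketch below shows that idea in plain Python; it is a generic soft-label loss under assumed toy data, not the specific multi-task setup mentioned above, and the function names are illustrative.

```python
import math
from collections import Counter


def soft_label(votes, classes):
    """Normalize annotator votes into a distribution over the classes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(target, logits):
    """Cross-entropy between a soft target and the model's softmax output."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


classes = ["pos", "neg", "neutral"]
target = soft_label(["pos", "neg", "pos", "neg", "neutral"], classes)
loss = soft_cross_entropy(target, [0.0, 0.0, 0.0])  # uniform prediction
```

With a one-hot target this reduces to the usual hard-label cross-entropy, so the soft version is a strict generalization: the only change is that the target keeps the annotators' split instead of collapsing it.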

Evaluation and Human Label Variation

Evaluation in ML, NLP, and CV predominantly relies on accuracy-style metrics computed against a single ground truth. This is insufficient for tasks characterized by HLV, because a hard-label score gives no insight into model reasoning, confidence, or trustworthiness. The paper therefore argues for moving beyond hard labels in evaluation and adopting soft metrics that compare model output with the distribution of human labels, such as cross-entropy and Kullback-Leibler divergence.
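The difference between hard and soft evaluation can be made concrete with a small Python sketch. The distributions below are invented for illustration, and the clipping constant `eps` is an implementation assumption to avoid taking the logarithm of zero.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far model distribution q is from human distribution p."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = H(p) + KL(p || q), with p the human label distribution."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def argmax(dist):
    """Index of the highest-probability class."""
    return max(range(len(dist)), key=lambda i: dist[i])


# Illustrative distributions over three classes (not data from the paper).
human = [0.6, 0.4, 0.0]          # 60% of annotators chose class 0
model_a = [0.98, 0.01, 0.01]     # overconfident in the majority class
model_b = [0.60, 0.30, 0.10]     # close to the human distribution

# Both models are "correct" under hard-label accuracy...
same_hard_prediction = argmax(model_a) == argmax(model_b) == argmax(human)

# ...but the soft metric separates them.
kl_a = kl_divergence(human, model_a)
kl_b = kl_divergence(human, model_b)
```

Both models predict the majority class and would score identically on accuracy, yet the KL divergence for `model_a` is roughly ten times that of `model_b`, because `model_a` assigns almost no probability to a label that 40% of annotators chose.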

The author also highlights the gap between in-vitro (laboratory) and in-vivo (real-world) performance, and argues that soft evaluation metrics are better placed to reflect it. Standard evaluation practice should therefore be broadened to include metrics that account for HLV.

Conclusions

The paper takes a forward-looking view of HLV, framing it as an opportunity to innovate across data collection, modeling, and evaluation in ML pipelines. By calling for an open, interdisciplinary discussion and for treating all three stages jointly, the author aims to lay the groundwork for more inclusive and trustworthy AI systems. The paper's contributions, including a repository of datasets [https://github.com/mainlp/awesome-human-label-variation], are intended to start that discussion and to encourage broader adoption of practices that acknowledge variation in human labels.

In conclusion, recognizing HLV and integrating it into the ML pipeline can improve model robustness and calibration and, ultimately, applicability in diverse real-world scenarios. As the field progresses, continued work on HLV-aware methods could change how AI systems handle the subjective and ambiguous aspects of human judgment.
