Abstract

Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years. A common crowdsourcing practice is to recruit a small number of high-quality workers, and have them massively generate examples. Having only a few workers generate the majority of examples raises concerns about data diversity, especially when workers freely generate sentences. In this paper, we perform a series of experiments showing these concerns are evident in three recent NLP datasets. We show that model performance improves when training with annotator identifiers as features, and that models are able to recognize the most productive annotators. Moreover, we show that often models do not generalize well to examples from annotators that did not contribute to the training set. Our findings suggest that annotator bias should be monitored during dataset creation, and that test set annotators should be disjoint from training set annotators.

Figure: Annotator recognition F1-scores for the top annotators in each dataset (one annotator's data point is missing for OpenBookQA).

Overview

  • The paper investigates the presence and impact of annotator bias in Natural Language Understanding (NLU) datasets, revealing that models perform better when annotator-specific information is included, suggesting significant annotator biases.

  • Experiments show that models can identify annotators from the text alone and struggle to generalize well to new, unseen annotators, indicating a need for more diverse data collection practices.

  • To mitigate annotator bias, the authors recommend monitoring bias during data collection and using disjoint annotator sets for training and testing; they also point to future work on improved crowdsourcing methodologies and on models that are robust to annotator bias.

An Investigation of Annotator Bias in Natural Language Understanding Datasets

The paper by Geva, Goldberg, and Berant examines the prevalence and impact of annotator bias in Natural Language Understanding (NLU) datasets. This study is particularly relevant given the reliance on crowdsourcing for generating large-scale datasets, which are fundamental for advancements in NLU models. The primary concern raised in this investigation is the potential for data diversity issues and generalization limitations due to a small number of annotators providing the majority of dataset annotations.

In this work, the authors conducted extensive experiments on three recent NLU datasets: MNLI, OpenBookQA, and CommonsenseQA. The study aimed to determine whether models benefit from annotator-specific information, whether models can identify annotators based solely on the examples, and how well models generalize to examples from unseen annotators.

Key Findings

  1. Annotator Information Utility: The first experiment established that model performance improves when annotator IDs are incorporated as features alongside the text input. The improvement held across all three datasets: adding annotator IDs increased accuracy on OpenBookQA (4.2%), CommonsenseQA (1.7%), and MNLI (1.6%). These results indicate that models gain predictive power from annotator-specific information, suggesting annotator biases that models can exploit (a minimal sketch of this setup appears after this list).

  2. Annotator Recognition: The second experiment demonstrated that models can infer annotator identities from the input text alone, without explicit annotator identifiers. This ability was particularly pronounced for annotators who contributed many examples: recognition F1-scores varied widely but correlated strongly with the number of examples an annotator provided. For instance, the top contributors in CommonsenseQA had F1-scores between 0.76 and 0.91, indicating that models detect and leverage annotator-specific language patterns (the second sketch after this list illustrates this kind of probe).

  3. Generalization Across Annotators: The most critical component of the investigation examined how well models generalize to examples from annotators not seen during training. Using annotator-disjoint splits of the datasets, the authors found notable performance drops in some cases. For OpenBookQA, the drop was striking: as much as 23 accuracy points on unseen annotators. This suggests that models trained on data from a few annotators fail to generalize to new annotators, highlighting the need for diverse training data.

  4. Annotator Bias vs. Example Difficulty: To disentangle annotator bias from the inherent difficulty of the examples each annotator writes, the authors ran augmentation experiments in which examples from the held-out annotators (taken from the development set) were gradually introduced into the training set. Performance improved substantially after exposure to only a small number of examples from each new annotator, indicating that the generalization gap stems from annotator bias rather than from those annotators producing inherently harder examples.
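
To make the first experiment's setup concrete, here is a minimal sketch of one way to expose annotator identity to a model: prepend a dedicated special token per annotator to each example's text before fine-tuning a standard text classifier. The worker IDs, field names, and model checkpoint are illustrative assumptions, not the paper's exact implementation.

```python
"""
Sketch: expose the annotator ID to the model by prepending a per-annotator
special token to the input text. Worker IDs and the checkpoint name are
illustrative; the paper's exact implementation may differ.
"""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# One new special token per annotator observed in the training data.
worker_ids = ["worker_001", "worker_002", "worker_003"]
tokenizer.add_special_tokens(
    {"additional_special_tokens": [f"[ANNOT_{w}]" for w in worker_ids]}
)
# Note: after adding tokens, the model's embedding matrix must be resized,
# e.g. model.resize_token_embeddings(len(tokenizer)).

def encode(example):
    # Prepend the annotator token so the classifier can condition on it.
    text = f"[ANNOT_{example['worker_id']}] {example['text']}"
    return tokenizer(text, truncation=True, max_length=128)

batch = encode({"worker_id": "worker_002",
                "text": "Where would you most likely find a seashell?"})
print(tokenizer.convert_ids_to_tokens(batch["input_ids"]))
```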
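A second sketch illustrates the annotator-recognition probe from the second finding: train a classifier to predict which annotator wrote each example from the text alone, then report per-annotator F1. The bag-of-words model and tiny synthetic corpus below are toy stand-ins; the paper fine-tunes a pretrained model on the real datasets.

```python
"""
Sketch: probe annotator recognition by predicting the annotator from the
text alone and reporting per-annotator F1. The data and the bag-of-words
classifier are toy stand-ins for illustration.
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus: each "annotator" has a characteristic writing style.
texts = [
    "What would you use to cut a piece of paper?",
    "What would you use to open a locked door?",
    "Cold food is usually kept in the",
    "A person who wants light will flip the",
    "Which of these objects is a mammal?",
    "Which of these objects conducts electricity?",
] * 20
workers = ["w1", "w1", "w2", "w2", "w3", "w3"] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts, workers, test_size=0.25, random_state=0, stratify=workers)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Per-annotator F1, analogous to the scores reported for top contributors.
labels = sorted(set(workers))
scores = f1_score(y_test, clf.predict(X_test), average=None, labels=labels)
print(dict(zip(labels, scores.round(2))))
```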

Implications

The evidence from this study points to a crucial oversight in current dataset creation practices in NLU. Specifically, the reliance on a limited number of annotators introduces substantial biases into datasets, which consequently affects model performance and generalization. The practical implication is that ongoing and future work on dataset creation should incorporate mechanisms to diversify annotator contributions.

To address the identified issues, the authors propose two critical recommendations:

  • Monitoring Annotator Bias: During data collection, it is essential to systematically evaluate model performance on new annotators to detect and mitigate biases early.
  • Disjoint Annotator Sets for Training and Test: Constructing disjoint sets of annotators for training and testing datasets can help prevent models from exploiting annotator-specific cues, thereby fostering better generalization capabilities.
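
As a concrete illustration of the second recommendation, the sketch below builds an annotator-disjoint split by grouping examples on their annotator ID, so that no annotator contributes to both sides. The column names and the scikit-learn-based setup are assumptions for illustration, not the authors' pipeline.

```python
"""
Sketch: annotator-disjoint train/test split. Grouping on the annotator ID
guarantees that no annotator appears on both sides of the split.
Column names are assumptions, not the datasets' actual field names.
"""
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "text":      ["q1", "q2", "q3", "q4", "q5", "q6"],
    "label":     [0, 1, 0, 1, 1, 0],
    "worker_id": ["w1", "w1", "w2", "w2", "w3", "w3"],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["worker_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: the two sides share no annotators.
assert set(train["worker_id"]).isdisjoint(set(test["worker_id"]))
print(sorted(set(train["worker_id"])), sorted(set(test["worker_id"])))
```

In the paper's setting, the grouping key would be the crowd-worker identifier recorded for each example during data collection.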

Future Directions

This investigation opens several avenues for future research. One direction is refining crowdsourcing methodologies to balance the quality and diversity of annotations. Another is developing models that are robust to annotator bias, for example by borrowing techniques from domain adaptation or adversarial training. Additionally, extending this analysis to other NLU tasks and to multilingual datasets could shed light on how universal annotator bias and its effects are.

In conclusion, the study by Geva, Goldberg, and Berant significantly contributes to our understanding of annotator bias in NLU datasets, suggesting actionable strategies to enhance dataset quality and model robustness. Addressing these biases is crucial for developing models that not only perform well in controlled settings but also generalize effectively to real-world usage scenarios.
