Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets? (2309.01669v2)
Abstract: Instruction tuning has become an integral part of training pipelines for LLMs and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold-standard labels. So far, however, the application of AED methods has been limited to classification tasks, and it remains an open question how well they generalize to the language generation settings that LLMs have made widespread. In this paper, we present DONKII, the first benchmark for AED on instruction-tuning data. It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods, together with a novel taxonomy of error types for instruction-tuning data. We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them extensively on the new benchmark. Our results show that the choice of AED method and model size is crucial, and we derive practical recommendations for using AED methods to clean instruction-tuning data.
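The abstract does not spell out how the generative AED baselines work, but a common family of such methods ranks examples by how surprising the target output is under a language model, flagging the most surprising ones for review. The sketch below is an illustrative stand-in, not the paper's actual method: it replaces the LLM with a self-contained unigram model (add-one smoothing) estimated from the responses themselves, and the `token_loss_scores` name and `dataset` format are assumptions for this example.

```python
import math
from collections import Counter

def token_loss_scores(dataset):
    """Rank (instruction, response) pairs by the average negative
    log-likelihood of the response under a unigram model fit on all
    responses. Higher score = more surprising = more suspicious.
    A loss-based AED method would use an LLM's token loss instead."""
    counts, total = Counter(), 0
    for _, response in dataset:
        toks = response.split()
        counts.update(toks)
        total += len(toks)
    vocab = len(counts)

    def nll(tok):
        # add-one smoothing keeps unseen tokens at a finite loss
        return -math.log((counts[tok] + 1) / (total + vocab))

    scores = []
    for i, (_, response) in enumerate(dataset):
        toks = response.split()
        avg = sum(nll(t) for t in toks) / max(len(toks), 1)
        scores.append((avg, i))
    return sorted(scores, reverse=True)  # most suspicious first

# Toy usage: the garbled response surfaces at the top of the ranking.
data = [
    ("Name a pet.", "the cat sat"),
    ("Name a pet.", "the dog sat"),
    ("Name a pet.", "the cat ran"),
    ("Name a pet.", "zxqv glorp"),  # likely annotation error
]
ranking = token_loss_scores(data)
```

In practice the scoring model matters a great deal, which is one reason the paper reports that AED method and model size must be chosen carefully.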