AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators

Published 29 Mar 2023 in cs.CL | (2303.16854v2)

Abstract: Many NLP tasks rely on labeled data to train machine learning models with high performance. However, data annotation is time-consuming and expensive, especially when the task involves a large amount of data or requires specialized domains. Recently, GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks. In this paper, we first claim that LLMs, such as GPT-3.5, can serve as an excellent crowdsourced annotator when provided with sufficient guidance and demonstrated examples. Accordingly, we propose AnnoLLM, an annotation system powered by LLMs, which adopts a two-step approach, explain-then-annotate. Concretely, we first prompt LLMs to provide explanations for why the specific ground truth answer/label was assigned for a given example. Then, we construct the few-shot chain-of-thought prompt with the self-generated explanation and employ it to annotate the unlabeled data with LLMs. Our experiment results on three tasks, including user input and keyword relevance assessment, BoolQ, and WiC, demonstrate that AnnoLLM surpasses or performs on par with crowdsourced annotators. Furthermore, we build the first conversation-based information retrieval dataset employing AnnoLLM. This dataset is designed to facilitate the development of retrieval models capable of retrieving pertinent documents for conversational text. Human evaluation has validated the dataset's high quality.


Authors (10)

Citations (159)


Summary

The paper introduces AnnoLLM, a two-step explain-then-annotate framework for LLM-based annotation, which reaches 75.60% accuracy on the QK task test set versus 71.5% for human annotators.
The study finds that explanations generated by GPT-3.5 are consistent enough to yield few-shot chain-of-thought prompts with stable annotation performance.
The paper highlights the role of prompt design, noting that plain few-shot prompting is more sensitive to prompt variations than few-shot CoT prompting.

AnnoLLM: Enhancing LLMs as Crowdsourced Annotators

The paper "AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators" addresses the cost of data annotation in NLP, which is labor-intensive and time-consuming, especially for large datasets or those requiring domain-specific knowledge. The authors explore whether LLMs, specifically the GPT-3.5 series, can serve as effective alternatives to traditional crowdsourced annotators, and propose AnnoLLM, an annotation system built on LLMs.

Methodology

AnnoLLM operates via a two-step approach termed "explain-then-annotate." First, the LLM is shown demonstration examples together with their ground-truth labels and prompted to explain why each label is correct. These self-generated explanations are then used to construct few-shot chain-of-thought (CoT) prompts, with which the LLM annotates unlabeled data. The method mirrors existing human annotation processes, in which annotators are given task definitions, category clarifications, and sample annotations for reference.
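The two steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `Example` type, the `parse_label` helper, and the `llm` callable (any function mapping a prompt string to generated text) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


@dataclass
class Example:
    text: str
    label: str


def build_explain_prompt(task: str, ex: Example) -> str:
    """Step 1 prompt: ask why the known ground-truth label fits the example."""
    return (
        f"{task}\n\nInput: {ex.text}\nLabel: {ex.label}\n"
        "Briefly explain why this label is correct."
    )


def build_cot_prompt(task: str, demos: List[Example],
                     explanations: List[str], query: str) -> str:
    """Step 2 prompt: each demonstration shows its explanation before its label."""
    parts = [task, ""]
    for ex, why in zip(demos, explanations):
        parts += [f"Input: {ex.text}", f"Explanation: {why}",
                  f"Label: {ex.label}", ""]
    parts += [f"Input: {query}", "Explanation:"]
    return "\n".join(parts)


def parse_label(completion: str) -> str:
    """Return the first line after the last 'Label:' marker, or '' if absent."""
    _, sep, tail = completion.rpartition("Label:")
    tail = tail.strip()
    return tail.splitlines()[0].strip() if sep and tail else ""


def explain_then_annotate(llm: LLM, task: str, demos: List[Example],
                          unlabeled: List[str]) -> List[str]:
    # Step 1: the model explains the ground-truth label of each demonstration.
    explanations = [llm(build_explain_prompt(task, ex)).strip() for ex in demos]
    # Step 2: annotate each unlabeled item with the few-shot CoT prompt.
    return [parse_label(llm(build_cot_prompt(task, demos, explanations, text)))
            for text in unlabeled]
```

Keeping the explanation before the label in each demonstration encourages the model to produce its reasoning before committing to a label for the new input, which is the chain-of-thought pattern the summary describes.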

Experimental Validation

The efficacy of AnnoLLM is evaluated on three tasks: user input and keyword relevance assessment (QK), BoolQ (a yes/no question-answering task), and WiC (the Word-in-Context task). These tasks are chosen for their diversity in classification challenges. The results indicate that AnnoLLM outperforms plain few-shot LLM annotation and, in certain cases such as QK, surpasses crowdsourced human annotators. The study also analyzes the stability and consistency of the explanations used to construct CoT prompts, and reports that annotation quality remains robust across different prompts.
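Comparisons of this kind reduce to scoring each annotator's labels against a gold-labeled test set. A minimal sketch of that scoring is shown below; the label lists are toy values invented for illustration and are not data from the paper.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty test set")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy labels, invented for illustration only (not results from the paper).
gold = ["yes", "no", "yes", "no"]
llm_labels = ["yes", "no", "no", "no"]
human_labels = ["yes", "yes", "no", "no"]

print("LLM accuracy:  ", accuracy(llm_labels, gold))
print("Human accuracy:", accuracy(human_labels, gold))
```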

Key Findings

Performance Superiority: AnnoLLM improved annotation accuracy over both zero-shot and few-shot baselines and, on some tasks, matched or exceeded human annotators. For instance, AnnoLLM achieved 75.60% accuracy on the QK task test set, compared to 71.5% for human annotators.
Explanation Consistency: Explanations generated by GPT-3.5 were found to be consistent, yielding stable CoT prompts that improved annotation accuracy.
Sensitivity to Prompts: The study highlights the importance of prompt design, noting that the few-shot approach is more sensitive to prompt variations than the few-shot CoT approach.
Dataset Creation: Beyond data annotation, AnnoLLM was applied to create a conversation-based information retrieval dataset, illustrating its utility in constructing datasets where traditional methods fall short.
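One simple way to quantify the stability and prompt-sensitivity findings above is to score several prompt variants on the same test set and compare the spread of their accuracies. The sketch below assumes that setup; the function name `prompt_spread` and the accuracy values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def prompt_spread(accuracies: Sequence[float]) -> Dict[str, float]:
    """Summarize how much accuracy varies across prompt variants."""
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    return {
        "mean": mean(accuracies),
        "std": pstdev(accuracies),  # population standard deviation
        "range": max(accuracies) - min(accuracies),
    }


# Hypothetical accuracies for three prompt variants each (illustration only).
few_shot = [0.60, 0.70, 0.80]
few_shot_cot = [0.74, 0.75, 0.76]

print("few-shot:    ", prompt_spread(few_shot))
print("few-shot CoT:", prompt_spread(few_shot_cot))
```

A smaller standard deviation or range for one prompting strategy indicates that it is less sensitive to how the prompt is written.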

Implications and Future Directions

The AnnoLLM framework opens avenues for more efficient and scalable dataset annotation, in line with the growing demand for annotated data in the era of deep learning. By automating parts of the annotation workflow, it could yield significant cost and time savings in NLP projects.

For future research, exploring the adaptability of AnnoLLM in other domains, such as multimodal datasets involving text alongside audio, images, or video, could expand its applicability. Additionally, further investigation into refining CoT prompts and exploring diverse model architectures could provide deeper insights into optimizing LLM-based annotation systems.

In conclusion, the AnnoLLM framework is a promising way to apply LLMs to annotation tasks, pointing towards a future where LLMs augment and, on some tasks, may substitute for conventional crowdsourced annotation. The paper contributes to the discussion on operationalizing LLMs in practical NLP applications and offers a reference point for subsequent studies in the field.
