ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks (2303.15056v2)

Published 27 Mar 2023 in cs.CL and cs.CY

Abstract: Many NLP applications require manual data annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd-workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using a sample of 2,382 tweets, we demonstrate that ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frames detection. Specifically, the zero-shot accuracy of ChatGPT exceeds that of crowd-workers for four out of five tasks, while ChatGPT's intercoder agreement exceeds that of both crowd-workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003 -- about twenty times cheaper than MTurk. These results show the potential of LLMs to drastically increase the efficiency of text classification.

Summary

The paper reports that ChatGPT's zero-shot accuracy on text-annotation tasks exceeds that of MTurk crowd-workers by about 25 percentage points on average.
The study employs four diverse datasets to compare ChatGPT’s performance with trained research assistants and crowd-workers across multiple annotation types.
A cost analysis finds that ChatGPT is about thirty times cheaper per annotation than MTurk, making it a compelling alternative to traditional crowdsourcing.

ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks

The paper "ChatGPT outperforms crowd-workers for text-annotation tasks" presents a thorough evaluation of ChatGPT as a tool for performing various text annotation tasks. The research compares the performance of ChatGPT against that of human annotators, both trained research assistants and crowd-workers on platforms like Amazon Mechanical Turk (MTurk). The paper provides compelling evidence in favor of utilizing LLMs for tasks traditionally dependent on human annotation.

The paper evaluates ChatGPT on four datasets of tweets and news articles, 6,183 entries in total. The annotation tasks are relevance, stance, topic identification, and frame detection. The central measure is ChatGPT's zero-shot classification performance, that is, its accuracy without any additional task-specific training. On this measure ChatGPT surpasses MTurk crowd-workers by approximately 25 percentage points on average.
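The accuracy comparison described above can be sketched in a few lines. This is a minimal illustration, assuming accuracy is the share of items whose label matches a gold-standard label; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of items whose predicted label matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


def percentage_point_gap(acc_a: float, acc_b: float) -> float:
    """Difference between two accuracies, expressed in percentage points."""
    return (acc_a - acc_b) * 100


# Toy labels invented for illustration only (not data from the paper).
gold = ["relevant", "irrelevant", "relevant", "relevant"]
model_labels = ["relevant", "irrelevant", "relevant", "irrelevant"]
crowd_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model_acc = accuracy(model_labels, gold)  # 3 of 4 items match
crowd_acc = accuracy(crowd_labels, gold)  # 1 of 4 items match
gap = percentage_point_gap(model_acc, crowd_acc)

print(f"model: {model_acc:.2f}, crowd: {crowd_acc:.2f}, gap: {gap:.1f} points")
```

A gap in percentage points is the plain difference between two accuracy rates, not a relative improvement, which is why the sketch subtracts before scaling by 100.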

Key Findings

Accuracy and Agreement: ChatGPT achieves higher accuracy than MTurk crowd-workers and higher intercoder agreement than both crowd-workers and trained annotators. Across the datasets, ChatGPT's accuracy exceeds that of MTurk by about 25 percentage points on average. Its intercoder agreement reaches up to 97% under certain configurations.
Cost Efficiency: The paper highlights the economic advantage of ChatGPT, whose per-annotation cost is less than $0.003. That is approximately thirty times cheaper than crowd-sourced services like MTurk, making ChatGPT a viable option for large-scale annotation tasks.
Consistency of Performance: Across configurations, including different settings of the temperature parameter, ChatGPT's annotations remained consistent and reliable, suggesting practical applicability across different contexts.
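The agreement and cost figures in these findings can be illustrated with a short sketch. It assumes intercoder agreement is measured as the percentage of items that two annotators (or two model runs) label identically, which this summary does not spell out. The toy labels are invented, and the MTurk figure is only what the quoted cost ceiling and thirty-fold ratio imply, not a number stated here.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Percentage of items that two annotators or runs label identically."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100 * same / len(labels_a)


# Two hypothetical runs over the same five items (not data from the paper).
run_1 = ["pro", "con", "neutral", "pro", "con"]
run_2 = ["pro", "con", "neutral", "pro", "pro"]
agreement = percent_agreement(run_1, run_2)  # 4 of 5 items match

# Figures quoted in the summary; the MTurk cost below is derived, not quoted.
CHATGPT_COST_CEILING = 0.003  # USD per annotation (upper bound)
COST_RATIO = 30               # "approximately thirty times cheaper"
implied_mturk_cost = CHATGPT_COST_CEILING * COST_RATIO

print(f"agreement: {agreement:.1f}%")
print(f"implied MTurk cost per annotation: about ${implied_mturk_cost:.2f}")
```

Simple percent agreement does not correct for agreement expected by chance; a chance-corrected statistic would be a different design choice.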

Implications and Future Directions

The implications of these findings are significant for both the academic and commercial spheres, as they suggest a paradigm shift in how text annotations can be performed. The paper emphasizes the potential of LLMs to not only enhance efficiency and reduce costs but also to maintain or even improve the quality of text annotations.

The use of ChatGPT in multilingual contexts, particularly in domains requiring nuanced understanding, remains an area ripe for exploration. Further research could delve into:

Implementation of few-shot learning for specific domains
Integration of semi-automated labeling systems, enhancing model recommendations based on human input
Comparative analysis of diverse LLMs to ascertain domain-specific advantages
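As a rough illustration of the first direction, the sketch below shows how a zero-shot annotation prompt could be extended into a few-shot one by prepending labeled examples. The prompt layout, the function name `build_prompt`, and the sample texts are all hypothetical; the paper's actual prompts are not reproduced here, and no model API is called.

```python
from typing import Sequence, Tuple


def build_prompt(
    instructions: str,
    text: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble an annotation prompt; with no examples it is zero-shot."""
    parts = [instructions.strip()]
    # Each labeled example becomes one "Text / Label" demonstration.
    for example_text, example_label in examples:
        parts.append(f"Text: {example_text}\nLabel: {example_label}")
    # The item to annotate comes last, with the label left open.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical instructions and examples, for illustration only.
instructions = "Classify the stance of the text as PRO, CON, or NEUTRAL."
domain_examples = [
    ("This policy is long overdue.", "PRO"),
    ("This policy will do more harm than good.", "CON"),
]

zero_shot = build_prompt(instructions, "I have not made up my mind yet.")
few_shot = build_prompt(instructions, "I have not made up my mind yet.", domain_examples)

print(few_shot)
```

Keeping the zero-shot and few-shot prompts in one function makes it easy to compare the two conditions on the same items.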

Conclusion

The paper's results on text annotation mark a notable advance in what LLMs can do in natural language processing. By achieving higher accuracy and agreement at a lower cost, ChatGPT and similar LLMs could transform traditional data annotation methods and challenge crowdsourcing platforms such as MTurk. With continued exploration and validation across varied tasks and languages, LLMs like ChatGPT could become pivotal tools for AI-driven text analysis.
