Practice of Efficient Data Collection via Crowdsourcing at Large-Scale (1912.04444v1)

Published 10 Dec 2019 in cs.HC

Abstract: Modern machine learning algorithms need large datasets for training. Crowdsourcing has become a popular way to label large datasets faster and at lower cost than a limited number of experts would require. However, because crowdsourcing performers are non-professional and vary in expertise, their labels are much noisier than those obtained from experts. Collecting good-quality data within a limited budget therefore requires special techniques such as incremental relabelling, aggregation, and pricing. We introduce data labelling via public crowdsourcing marketplaces and present the key components of efficient label collection. We show how to choose a real label collection task, experiment with settings for the labelling process, and launch a label collection project on Yandex.Toloka, one of the largest crowdsourcing marketplaces. The projects will be run on real crowds. We also present the main algorithms for aggregation, incremental relabelling, and pricing in crowdsourcing. In particular, we first discuss how to connect these three components to build an efficient label collection process, and then share rich industrial experience of applying these algorithms and constructing large-scale label collection pipelines, emphasizing best practices and common pitfalls.
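
As a rough illustration of how aggregation and incremental relabelling can fit together, the sketch below aggregates one item's labels by simple majority vote and asks for another label only while the vote is not yet decisive. It is not taken from the paper: majority vote stands in for whatever aggregation algorithm is actually used, and the function names and the `min_labels`, `max_labels`, and `confidence` thresholds are hypothetical values chosen for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning label, its share of the votes) for one item."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def needs_more_labels(labels, min_labels=3, max_labels=5, confidence=0.8):
    """Hypothetical stopping rule for incremental relabelling.

    Keep requesting labels until the majority share reaches `confidence`
    or the per-item cap `max_labels` is hit. All thresholds are assumptions.
    """
    if len(labels) < min_labels:
        return True   # too few labels to judge yet
    if len(labels) >= max_labels:
        return False  # per-item budget exhausted
    _, share = majority_vote(labels)
    return share < confidence


# An item with a split vote gets another label; a unanimous one does not.
print(needs_more_labels(["cat", "cat", "dog"]))
print(needs_more_labels(["cat", "cat", "cat"]))
```

Capping the number of labels per item is what ties the stopping rule to the budget: easy items stop at `min_labels`, and only contested items consume extra labels.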

Authors (6)
  1. Alexey Drutsa (9 papers)
  2. Viktoriya Farafonova (1 paper)
  3. Valentina Fedorova (7 papers)
  4. Olga Megorskaya (1 paper)
  5. Evfrosiniya Zerminova (1 paper)
  6. Olga Zhilinskaya (1 paper)
Citations (11)