Quality and Efficiency of Manual Annotation: Pre-annotation Bias (2306.09307v1)

Published 15 Jun 2023 in cs.CL

Abstract: This paper presents an analysis of manual annotation supported by automatic pre-annotation for a task of mid-level annotation complexity: dependency syntax annotation. It compares the effort of annotators working from a version pre-annotated by a high-accuracy parser with the effort of fully manual annotation, with the aim of judging the final annotation quality when pre-annotation is used. In addition, it evaluates the effect of automatic, linguistically based (rule-formulated) checks and of making another annotation of the same data available to the annotators, and their influence on annotation quality and efficiency. The experiment confirmed that pre-annotation is an efficient tool for faster manual syntactic annotation that increases the consistency of the resulting annotation without reducing its quality.
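The comparisons described in the abstract hinge on measuring how consistently two annotations of the same sentences agree. As a minimal illustration (not the paper's actual evaluation pipeline, which is based on the Prague Dependency Treebank tooling), the sketch below computes two standard quantities for a toy sentence: raw head-attachment agreement between two annotators, and chance-corrected agreement (Cohen's kappa) on dependency relation labels. All data values are invented for demonstration.

```python
def attachment_agreement(heads_a, heads_b):
    """Fraction of tokens for which two annotators chose the same head."""
    assert len(heads_a) == len(heads_b)
    same = sum(1 for a, b in zip(heads_a, heads_b) if a == b)
    return same / len(heads_a)

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement (Cohen, 1960) on relation labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Expected agreement if each annotator labelled independently,
    # following their own marginal label distribution.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Toy example: head indices (0 = root) and relation labels for a
# five-token sentence, as produced by two hypothetical annotators.
heads_1 = [2, 0, 2, 5, 3]
heads_2 = [2, 0, 2, 5, 2]
labels_1 = ["Atr", "Pred", "Obj", "Atr", "Adv"]
labels_2 = ["Atr", "Pred", "Obj", "Atr", "Obj"]

print(attachment_agreement(heads_1, heads_2))        # → 0.8
print(round(cohen_kappa(labels_1, labels_2), 3))     # → 0.722
```

Kappa is lower than raw agreement because it discounts matches expected by chance from the annotators' label distributions; this is why chance-corrected measures are preferred when comparing annotation setups with different label inventories.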
