Quality and Efficiency of Manual Annotation: Pre-annotation Bias (2306.09307v1)
Abstract: This paper analyzes the use of automatic pre-annotation in a task of mid-level annotation complexity: dependency syntax annotation. It compares the effort of annotators working from a version pre-annotated by a high-accuracy parser with the effort of fully manual annotation. The aim of the experiment is to judge the final annotation quality when pre-annotation is used. In addition, it evaluates how automatic, linguistically based (rule-formulated) checks and the availability of another annotation of the same data to the annotators influence annotation quality and efficiency. The experiment confirmed that pre-annotation is an efficient tool that speeds up manual syntactic annotation and increases the consistency of the resulting annotation without reducing its quality.
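As a rough illustration of how two dependency annotations of the same sentence might be compared, the sketch below computes the share of tokens on which they agree on the head (unlabeled) and on both head and label (labeled). It is a minimal sketch under stated assumptions, not the paper's evaluation procedure: the function name `attachment_agreement`, the `(head, label)` token representation, and the sample labels are all hypothetical, and the abstract does not say which measures were actually used.

```python
from typing import List, Tuple

# One token's annotation: (index of its head, dependency label).
# Head index 0 stands for an artificial root. This representation is an
# assumption made for the example, not the paper's data format.
Arc = Tuple[int, str]


def attachment_agreement(a: List[Arc], b: List[Arc]) -> Tuple[float, float]:
    """Return (unlabeled, labeled) agreement between two annotations.

    Unlabeled agreement counts tokens with the same head; labeled
    agreement counts tokens with the same head and the same label.
    """
    if len(a) != len(b):
        raise ValueError("annotations must cover the same tokens")
    if not a:
        raise ValueError("empty annotation")
    same_head = sum(1 for (ha, _), (hb, _) in zip(a, b) if ha == hb)
    same_both = sum(1 for x, y in zip(a, b) if x == y)
    n = len(a)
    return same_head / n, same_both / n


# Hypothetical four-token sentence annotated twice (labels are illustrative).
manual = [(2, "Sb"), (0, "Pred"), (4, "Atr"), (2, "Obj")]
preannotated = [(2, "Sb"), (0, "Pred"), (2, "Atr"), (2, "Adv")]

unlabeled, labeled = attachment_agreement(manual, preannotated)
print(f"unlabeled={unlabeled:.2f} labeled={labeled:.2f}")
```

Raw agreement of this kind is not corrected for chance, so it is only a starting point for comparing annotators or annotation conditions.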