Towards Model-Based Data Acquisition for Subjective Multi-Task NLP Problems (2312.08198v1)
Abstract: Human-annotated data is a source of knowledge: it captures the peculiarities of a problem and thereby fuels the decision process of the trained model. Unfortunately, the annotation process for subjective NLP problems such as offensiveness or emotion detection is often expensive and time-consuming. One inevitable risk is spending part of the budget and annotator effort on annotations that add no knowledge about the specific task. To minimize these costs, we propose a new model-based approach that selects, individually for each text, which tasks to annotate in a multi-task scenario. Experiments carried out on three datasets, dozens of NLP tasks, and thousands of annotations show that our method reduces the number of annotations by up to 40% with negligible loss of knowledge. The results also emphasize that the amount and diversity of data required to train a model efficiently depends on the subjectivity of the annotation task. We further measured the relations between subjective tasks by evaluating the model in single-task and multi-task scenarios. Moreover, for some datasets, training only on the labels predicted by our model improved the efficiency of task selection, acting as a self-supervised learning regularization technique.
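The abstract describes selecting, per text, which annotation tasks are worth sending to human annotators and which can reuse model-predicted labels. The paper does not spell out the selection criterion here, so the following is only an illustrative sketch under an assumed heuristic: rank tasks by the predictive entropy of a multi-task model and annotate only the most uncertain ones. The function names, the entropy criterion, and the `budget_fraction` parameter are assumptions for illustration, not the authors' actual method.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_tasks_for_annotation(task_probs, budget_fraction=0.6):
    """Per-text task selection sketch: keep the most uncertain tasks
    (highest predictive entropy) for human annotation; the remaining
    tasks reuse the model's predicted labels instead.

    task_probs: dict mapping task name -> predicted class distribution
    budget_fraction: fraction of tasks to send to annotators
    """
    ranked = sorted(task_probs, key=lambda t: entropy(task_probs[t]), reverse=True)
    k = max(1, round(budget_fraction * len(ranked)))
    return set(ranked[:k]), set(ranked[k:])

# Example: three subjective tasks for a single text.
probs = {
    "offensiveness": [0.50, 0.50],  # model unsure -> worth annotating
    "sarcasm":       [0.55, 0.45],  # also fairly uncertain
    "emotion":       [0.98, 0.02],  # model confident -> skip annotation
}
annotate, reuse_model_labels = select_tasks_for_annotation(probs)
# annotate -> {"offensiveness", "sarcasm"}; reuse_model_labels -> {"emotion"}
```

With a 0.6 budget over three tasks, two tasks are annotated and one reuses the model's label, which is the mechanism by which the paper's reported annotation reduction (up to 40%) would be realized under such a scheme.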