Emergent Mind

Abstract

Hate speech detection models are only as good as the data they are trained on. Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca. 11k examples. During data collection, we explore new strategies for supporting annotators to create more diverse adversarial examples more efficiently, and we provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness. Further, we find that mixing multiple support strategies is most advantageous. We make GAHD publicly available at https://github.com/jagol/gahd.

Figure: Workflow showing annotators validating translations of adversarial English examples in the DADC process (Round 2).

Overview

  • The research introduces the German Adversarial Hate Speech Dataset (GAHD) aimed at improving hate speech detection by enhancing the diversity and efficiency of adversarial examples.

  • GAHD incorporates a dynamic adversarial data collection (DADC) process over four rounds, each applying different strategies to support annotators in generating or identifying adversarial examples.

  • The dataset contains around 11,000 examples, balanced between hate speech and non-hate speech, with a focus on the German cultural context and inclusivity of marginalized groups.

  • Model evaluations showed that GAHD remains challenging even for commercial APIs and LLMs, and that training on the dataset significantly improves the robustness of hate speech detection models.

Improving Adversarial Data Collection for German Hate Speech Detection

Introduction

Detecting hate speech is a critical aspect of maintaining the safety and integrity of online spaces. Traditional datasets, derived from social media or comments sections, often contain biases that result in models lacking robustness and generalizability. This research introduces the German Adversarial Hate speech Dataset (GAHD), focusing on enhancing the diversity and efficiency of adversarial examples through unique strategies supporting annotators.

Dataset Creation and Annotation

GAHD's creation involved a dynamic adversarial data collection (DADC) process across four rounds, each employing a distinct strategy to aid annotators in crafting or identifying adversarial examples. The dataset encompasses approximately 11,000 examples, with a balanced representation of hate speech and non-hate speech categories. Notably, the annotation process included a detailed definition of hate speech tailored to the German context, emphasizing cultural nuances and inclusive of marginalized groups.

Strategies for Adversarial Data Collection

  • Unguided Example Generation: The initial round allowed annotators to freely generate examples, fostering creativity but also revealing challenges in consistently applying hate speech definitions.
  • Translation and Validation: The second round leveraged adversarial examples translated from English datasets, with annotators validating the translations and their labels.
  • Mining Newspaper Sentences: The third round drew on sentences from German newspapers presumed to be benign but flagged by the model as hate speech, providing a rich source of potential adversarial instances.
  • Contrastive Example Creation: The final round focused on generating examples expressly designed to challenge the model's predictions, refining the dataset's ability to test and enhance model robustness.
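Across all of these strategies, a candidate only counts as adversarial if the target model's prediction disagrees with the annotator's gold label. A minimal sketch of that filtering step, with a toy keyword model standing in for a real classifier (both the helper name and the toy model are illustrative, not from the paper):

```python
# Filter candidate examples down to adversarial ones: those the current
# target model misclassifies relative to the annotator's gold label.

def is_adversarial(model, text: str, gold_label: str) -> bool:
    """Keep a candidate only if the model's prediction disagrees with the gold label."""
    return model(text) != gold_label

# Toy stand-in for a hate speech classifier: flags any text containing "hass".
toy_model = lambda text: "hate" if "hass" in text.lower() else "not-hate"

candidates = [
    ("Ein harmloser Satz.", "not-hate"),             # model agrees -> discarded
    ("Ein harmloser Satz ueber Hass.", "not-hate"),  # model errs  -> kept
]
adversarial = [(t, y) for t, y in candidates if is_adversarial(toy_model, t, y)]
```

In a real round, `model` would be the fine-tuned target classifier from the previous round, and discarded candidates can still be kept as non-adversarial training data.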

Dynamic Adversarial Data Collection Process

The iterative nature of DADC ensured continuous refinement of the target model, with each round incorporating newly collected adversarial examples into the training data. This method not only improved the dataset's quality but also allowed for an examination of how different annotation support strategies affect the efficiency and diversity of generated examples.
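The round structure above can be sketched as a simple loop: collect candidates under a support strategy, keep the ones that fool the current model, retrain, and repeat. The `train` and `collect_candidates` callables below are hypothetical stand-ins (here a memorizing toy classifier and a fixed candidate pool), not the paper's actual setup:

```python
# Minimal sketch of a dynamic adversarial data collection (DADC) loop.
# Each round: gather candidates under a support strategy, keep those the
# current model misclassifies, then retrain on the grown dataset.

def run_dadc(rounds, train, collect_candidates, seed_data):
    data = list(seed_data)
    model = train(data)
    kept_per_round = []
    for strategy in rounds:
        candidates = collect_candidates(strategy, model)
        adversarial = [(t, y) for t, y in candidates if model(t) != y]
        data.extend(adversarial)
        model = train(data)  # refine the target model for the next round
        kept_per_round.append(len(adversarial))
    return model, data, kept_per_round

# Toy "training": memorize seen (text, label) pairs, default to "not-hate".
def toy_train(data):
    memory = dict(data)
    return lambda text: memory.get(text, "not-hate")

# Simulated annotator output per support strategy.
def toy_collect(strategy, model):
    pool = {
        "unguided":    [("satz a", "hate"), ("satz b", "not-hate")],
        "translation": [("satz a", "hate"), ("satz c", "hate")],
    }
    return pool[strategy]

model, data, kept_per_round = run_dadc(
    ["unguided", "translation"], toy_train, toy_collect, seed_data=[]
)
```

Note how "satz a" fools the model in round one but not in round two, after retraining: this is the refinement effect that forces annotators toward ever harder examples.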

Model Evaluations and Benchmarks

GAHD presented a significant challenge to state-of-the-art hate speech detection models, including commercial APIs and LLMs. Notably, training models on GAHD resulted in substantial improvements in robustness, as evidenced by performance on both in-domain and out-of-domain test sets. The analysis also highlighted the varying effectiveness of adversarial examples generated through different support strategies, underscoring the value of mixing multiple strategies to produce a more resilient and comprehensive dataset.
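Benchmarks like this are typically scored with macro-averaged F1, which weights the hate and non-hate classes equally regardless of imbalance. A self-contained sketch of that metric (the tiny gold/prediction lists are placeholders, not GAHD results):

```python
# Macro-F1: compute per-class F1, then average across classes.

def macro_f1(gold, pred, labels=("hate", "not-hate")):
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

gold = ["hate", "hate", "not-hate", "not-hate"]
pred = ["hate", "not-hate", "not-hate", "not-hate"]
score = macro_f1(gold, pred)
```

In practice one would use an established implementation such as scikit-learn's `f1_score` with `average="macro"`; the hand-rolled version just makes the computation explicit.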

Implications and Future Directions

The research demonstrates the viability and benefit of employing diversified strategies in adversarial data collection to improve hate speech detection models. By supporting annotators in generating more diverse and challenging examples, the resulting dataset offers a robust resource for training and evaluating hate speech detection models. Future work could explore additional methods for annotator support, including leveraging LLMs for augmentations and perturbations, to further enhance dataset diversity and model performance.

Conclusion

GAHD marks a significant advancement in the collection of adversarial data for hate speech detection, emphasizing the importance of diverse and efficient example generation. The strategies outlined in this paper not only contribute to the development of more robust models but also offer insights into optimizing the adversarial data collection process for future research.
