Multilingual and Multi-Aspect Hate Speech Analysis (1908.11049v1)

Published 29 Aug 2019 in cs.CL

Abstract: Current research on hate speech analysis is typically oriented towards monolingual and single classification tasks. In this paper, we present a new multilingual multi-aspect hate speech analysis dataset and use it to test the current state-of-the-art multilingual multitask learning approaches. We evaluate our dataset in various classification settings, then we discuss how to leverage our annotations in order to improve hate speech detection and classification in general.

Authors (5)
  1. Nedjma Ousidhoum (17 papers)
  2. Zizheng Lin (6 papers)
  3. Hongming Zhang (111 papers)
  4. Yangqiu Song (196 papers)
  5. Dit-Yan Yeung (78 papers)
Citations (264)

Summary

  • The paper introduces a novel multilingual dataset of 13,000 tweets annotated across five hate speech aspects, enabling nuanced cultural and linguistic analysis.
  • The study applies machine learning models, including logistic regression, bidirectional LSTMs, and Sluice networks, to manage multilingual multitask challenges.
  • Experimental results indicate that single-task, single-language models perform best on the binary directness task, while multilingual multitask approaches offer marginal gains in multilabel classification.

Multilingual and Multi-Aspect Hate Speech Analysis

The paper "Multilingual and Multi-Aspect Hate Speech Analysis," authored by Nedjma Ousidhoum et al., addresses the growing challenge of detecting and classifying hate speech across different languages on social media platforms. The research introduces a novel dataset that captures hate speech spanning English, French, and Arabic, considering multiple aspects of hate speech analysis. This approach diverges from prior research that predominantly focuses on monolingual and single-aspect classification tasks.

Research Overview

The researchers present a comprehensive dataset comprising 13,000 tweets annotated for five key aspects: the directness of speech, hostility type, target attribute, target group, and the annotator's sentiment. This multi-aspect, multilingual dataset aims to enhance the granularity and cultural relevance of hate speech analysis.

One of the critical aspects of their methodology is the annotation schema, which encompasses various dimensions of hate speech. This schema captures linguistic nuances and annotator reactions, emphasizing aspects such as whether tweets are direct or indirect, various hostility types (e.g., abusive, disrespectful, fearful), the basis of discrimination (e.g., origin, gender), and how annotators emotionally react to these tweets.
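As a rough illustration of how one annotated example under this schema could be represented, here is a minimal Python sketch. The class name, field names, the choice of which aspects are multilabel, and the label vocabulary are assumptions made for illustration; the vocabulary contains only the hostility types named above and is not the dataset's full label set.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical vocabulary: only the hostility types mentioned in the text above.
HOSTILITY_LABELS = ["abusive", "disrespectful", "fearful"]


@dataclass
class AnnotatedTweet:
    """One tweet with the five annotated aspects (field names are illustrative)."""
    text: str
    language: str                   # "en", "fr", or "ar"
    directness: str                 # "direct" or "indirect"
    hostility: List[str]            # assumed multilabel
    target_attribute: str           # basis of discrimination, e.g. "origin", "gender"
    target_group: str
    annotator_sentiment: List[str]  # assumed multilabel


def multi_hot(labels: List[str], vocabulary: List[str]) -> List[int]:
    """Encode a set of labels as a 0/1 vector aligned with the vocabulary order."""
    return [1 if name in labels else 0 for name in vocabulary]


# A made-up placeholder record, not taken from the dataset.
example = AnnotatedTweet(
    text="<tweet text>",
    language="en",
    directness="indirect",
    hostility=["fearful", "abusive"],
    target_attribute="origin",
    target_group="<target group>",
    annotator_sentiment=["<sentiment label>"],
)

hostility_vector = multi_hot(example.hostility, HOSTILITY_LABELS)
```

Keeping each aspect as a separate field makes it straightforward to treat every aspect as its own classification task, which is the framing multitask models rely on.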

Experimental Setup and Results

The research compares traditional models such as logistic regression with deep learning approaches, specifically bidirectional LSTMs and Sluice networks, which support multitask and multilingual learning. The findings suggest that multilingual multitask models are not universally superior, but show potential where loosely related tasks can benefit from shared parameters.

The experimental results indicate that single-task, single-language models outperform the alternatives on the binary classification of directness, a result attributed to the task's simpler nature and balanced label distribution. However, multitask multilingual settings provide marginal improvements on multilabel tasks, hinting at the possibility of leveraging shared linguistic and semantic features across languages.
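The distinction between the binary and multilabel settings also affects how predictions are scored. The sketch below assumes F1-style metrics, which this summary does not specify; the function names and toy inputs are hypothetical and do not reproduce any result from the paper.

```python
from typing import List, Sequence


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_f1(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """F1 of the positive class for a single binary task (e.g. directness)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    return f1_from_counts(tp, fp, fn)


def micro_f1(gold_rows: List[List[int]], pred_rows: List[List[int]]) -> float:
    """Pool TP/FP/FN over every (example, label) cell of multi-hot vectors."""
    tp = fp = fn = 0
    for g_row, p_row in zip(gold_rows, pred_rows):
        for g, p in zip(g_row, p_row):
            if g == 1 and p == 1:
                tp += 1
            elif g == 0 and p == 1:
                fp += 1
            elif g == 1 and p == 0:
                fn += 1
    return f1_from_counts(tp, fp, fn)


def macro_f1(gold_rows: List[List[int]], pred_rows: List[List[int]]) -> float:
    """Average the per-label F1 scores so that rare labels count equally."""
    n_labels = len(gold_rows[0])
    per_label = [
        binary_f1([row[j] for row in gold_rows], [row[j] for row in pred_rows])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy inputs for illustration only.
directness_score = binary_f1([1, 0, 1, 1], [1, 1, 0, 1])
gold = [[1, 0, 1], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 1]]
micro_score = micro_f1(gold, pred)
macro_score = macro_f1(gold, pred)
```

Micro-averaging is dominated by frequent labels, while macro-averaging weights every label equally, so reporting both can show whether a model's gains come mostly from the common classes.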

Implications and Future Directions

The paper demonstrates the complexity inherent in hate speech detection due to linguistic and cultural variations. The dataset and findings encourage further exploration of more sophisticated multilingual multitask models to address these complexities. Future work might focus on enhancing multilingual embeddings, incorporating multimodal data, or using transfer learning techniques to capture subtleties that single-language models may overlook.

Furthermore, the inclusion of annotator sentiment in the analysis provides a novel dimension, offering insights into human emotional responses to hate speech, which could enhance contextual understanding in automatic detection systems.

In sum, this research significantly contributes to the field by providing a robust resource for multilingual hate speech analysis and setting a foundation for future innovations in handling linguistic diversity and nuanced hate speech classification. Its findings underline the importance of considering multiple languages and aspects in developing comprehensive hate speech identification systems.