Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
98 tokens/sec
GPT-4o
8 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

AtteSTNet -- An attention and subword tokenization based approach for code-switched text hate speech detection (2112.11479v4)

Published 10 Dec 2021 in cs.CL and cs.LG

Abstract: Recent advancements in technology have led to a boost in social media usage which has ultimately led to large amounts of user-generated data which also includes hateful and offensive speech. The language used in social media is often a combination of English and the native language in the region. In India, Hindi is used predominantly and is often code-switched with English, giving rise to the Hinglish (Hindi+English) language. Various approaches have been made in the past to classify the code-mixed Hinglish hate speech using different machine learning and deep learning-based techniques. However, these techniques make use of recurrence on convolution mechanisms which are computationally expensive and have high memory requirements. Past techniques also make use of complex data processing making the existing techniques very complex and non-sustainable to change in data. Proposed work gives a much simpler approach which is not only at par with these complex networks but also exceeds performance with the use of subword tokenization algorithms like BPE and Unigram, along with multi-head attention-based techniques, giving an accuracy of 87.41% and an F1 score of 0.851 on standard datasets. Efficient use of BPE and Unigram algorithms help handle the nonconventional Hinglish vocabulary making the proposed technique simple, efficient and sustainable to use in the real world.

Summary

We haven't generated a summary for this paper yet.