DetoxLLM: A Framework for Detoxification with Explanations (2402.15951v2)

Published 25 Feb 2024 in cs.LG, cs.CL, and cs.CY

Abstract: Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. Additionally, these works do not address non-detoxifiability, a phenomenon whereby the toxic text cannot be detoxified without altering the meaning. We propose DetoxLLM, the first comprehensive end-to-end detoxification framework, which attempts to alleviate the aforementioned limitations. We first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies leveraging ChatGPT. We then train a suite of detoxification models with our cross-platform corpus. We show that our detoxification models outperform the SoTA model trained with human-annotated parallel corpus. We further introduce explanation to promote transparency and trustworthiness. DetoxLLM additionally offers a unique paraphrase detector especially dedicated for the detoxification task to tackle the non-detoxifiable cases. Through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of DetoxLLM against adversarial toxicity.

Summary

The paper presents a comprehensive detoxification framework that leverages pseudo-parallel corpora generated with ChatGPT to achieve robust cross-platform performance.
It incorporates transparent explanation mechanisms that clarify toxic content identification, thereby fostering trust and understanding among users.
A dedicated paraphrase detector distinguishes non-detoxifiable cases, ensuring that message integrity is maintained even during moderation.

Comprehensive Framework for Cross-Platform Detoxification and Handling Non-Detoxifiability

Introduction to DetoxLLM

Toxic language is a persistent problem in online communication, and it appears across many platforms with different conventions. A practical detoxification system therefore has to reduce toxicity while preserving the meaning of the original message. DetoxLLM is presented as an end-to-end framework for this task. It targets three gaps in prior work: detoxification across platforms, including ones not seen during training; explanations of why an input is considered toxic; and handling of non-detoxifiable content, where the toxicity cannot be removed without changing the meaning.

Cross-Platform Detoxification

DetoxLLM takes a cross-platform approach to detoxification, addressing the variation in language across social media platforms. The authors use ChatGPT, with multi-step data processing and generation strategies, to build a pseudo-parallel corpus of toxic texts paired with non-toxic rewrites. The detoxification models are trained on this corpus, with the aim of performing well on platforms beyond those seen during training.
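The corpus-building idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the `PseudoPair` record, the `build_pseudo_parallel` function, the filtering rule, and the `toy_generate` stand-in for an LLM call are all hypothetical names and choices made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class PseudoPair:
    """One pseudo-parallel example: a toxic text and a generated rewrite."""
    platform: str
    toxic: str
    detoxified: str


def build_pseudo_parallel(
    samples: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[PseudoPair]:
    """Pair each (platform, toxic_text) sample with a generated rewrite.

    `generate` is injected so that any text generator can be plugged in.
    The filter below (drop empty or unchanged outputs) is an assumed,
    simplified processing step, not one taken from the paper.
    """
    pairs = []
    for platform, toxic in samples:
        candidate = generate(toxic).strip()
        if not candidate or candidate.lower() == toxic.strip().lower():
            continue  # nothing useful was generated for this sample
        pairs.append(PseudoPair(platform, toxic, candidate))
    return pairs


def toy_generate(text: str) -> str:
    # Hypothetical stand-in for an LLM call; it only swaps one word.
    return text.replace("stupid", "weak")


samples = [
    ("forum", "That is a stupid idea."),
    ("microblog", "I disagree with this."),
]
pairs = build_pseudo_parallel(samples, toy_generate)
for pair in pairs:
    print(pair.platform, "|", pair.toxic, "->", pair.detoxified)
```

Keeping the platform label on every pair makes it possible to hold out whole platforms later when measuring performance on unseen ones.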

Transparency through Explanation

DetoxLLM also provides explanations alongside its output. The framework explains why the input was identified as toxic, which is intended to promote transparency and trust. These explanations can also help users and platform moderators understand what makes a given text harmful.

Tackling Non-Detoxifiability

DetoxLLM also addresses non-detoxifiability, the case where toxic text cannot be detoxified without altering its meaning. It includes a paraphrase detector built specifically for the detoxification task, which distinguishes detoxifiable from non-detoxifiable cases. When a case is non-detoxifiable, DetoxLLM issues a warning, so that moderation does not silently change what the author meant.
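The gating logic described above can be sketched as follows. This is a toy illustration under stated assumptions, not the framework's implementation: `DetoxResult`, `detoxify_with_check`, the word lists, and the token-overlap check are hypothetical, and the overlap check only stands in for a trained paraphrase detector.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical toy vocabulary; a real system would use trained models.
TOY_REWRITES = {"idiotic": "flawed"}
TOY_DROP = {"loser", "clown"}


@dataclass
class DetoxResult:
    output: str
    explanation: str
    warning: Optional[str] = None


def toy_detoxify(text: str) -> str:
    # Rewrite or drop flagged words; stands in for a detoxification model.
    words = [w for w in text.split() if w.lower() not in TOY_DROP]
    return " ".join(TOY_REWRITES.get(w.lower(), w) for w in words)


def toy_explain(text: str) -> str:
    # Stands in for a generated natural-language explanation.
    flagged = [w for w in text.split()
               if w.lower() in TOY_REWRITES or w.lower() in TOY_DROP]
    if not flagged:
        return "No flagged terms."
    return "Flagged terms: " + ", ".join(flagged)


def toy_is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    # Jaccard token overlap as a crude proxy for a paraphrase detector.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def detoxify_with_check(
    text: str,
    detoxify: Callable[[str], str],
    explain: Callable[[str], str],
    is_paraphrase: Callable[[str, str], bool],
) -> DetoxResult:
    """Detoxify, explain, and warn when the rewrite is not a paraphrase."""
    candidate = detoxify(text)
    result = DetoxResult(output=candidate, explanation=explain(text))
    if not is_paraphrase(text, candidate):
        result.warning = "Meaning may have changed: input may be non-detoxifiable."
    return result


ok = detoxify_with_check("This plan is idiotic",
                         toy_detoxify, toy_explain, toy_is_paraphrase)
risky = detoxify_with_check("idiotic loser clown",
                            toy_detoxify, toy_explain, toy_is_paraphrase)
print(ok)
print(risky)
```

In the first call most of the sentence survives the rewrite, so no warning is attached. In the second, the insult is the entire message, almost nothing remains after rewriting, and the result carries a warning.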

Empirical Validation

According to the authors' experiments, the detoxification models outperform the state-of-the-art model trained on a human-annotated parallel corpus, while preserving meaning and fluency. The paraphrase detector is reported to identify non-detoxifiable cases effectively, and the experiments also examine the robustness of DetoxLLM against adversarial toxicity.

Implications and Future Directions

Beyond its immediate applications, the framework shows how explanations and the handling of non-detoxifiability can be integrated into content moderation. Its cross-platform design is a step toward detoxification models that transfer across online platforms. Future research may refine the explanation mechanisms and improve the robustness of detoxification models against evolving forms of toxic language.

Overall, DetoxLLM makes the case for detoxification that is comprehensive, adaptable across platforms, and transparent, combining a cross-platform training corpus, explanations, and a check for non-detoxifiable inputs.
