Emergent Mind

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

(arXiv:2406.16235)
Published Jun 23, 2024 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract

Detoxifying multilingual LLMs has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.

Figure: Tradeoffs between DPO learning rate, post-DPO generation toxicity, and perplexity across 17 languages.

Overview

  • The paper assesses the effectiveness of Direct Preference Optimization (DPO) in reducing toxicity in LLMs across multiple non-English languages using English training data.

  • Mechanistic analyses using techniques like causal intervention and activation analysis reveal that neurons associated with multilingual toxicity can be uniformly suppressed through DPO training.

  • The study demonstrates a strong correlation between bilingual sentence retrieval accuracy and cross-lingual toxicity reduction, validating the use of this metric for predicting generalizability.

LLMs are increasingly being deployed across the globe for a variety of applications, which emphasizes the need to ensure their outputs are safe and non-toxic across multiple languages. The paper "Preference Tuning For Toxicity Mitigation Generalizes Across Languages" by Xiaochen Li, Zheng-Xin Yong, and Stephen H. Bach explores the cross-lingual effectiveness of Direct Preference Optimization (DPO) as a method for detoxifying LLMs.

Research Objectives and Methods

The primary objective of this work is to test whether preference tuning on English data alone can reduce toxicity in LLM outputs across various non-English languages. This objective is pursued through several methodological components:

  1. Direct Preference Optimization (DPO): A preference-tuning method trained here exclusively on English data with the aim of reducing toxicity in model outputs. The authors apply DPO to several multilingual LLMs, including mGPT-1.3B, BLOOM, Llama3, and Aya-23, and evaluate on a set of 17 different languages.
  2. Toxicity Evaluation and Metrics: The authors evaluate the toxicity of LLM outputs using Perspective API across three primary metrics: Expected Maximum Toxicity (EMT), Toxicity Probability (ToxProb), and Average Toxicity (AvgTox). Additionally, fluency and diversity metrics are computed to gauge trade-offs involved in preference tuning.
  3. Mechanistic Interpretability: Using mechanisms such as causal intervention and activation analysis, the study identifies the multilingual properties of key and value vectors in MLP layers, elucidating the cross-lingual generalizability of DPO.
  4. Prediction of Transferability: The paper proposes bilingual sentence retrieval accuracy as a predictive metric for the cross-lingual generalizability of DPO, correlating this with the effectiveness of toxicity reduction.
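Conceptually, the objective in step 1 is the standard DPO loss on pairs of preferred (non-toxic) and dispreferred (toxic) continuations. The sketch below is our own minimal illustration of that loss for a single preference pair, not the authors' implementation; the function name and toy log-probabilities are ours.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability of the chosen (non-toxic)
    or rejected (toxic) continuation under the trainable policy or the
    frozen reference model. beta controls how far the policy may drift
    from the reference.
    """
    # Implicit reward margins relative to the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Negative log-sigmoid: small when the policy already prefers the
    # non-toxic continuation more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy favors the non-toxic continuation more than the reference: low loss.
assert dpo_loss(-5.0, -20.0, -10.0, -10.0) < math.log(2)
# Policy favors the toxic continuation: high loss.
assert dpo_loss(-20.0, -5.0, -10.0, -10.0) > math.log(2)
```

Minimizing this loss pushes probability mass away from the toxic continuation relative to the reference model, which is why no reward model is needed at training time.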

Results

The results are noteworthy and multifaceted:

  • Toxicity Reduction: Notable reductions in toxicity were observed across all 17 evaluated languages. For instance, the probability of mGPT-1.3B generating toxic continuations dropped from 46.8% to 3.9%. The effect held across other models, including BLOOM, Llama3, and Aya-23, demonstrating the robustness of the DPO approach.
  • Mechanistic Insights: The study found that the MLP neurons driving toxic generation exhibit "dual multilinguality": the same neurons promote toxicity across different languages, and DPO training on English data suppresses their activations uniformly across languages.
  • Predicting Generalizability: A strong positive correlation (Pearson's r = 0.732, p < 0.01) was established between bilingual sentence retrieval accuracy and the reduction in EMT. Languages whose representations align more closely with English showed larger toxicity reductions, validating the proposed predictive metric.
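The reported correlation is a plain Pearson coefficient between two per-language quantities. As a hedged illustration only, the sketch below computes it from hypothetical numbers (the retrieval accuracies and EMT reductions here are invented for the example, not taken from the paper):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-language values: bilingual sentence retrieval accuracy
# (how often a sentence's nearest English neighbor is its true translation)
# vs. the drop in Expected Maximum Toxicity after English-only DPO.
retrieval_acc = [0.92, 0.85, 0.70, 0.55, 0.40]
emt_reduction = [0.31, 0.28, 0.22, 0.15, 0.09]
r = pearson_r(retrieval_acc, emt_reduction)
assert r > 0.9  # toy data constructed to be strongly correlated
```

The practical appeal is that retrieval accuracy can be measured before any safety tuning, so it serves as a cheap screen for which languages are likely to benefit from English-only DPO.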

Implications and Future Work

The implications of these findings are significant both practically and theoretically:

  • Practical: The demonstrated feasibility of using English data for detoxification across languages can substantially lower the resource barriers to multilingual safety enhancements. This is particularly beneficial for lower-resourced languages, where annotated toxic data is scarce.
  • Theoretical: The uncovering of dual multilinguality in MLP layers advances our understanding of how LLMs represent and process multilingual toxic content, providing a foundation for further mechanistic explorations of cross-lingual transfer learning.
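One common form of the activation analysis behind this kind of finding is to score each MLP neuron's value vector (a column of the down-projection) by its alignment with a separately obtained "toxicity direction," such as the weight vector of a linear probe. The sketch below is our own toy illustration of that scoring step under those assumptions; the function names, the probe, and the 3-d vectors are invented, and this is not the paper's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_toxic_value_vectors(value_vectors, toxicity_direction, top_k=2):
    """Rank MLP value vectors by alignment with a toxicity probe direction.

    A value vector that points along the probe direction writes
    "toxicity" into the residual stream whenever its neuron fires, so
    highly aligned neurons are candidates for the toxic subnetwork.
    """
    scored = [(i, cosine(v, toxicity_direction))
              for i, v in enumerate(value_vectors)]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]

# Toy 3-d example: neuron 2's value vector points along the probe.
probe = [1.0, 0.0, 0.0]
values = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]
top = rank_toxic_value_vectors(values, probe, top_k=1)
assert top[0][0] == 2
```

The dual-multilinguality claim corresponds to checking how the *activations* of such neurons (driven by their key vectors) behave on toxic prompts in many languages, before and after DPO.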

Looking forward, several avenues for future research are evident:

  1. Extending Method Variety: While the present study focuses on a single preference-tuning method (DPO), evaluating alternatives such as PPO, KTO, ORPO, and CPO could establish whether cross-lingual toxicity mitigation generalizes across preference-tuning methods.
  2. Addressing Low-resource Languages: Future work should also delve into the challenges of extending these findings to even lower-resource languages, where language representation might not be as robust.
  3. Context-specific Detox: Further research could explore context-specific detoxification to see how cultural and contextual nuances impact the effectiveness of cross-lingual generalization strategies.

In conclusion, the paper provides compelling evidence supporting the cross-lingual scalability of preference tuning methods for LLM detoxification, driven by methodical experimentation and insightful mechanistic interpretations. This research represents a significant step forward in making LLMs safer for a global audience.
