Evaluating Human Alignment and Model Faithfulness of LLM Rationale

(2407.00219)
Published Jun 28, 2024 in cs.CL and cs.AI

Abstract

We study how well LLMs explain their generations with rationales -- a set of tokens extracted from the input texts that reflect the decision process of LLMs. We examine LLM rationales extracted with two methods: 1) attribution-based methods that use attention or gradients to locate important tokens, and 2) prompting-based methods that guide LLMs to extract rationales using prompts. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales, and demonstrate reasonable alignment with humans even when model performance is poor. We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions. By fine-tuning these models on the corresponding datasets, both prompting and attribution methods demonstrate improved faithfulness. Our study sheds light on more rigorous and fair evaluations of LLM rationales, especially for prompting-based ones.

Figure: Analysis methodology on e-SNLI, covering the human alignment comparison, model faithfulness evaluation, and rationale masking effects.

Overview

  • The paper examines how LLMs explain their predictions through rationales, comparing human alignment and model faithfulness of these rationales.

  • It employs five advanced LLMs across two datasets and contrasts attribution-based methods with prompting-based methods for rationale extraction.

  • Findings indicate that prompting-based methods generally achieve better human alignment, while fine-tuning significantly improves both alignment and faithfulness, underscoring how limited both properties are for out-of-the-box models.

Evaluating Human Alignment and Model Faithfulness of LLM Rationale

The paper titled "Evaluating Human Alignment and Model Faithfulness of LLM Rationale" addresses a critical aspect of interpretability within the realm of LLMs, specifically focusing on how well these models can explain their predictions through rationales. Rationales are defined as sets of tokens from input texts that reflect the decision-making process of the LLMs. The paper evaluates these rationales based on two key properties: human alignment and model faithfulness. Through extensive experimentation, the authors compare attribution-based and prompting-based methods of extracting rationales.

Methodology Overview

The study employs five state-of-the-art LLMs, both open-source (e.g., Llama2, Llama3, Mistral) and proprietary (e.g., GPT-3.5-Turbo, GPT-4-Turbo), across two annotated datasets: e-SNLI and MedicalBios. Attribution-based methods leverage internal model signals, such as attention weights and input gradients, to locate important tokens. In contrast, prompting-based methods use carefully crafted prompts to guide the LLMs in extracting rationales.
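To make the two extraction families concrete, the sketch below contrasts a gradient-x-input attribution pass with a prompting-based extraction for an open-source instruction-tuned model. The model name, prompt wording, choice of gradient-x-input as the saliency variant, and top-k cutoff are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the two rationale-extraction families (illustrative, not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # assumption: any open-source LLM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def attribution_rationale(text: str, top_k: int = 5) -> list[str]:
    """Attribution-based: score each input token by gradient-x-input saliency
    of the model's top next-token logit, then keep the top-k tokens."""
    enc = tokenizer(text, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    logits[0, -1].max().backward()                        # gradient w.r.t. input embeddings
    scores = (embeds.grad * embeds).norm(dim=-1).squeeze(0)
    top_ids = enc["input_ids"][0, scores.topk(top_k).indices].tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

def prompting_rationale(text: str) -> str:
    """Prompting-based: ask the model to quote the tokens behind its decision."""
    prompt = (f"Text: {text}\n"
              "List the exact words from the text that most influenced your decision:")
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```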

Alignment and Faithfulness Evaluation

The human alignment of rationales is measured by comparing them with human-annotated rationales using the F1 score. Model faithfulness is evaluated through perturbation-based experiments that measure the flip rate, i.e., how often the model's prediction changes when the identified important tokens are masked.
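Both metrics admit a short sketch: token-level F1 against human annotations, and flip rate under masking. The masking scheme (replacing rationale tokens with an underscore) and the data layout are assumptions for illustration, not the paper's evaluation code.

```python
# Minimal sketch of the two evaluation metrics (illustrative assumptions noted above).
from collections import Counter

def rationale_f1(predicted_tokens: list[str], human_tokens: list[str]) -> float:
    """Token-level F1 between extracted and human-annotated rationale tokens."""
    pred, gold = Counter(predicted_tokens), Counter(human_tokens)
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

def flip_rate(texts: list[str], rationales: list[list[str]], predict_fn) -> float:
    """Fraction of examples whose prediction changes once rationale tokens are masked;
    predict_fn(text) -> label wraps whichever model is being evaluated."""
    flips = 0
    for text, rationale in zip(texts, rationales):
        original = predict_fn(text)
        masked = " ".join("_" if tok in set(rationale) else tok for tok in text.split())
        flips += int(predict_fn(masked) != original)
    return flips / len(texts)
```

Under this framing, a higher flip rate indicates a more faithful rationale, since masking genuinely decisive tokens should change the prediction.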

Key Findings

  1. Prompts vs. Attribution Methods: Prompting-based methods generally outperform attribution-based methods in human alignment across both datasets and models. Specifically, tailored prompts (short or normal) resulted in superior alignment scores. However, these prompting methods showed variability across different models and datasets, highlighting the sensitivity of these approaches to prompt design.
  2. Fine-tuning Effects: Fine-tuning LLMs on specific datasets significantly improves both human alignment and faithfulness. This is particularly notable for models like Llama-2 and Mistral, which displayed near-random performance on the e-SNLI dataset when used out of the box. Fine-tuning increases the models' alignment with human rationales and leads to more faithful rationales, as assessed by perturbation experiments.
  3. Faithfulness Limitations Pre-Fine-tuning: Before fine-tuning, the models displayed minimal changes in prediction when important tokens in the input sentences were masked, suggesting a lack of genuine interpretability. This limitation is attributed to models focusing disproportionately on instructional tokens rather than the input text itself.
  4. Comparative Faithfulness: After fine-tuning, attribution-based methods generally provided more faithful rationales compared to prompting-based methods. Additionally, human rationales induced higher flip rates than model-generated rationales, which underscores the gap in interpretability and the potential for further method improvements.

Implications and Future Directions

The results of this study have significant implications for both the practical deployment and theoretical understanding of LLMs:

  • Deployment in High-Stakes Scenarios: Without human-aligned and faithful rationales, LLMs cannot be reliably deployed in high-stakes applications. This motivates better fine-tuning strategies and potentially new methods of rationale extraction.
  • Refinement of Explanation Methods: There is a clear need for refining prompting strategies and developing new attribution-based methods that can more accurately capture and convey the decision-making processes of LLMs.
  • Investigating Instruction Adherence: The reliance on instructional tokens reveals a deeper issue with how LLMs parse and prioritize different parts of input. Future research should explore methods that encourage models to focus more on the input text and less on repeated instructional cues.

Conclusion

This comprehensive study of LLM rationales, focusing on human alignment and faithfulness, provides crucial insights and diagnostic evaluations. By highlighting both the strengths and limitations of existing methods, this paper lays the groundwork for future advancements in making LLMs not only powerful but also transparent and trustworthy. Fine-tuning emerges as a pivotal step in improving model interpretability, suggesting that continued development in this direction will be vital for the more responsible use of LLMs.

In summary, the paper "Evaluating Human Alignment and Model Faithfulness of LLM Rationale" represents a significant empirical analysis that addresses current pitfalls in the interpretability of LLMs, providing a pathway for subsequent research efforts and practical improvements in model deployment.
