Benchmarking Mental State Representations in Language Models

(2406.17513)
Published Jun 25, 2024 in cs.CL and cs.AI

Abstract

While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models' internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark of various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models' internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on Theory of Mind tasks. We demonstrate that models' representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models' reasoning performance by steering their activations without the need to train any probe.

Figure: Belief probing accuracy across model architectures, sizes, and fine-tuning stages.

Overview

  • The paper investigates how various language models represent mental states, focusing specifically on Theory of Mind capabilities, which are crucial for improving human-computer interaction.

  • Probing experiments on models like Pythia and Llama-2 reveal that larger models and those fine-tuned with methods like instruction-tuning and RLHF show improved accuracy in representing beliefs.

  • Activation steering techniques, such as Contrastive Activation Addition (CAA), significantly enhance the models' reasoning performance, often achieving over 90% accuracy without needing dedicated probe training.

Benchmarking Mental State Representations in Language Models: A Professional Overview

The paper "Benchmarking Mental State Representations in Language Models" by Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, and Andreas Bulling presents an extensive investigation into the internal representations of mental states, specifically beliefs, within various Language Models (LMs). This research addresses the emergent area of evaluating Theory of Mind (ToM) capabilities in LMs, which is pivotal for enhancing human-computer interaction through improved understanding of human mental states.

Summary of Contributions

The research benchmarks mental state representations across a range of LM families, model sizes, fine-tuning approaches, and prompt variations. The primary contributions of the paper include:

Probing Experiments Across Diverse LMs:

  • The study encompasses both base and fine-tuned versions of the Pythia and Llama-2 models, ranging from 70 million to 70 billion parameters.
  • Probing experiments reveal that accuracy in representing others' beliefs increases with model size and that fine-tuning, particularly instruction-tuning and RLHF, significantly enhances the quality of these representations.

Sensitivity to Prompt Variations:

  • The authors explore the robustness of LM representations to different types of prompt variations, including random token insertion, misleading prompts, time specification, and initial belief revelation.
  • It is observed that while representations of the oracle (ground-truth) belief remain robust, representations of beliefs from another's perspective are sensitive to prompt variations, with significant improvements when the protagonist's initial belief is revealed in the prompt.

Investigation of Memorization:

  • Probing accuracy is analyzed under dimensionality reduction to identify potential memorization effects. Probes trained on the principal components of the activation data retain high accuracy with significantly fewer parameters, indicating that memorization is not a major concern.

Enhancing Reasoning via Activation Steering:

  • Building on previous work on activation editing, the paper shows that Contrastive Activation Addition (CAA) improves LMs' ToM performance without training any probe.
  • Results showcase substantial performance gains across forward belief, forward action, and backward belief tasks, often reaching or exceeding 90% accuracy.

Analytical Insights and Implications

Probing Experiments and Fine-tuning

The probing experiments indicate a crucial aspect of model design and training: larger models tend to develop richer and more accurate internal representations of mental states. This is consistent with broader trends in NLP showing that scaling up model parameters correlates with improved performance across many tasks. The effectiveness of fine-tuning, particularly through methods involving human feedback such as RLHF, suggests that exposure to diverse and informative data during training is pivotal for developing ToM reasoning capabilities.
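
To make the probing setup concrete, the sketch below shows a minimal linear belief probe of the kind used in such experiments: a logistic regression classifier trained on hidden activations from one LM layer to predict a binary belief label. The arrays here are random placeholders, not the paper's data; in practice they would be residual-stream activations from a model such as Pythia or Llama-2 together with belief annotations from a ToM dataset.

```python
# Minimal sketch of a linear belief probe (not the authors' exact code).
# `activations` stands in for residual-stream activations extracted at one
# layer of an LM; `labels` stands in for binary belief annotations
# (e.g., 0 = protagonist holds a false belief, 1 = true belief).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))  # placeholder activations
labels = rng.integers(0, 2, size=1000)       # placeholder belief labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
```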

Sensitivity to Prompts

The sensitivity of belief representations to prompt variations underscores the necessity for careful prompt engineering when deploying LMs in real-world applications. This finding implies that achieving robust mental state understanding requires not only advanced model architectures and training regimes but also carefully designed prompts that can mitigate ambiguity and bias in generated responses.
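
To illustrate what such prompt variations look like in practice, the sketch below constructs hypothetical variants of a ToM-style story covering the four variation types studied: random token insertion, a misleading hint, a time specification, and an explicit statement of the protagonist's initial belief. The story and wording are illustrative assumptions, not the paper's exact templates.

```python
# Hypothetical examples of the four prompt-variation types; the story and
# exact wording are illustrative, not the paper's templates.
import random

random.seed(0)

base_story = (
    "Anne puts the chocolate in the drawer and leaves the room. "
    "While she is away, Bob moves the chocolate to the cupboard."
)
question = "Where does Anne think the chocolate is?"

def insert_random_tokens(story, vocab=("apple", "seven", "blue"), k=3):
    """Insert a few unrelated tokens at random positions in the story."""
    words = story.split()
    for _ in range(k):
        words.insert(random.randrange(len(words) + 1), random.choice(vocab))
    return " ".join(words)

variants = {
    "original": base_story,
    "random_tokens": insert_random_tokens(base_story),
    "misleading": base_story + " Hint: chocolate is usually stored in the cupboard.",
    "time_specification": "It is Monday afternoon. " + base_story,
    "initial_belief": base_story + " Initially, Anne believes the chocolate is in the drawer.",
}

for name, story in variants.items():
    print(f"[{name}]\n{story}\n{question}\n")
```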

Memorization Concerns

Addressing memorization in probing provides reassurance regarding the interpretability of probe outcomes. The absence of strong memorization effects, as evidenced by the successful use of principal components, suggests that the probes offer genuine insights into the models’ internal representations without overfitting to the training data.
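
A minimal version of this memorization check can be sketched as follows: the probe is re-trained on only the first k principal components of the activations, and accuracy is compared across values of k. If accuracy stays high with far fewer probe parameters, the probe is unlikely to be memorizing individual training examples. The data below are random placeholders standing in for real layer activations and belief labels.

```python
# Sketch of the memorization check: train the probe on only the first k
# principal components of the activations and compare accuracy across k.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))  # placeholder layer activations
labels = rng.integers(0, 2, size=1000)       # placeholder belief labels

for k in (2, 16, 64, 256):
    pca_probe = make_pipeline(PCA(n_components=k), LogisticRegression(max_iter=1000))
    acc = cross_val_score(pca_probe, activations, labels, cv=5).mean()
    print(f"{k:>4} components: mean CV accuracy = {acc:.3f}")
```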

The Role of Activation Steering

The success of CAA in improving reasoning performance without dedicated probe training marks a notable advancement. This technique's efficacy in steering model activations points to a practical and computationally efficient method for enhancing model performance on specific tasks, particularly those involving nuanced understanding like ToM. This could be crucial for future work aimed at achieving highly adaptable and generalizable AI systems.
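
The sketch below illustrates the general CAA recipe rather than the paper's exact implementation: a steering vector is computed as the mean difference between layer activations for contrastive prompt pairs (the same question paired with a correct versus an incorrect answer token), and a scaled copy of that vector is added back at the same layer during generation via a forward hook. The model name, layer index, multiplier, and prompts are illustrative assumptions.

```python
# Sketch of the CAA idea (not the paper's exact implementation).
# Steering vector = mean difference of layer activations between prompts
# paired with a correct vs. an incorrect answer; a scaled copy is added
# back at that layer during generation. Model, layer, multiplier, and
# prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed model
LAYER, MULTIPLIER = 13, 1.0                   # illustrative choices

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the last token after decoder layer LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER + 1]
    return hidden[0, -1, :]

# Contrastive pairs: the same ToM question with a correct vs. a wrong answer letter.
pairs = [
    (
        "Story... Where will Anne look for the chocolate? Answer: (A",
        "Story... Where will Anne look for the chocolate? Answer: (B",
    ),
]
steering_vector = torch.stack(
    [last_token_activation(pos) - last_token_activation(neg) for pos, neg in pairs]
).mean(dim=0)

def add_steering(module, inputs, output):
    """Forward hook: add the scaled steering vector to the layer's hidden states."""
    hidden_states = output[0] if isinstance(output, tuple) else output
    hidden_states = hidden_states + MULTIPLIER * steering_vector.to(hidden_states.dtype)
    if isinstance(output, tuple):
        return (hidden_states,) + output[1:]
    return hidden_states

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    prompt = "Story... Where will Anne look for the chocolate? Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Because no probe is trained, the only free choices in this recipe are the layer and the steering multiplier, which is what makes the approach comparatively lightweight next to probe-based interventions.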

Future Speculations and Theoretical Implications

The findings present pathways for future research and development. One critical direction is the extension of these benchmarks to new data sets and model architectures, given the rapid evolution of LMs. Exploring the intersection of ToM capabilities with other cognitive faculties like emotional intelligence and moral reasoning could provide a holistic framework for AI understanding of human mental states.

Furthermore, the implications of fine-tuning methodologies in shaping ToM capabilities suggest that future models could benefit from more sophisticated and specialized instruction-tuning regimes that encompass broader and deeper aspects of human cognition. This could lead to AI systems that not only understand beliefs and emotions but can also navigate complex social interactions with a higher degree of empathy and contextual awareness.

In conclusion, this paper provides extensive benchmarks and insights into how LMs represent mental states. The robustness analysis, fine-tuning impacts, and activation steering advancements contribute to a deeper understanding of the internal mechanisms that underpin ToM capabilities in AI, highlighting both current capabilities and future potential in this vital area of research.
