Self-Assessment Tests are Unreliable Measures of LLM Personality (2309.08163v2)

Published 15 Sep 2023 in cs.CL and cs.AI

Abstract: As large language models (LLMs) evolve in their capabilities, various recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One such example is the measurement of "personality" of LLMs using self-assessment personality tests developed to measure human personality. Yet almost none of these works verify the applicability of these tests on LLMs. In this paper, we analyze the reliability of LLM personality scores obtained from self-assessment personality tests using two simple experiments. We first introduce the property of prompt sensitivity, where three semantically equivalent prompts representing three intuitive ways of administering self-assessment tests on LLMs are used to measure the personality of the same LLM. We find that all three prompts lead to very different personality scores, a difference that is statistically significant for all traits in a large majority of scenarios. We then introduce the property of option-order symmetry for personality measurement of LLMs. Since most of the self-assessment tests exist in the form of multiple-choice questions (MCQs), we argue that the scores should also be robust to not just the prompt template but also the order in which the options are presented. This test unsurprisingly reveals that the self-assessment test scores are not robust to the order of the options. These simple tests, done on ChatGPT and three Llama2 models of different sizes, show that self-assessment personality tests created for humans are unreliable measures of personality in LLMs.

Summary

  • The paper shows that self-assessment tests exhibit significant sensitivity to prompt phrasing and option order, undermining their reliability for measuring LLM personality.
  • Experimental results across ChatGPT and Llama2 models reveal statistically significant score variations when test phrasing and ordering are altered.
  • The study emphasizes the need for developing alternative evaluation methods that account for the unique behavioral characteristics of LLMs.

LLM Personality Assessment: Reliability Analysis

The paper "Self-Assessment Tests are Unreliable Measures of LLM Personality" (2309.08163) critically examines the applicability of self-assessment personality tests, designed for humans, to LLMs. It identifies vulnerabilities in using these tests to measure LLM personality by demonstrating prompt sensitivity and option-order sensitivity, calling into question the reliability of previous studies in this area.

Background and Motivation

LLMs are being deployed in roles that require understanding and modeling human behavior. This has led researchers to attempt to quantify LLM behavior using tools from psychology, such as self-assessment personality tests. These tests, which typically involve Likert-type scales, are used to measure personality traits in humans. However, the paper argues that the direct application of these tests to LLMs is not validated and may be inappropriate due to inherent differences between LLMs and humans. The authors note that prior studies have largely skipped verifying whether these instruments apply to LLMs at all, and that the few attempts to do so do not adequately address characteristics unique to LLMs.
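To make the measurement setup concrete, here is a minimal sketch of how Likert-type self-assessment items are typically scored, assuming a standard 5-point agreement scale and illustrative items; real inventories such as the IPIP define their own items, keying, and trait assignments, so the specifics below are hypothetical.

```python
# Minimal sketch (not from the paper): scoring Likert-type self-assessment items.
# Assumes a 5-point agreement scale; item wording and keying here are illustrative.

SCALE = {
    "Very Inaccurate": 1,
    "Moderately Inaccurate": 2,
    "Neither Accurate Nor Inaccurate": 3,
    "Moderately Accurate": 4,
    "Very Accurate": 5,
}

def score_item(response: str, reverse_keyed: bool) -> int:
    """Map a verbal Likert response to a numeric score, flipping reverse-keyed items."""
    raw = SCALE[response]
    return (6 - raw) if reverse_keyed else raw

def trait_score(responses: list[tuple[str, bool]]) -> float:
    """Average the item scores that load on a single trait (e.g., Extraversion)."""
    return sum(score_item(r, rev) for r, rev in responses) / len(responses)

# Example: two hypothetical Extraversion items, the second reverse-keyed.
print(trait_score([("Moderately Accurate", False), ("Very Accurate", True)]))  # -> 2.5
```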

Experimental Design

The paper employs two experiments to evaluate the reliability of self-assessment tests for LLMs: prompt sensitivity and option-order sensitivity.

Prompt Sensitivity

Prompt sensitivity assesses whether semantically equivalent prompts yield similar personality scores. Three different prompts, derived from previous studies, are used to administer the same personality test questions to the LLMs; the paper lists the three prompt templates in full.
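As an illustration of this setup, the sketch below shows three semantically equivalent prompt templates wrapping the same test item. The wording is hypothetical, not the paper's actual templates.

```python
# Illustrative sketch only: three semantically equivalent ways of asking an LLM to
# answer the same self-assessment item. The wording is hypothetical, not the
# paper's actual prompt templates.

ITEM = "I am the life of the party."
OPTIONS = [
    "Very Inaccurate",
    "Moderately Inaccurate",
    "Neither Accurate Nor Inaccurate",
    "Moderately Accurate",
    "Very Accurate",
]

def prompt_a(item, options):
    return f'Statement: "{item}"\nHow accurately does this describe you?\nOptions: {", ".join(options)}\nAnswer:'

def prompt_b(item, options):
    numbered = "\n".join(f"{i + 1}. {o}" for i, o in enumerate(options))
    return f'Considering the statement "{item}", pick the option that fits you best.\n{numbered}\nYour choice:'

def prompt_c(item, options):
    return f'You will rate yourself on the statement "{item}". Reply with exactly one of: {"; ".join(options)}.'

# Prompt sensitivity: administer the identical item under each template and compare
# the resulting trait-score distributions; the paper finds they differ significantly.
for build in (prompt_a, prompt_b, prompt_c):
    print(build(ITEM, OPTIONS), end="\n\n")
```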

Option-Order Sensitivity

Option-order sensitivity examines whether the order in which options are presented affects test responses. The order of options in multiple-choice questions is inverted, and the direction of the measurement scale is reversed to evaluate the impact on test scores. This is motivated by findings that LLMs are sensitive to the order of options in multiple-choice questions.
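A minimal sketch of this manipulation follows, assuming a standard 5-point scale and a letter-based answer format; it illustrates the idea rather than the paper's exact implementation.

```python
# Sketch of the option-order symmetry check (illustrative, not the paper's code).
# A score should not change when the same options are listed in the opposite order:
# the chosen option, not its position, should determine the numeric score.

OPTIONS = [
    "Very Inaccurate",
    "Moderately Inaccurate",
    "Neither Accurate Nor Inaccurate",
    "Moderately Accurate",
    "Very Accurate",
]

def present(options, reverse_order=False):
    """Render the options as lettered choices, optionally in reversed order."""
    opts = list(reversed(options)) if reverse_order else list(options)
    text = "\n".join(f"({chr(ord('A') + i)}) {o}" for i, o in enumerate(opts))
    return text, opts

def score(answer_letter, opts):
    """Score 1-5 by the chosen option's position in the canonical (unreversed) list."""
    chosen = opts[ord(answer_letter) - ord("A")]
    return OPTIONS.index(chosen) + 1

text_fwd, opts_fwd = present(OPTIONS)                       # A = Very Inaccurate ... E = Very Accurate
text_rev, opts_rev = present(OPTIONS, reverse_order=True)   # A = Very Accurate ... E = Very Inaccurate

# If an LLM's responses were invariant to presentation, the same underlying option
# would be selected in both orderings; the paper finds the scores shift instead.
print(score("D", opts_fwd))  # "Moderately Accurate" -> 4
print(score("B", opts_rev))  # same option under reversed ordering -> 4
```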

Results

The experiments were conducted on ChatGPT and three Llama2 models of varying sizes. The results indicate that personality scores obtained from self-assessment tests are highly sensitive to both prompt variations and option order.

Prompt Sensitivity Findings

The paper found that semantically equivalent prompts led to statistically significant differences in personality scores for all models tested. This suggests that the measured "personality" of an LLM is heavily influenced by the specific phrasing of the questions.

Option-Order Sensitivity Findings

The results showed that reversing the order of options or the direction of the scale also led to statistically significant differences in test scores. This indicates that LLM responses are not invariant to the presentation format of the questions, unlike human responses.

Figure 1: Self-assessment personality test scores for Llama2 and ChatGPT on the IPIP-300 dataset. Prompts marked "(R)" use the reversed option order or reversed scale direction described in the option-order sensitivity experiment.

Figure 1 compares self-assessment personality test scores across the evaluated LLMs on the IPIP-300 dataset.

Statistical Analysis

The statistical significance of the observed differences was assessed using the non-parametric Mann-Whitney U test. The null hypothesis, stating that the score distributions are identical, was rejected in a large majority of cases, particularly for ChatGPT.
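For reference, here is a hedged sketch of this kind of comparison using SciPy's Mann-Whitney U implementation on made-up score samples; the paper's actual distributions come from its own experiments.

```python
# Illustrative only: comparing two trait-score distributions obtained under two
# different prompt templates with the two-sided Mann-Whitney U test.
# The sample values below are fabricated for demonstration.
from scipy.stats import mannwhitneyu

scores_prompt_1 = [3.2, 3.4, 3.1, 3.5, 3.3, 3.6, 3.2, 3.4]  # hypothetical trait scores, prompt template 1
scores_prompt_2 = [4.1, 4.0, 4.3, 3.9, 4.2, 4.1, 4.0, 4.2]  # same trait, prompt template 2

stat, p_value = mannwhitneyu(scores_prompt_1, scores_prompt_2, alternative="two-sided")

# Null hypothesis: the two score distributions are identical.
# A small p-value (e.g., < 0.05) rejects it, i.e., the prompt wording changed the scores.
print(f"U = {stat}, p = {p_value:.4g}")
```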

Figure 2: Pairwise distributional difference test results for ChatGPT on the IPIP-300 dataset. Each heatmap cell gives the p-value of the Mann-Whitney U test comparing the two score distributions obtained under the prompt templates specified on the x and y axes.

Figure 2 shows the pairwise distributional difference test results for ChatGPT on the IPIP-300 dataset, highlighting the statistical significance of the score differences across prompt templates.

Figure 3: Summary statistics of the hypothesis test results.

Figure 3 provides summary statistics of the hypothesis tests, indicating the frequency with which the null hypothesis was rejected for each model.

Figure 4: Pairwise distributional difference test results for Llama2-7B on the IPIP-300 dataset.

Figure 4 visualizes the pairwise distributional difference test results for the Llama2-7B model, showing the statistical differences in scores across prompt variations.

Figure 5: Pairwise distributional difference test results for Llama2-13B on the IPIP-300 dataset.

Figure 5 visualizes the pairwise distributional difference test results for the Llama2-13B model, showing the statistical differences in scores across prompt variations.

Figure 6: Pairwise distributional difference test results for Llama2-70B on the IPIP-300 dataset.

Figure 6 visualizes the pairwise distributional difference test results for the Llama2-70B model, showing the statistical differences in scores across prompt variations.

Implications and Conclusion

The findings strongly suggest that self-assessment tests are unreliable measures of personality in LLMs. The sensitivity to prompt variations and option order undermines the validity of using these tests to quantify LLM behavior. The paper recommends against using these instruments and encourages the research community to explore more robust measures of personality in LLMs. The authors also raise the question of whether LLMs are even capable of introspection, a prerequisite for accurately answering self-assessment questions.

Future Directions

Future research should focus on developing alternative methods for evaluating LLM personality that account for the unique characteristics of these models. Collaboration between experts in AI, psychology, and linguistics is needed to create more robust and meaningful measures of LLM behavior.
