Self-Assessment Tests are Unreliable Measures of LLM Personality (2309.08163v2)

Published 15 Sep 2023 in cs.CL and cs.AI

Abstract: As large language models (LLMs) evolve in their capabilities, various recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One such example is the measurement of "personality" of LLMs using self-assessment personality tests developed to measure human personality. Yet almost none of these works verify the applicability of these tests on LLMs. In this paper, we analyze the reliability of LLM personality scores obtained from self-assessment personality tests using two simple experiments. We first introduce the property of prompt sensitivity, where three semantically equivalent prompts representing three intuitive ways of administering self-assessment tests on LLMs are used to measure the personality of the same LLM. We find that all three prompts lead to very different personality scores, a difference that is statistically significant for all traits in a large majority of scenarios. We then introduce the property of option-order symmetry for personality measurement of LLMs. Since most of the self-assessment tests exist in the form of multiple-choice questions (MCQs), we argue that the scores should also be robust to not just the prompt template but also the order in which the options are presented. This test unsurprisingly reveals that the self-assessment test scores are not robust to the order of the options. These simple tests, done on ChatGPT and three Llama2 models of different sizes, show that self-assessment personality tests created for humans are unreliable measures of personality in LLMs.

Summary

  • The paper shows that self-assessment tests exhibit significant sensitivity to prompt phrasing and option order, undermining their reliability for measuring LLM personality.
  • Experimental results across ChatGPT and Llama2 models reveal statistically significant score variations when test phrasing and ordering are altered.
  • The study emphasizes the need for developing alternative evaluation methods that account for the unique behavioral characteristics of LLMs.

LLM Personality Assessment: Reliability Analysis

The paper "Self-Assessment Tests are Unreliable Measures of LLM Personality" (2309.08163) critically examines the applicability of self-assessment personality tests, designed for humans, to LLMs. It identifies vulnerabilities in using these tests to measure LLM personality by demonstrating prompt sensitivity and option-order sensitivity, calling into question the reliability of previous studies in this area.

Background and Motivation

LLMs are being deployed in roles that require understanding and modeling human behavior. This has led researchers to attempt to quantify LLM behavior using tools from psychology, such as self-assessment personality tests. These tests, which typically involve Likert-type scales, are used to measure personality traits in humans. However, the paper argues that the direct application of these tests to LLMs is not validated and may be inappropriate due to inherent differences between LLMs and humans. The authors note that prior studies have largely skipped verifying whether these instruments apply to LLMs at all, and that the few attempts to do so do not adequately address characteristics unique to LLMs.
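To make the measurement setup concrete, here is a minimal sketch of how Likert-type self-assessment items are typically scored, assuming a standard 5-point agreement scale and illustrative items; real inventories such as the IPIP define their own items, keying, and trait assignments, so the specifics below are hypothetical.

```python
# Minimal sketch (not from the paper): scoring Likert-type self-assessment items.
# Assumes a 5-point agreement scale; item wording and keying here are illustrative.

SCALE = {
    "Very Inaccurate": 1,
    "Moderately Inaccurate": 2,
    "Neither Accurate Nor Inaccurate": 3,
    "Moderately Accurate": 4,
    "Very Accurate": 5,
}

def score_item(response: str, reverse_keyed: bool) -> int:
    """Map a verbal Likert response to a numeric score, flipping reverse-keyed items."""
    raw = SCALE[response]
    return (6 - raw) if reverse_keyed else raw

def trait_score(responses: list[tuple[str, bool]]) -> float:
    """Average the item scores that load on a single trait (e.g., Extraversion)."""
    return sum(score_item(r, rev) for r, rev in responses) / len(responses)

# Example: two hypothetical Extraversion items, the second reverse-keyed.
print(trait_score([("Moderately Accurate", False), ("Very Accurate", True)]))  # -> 2.5
```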

Experimental Design

The paper employs two experiments to evaluate the reliability of self-assessment tests for LLMs: prompt sensitivity and option-order sensitivity.

Prompt Sensitivity

Prompt sensitivity assesses whether semantically equivalent prompts yield similar personality scores. Three different prompts, derived from previous studies, are used to administer the same personality test questions to the LLMs; the paper lists the three prompt templates in full.
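As an illustration of this setup, the sketch below shows three semantically equivalent prompt templates wrapping the same test item. The wording is hypothetical, not the paper's actual templates.

```python
# Illustrative sketch only: three semantically equivalent ways of asking an LLM to
# answer the same self-assessment item. The wording is hypothetical, not the
# paper's actual prompt templates.

ITEM = "I am the life of the party."
OPTIONS = [
    "Very Inaccurate",
    "Moderately Inaccurate",
    "Neither Accurate Nor Inaccurate",
    "Moderately Accurate",
    "Very Accurate",
]

def prompt_a(item, options):
    return f'Statement: "{item}"\nHow accurately does this describe you?\nOptions: {", ".join(options)}\nAnswer:'

def prompt_b(item, options):
    numbered = "\n".join(f"{i + 1}. {o}" for i, o in enumerate(options))
    return f'Considering the statement "{item}", pick the option that fits you best.\n{numbered}\nYour choice:'

def prompt_c(item, options):
    return f'You will rate yourself on the statement "{item}". Reply with exactly one of: {"; ".join(options)}.'

# Prompt sensitivity: administer the identical item under each template and compare
# the resulting trait-score distributions; the paper finds they differ significantly.
for build in (prompt_a, prompt_b, prompt_c):
    print(build(ITEM, OPTIONS), end="\n\n")
```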

Option-Order Sensitivity

Option-order sensitivity examines whether the order in which options are presented affects test responses. The order of options in multiple-choice questions is inverted, and the direction of the measurement scale is reversed to evaluate the impact on test scores. This is motivated by findings that LLMs are sensitive to the order of options in multiple-choice questions.
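A minimal sketch of this manipulation follows, assuming a standard 5-point scale and a letter-based answer format; it illustrates the idea rather than the paper's exact implementation.

```python
# Sketch of the option-order symmetry check (illustrative, not the paper's code).
# A score should not change when the same options are listed in the opposite order:
# the chosen option, not its position, should determine the numeric score.

OPTIONS = [
    "Very Inaccurate",
    "Moderately Inaccurate",
    "Neither Accurate Nor Inaccurate",
    "Moderately Accurate",
    "Very Accurate",
]

def present(options, reverse_order=False):
    """Render the options as lettered choices, optionally in reversed order."""
    opts = list(reversed(options)) if reverse_order else list(options)
    text = "\n".join(f"({chr(ord('A') + i)}) {o}" for i, o in enumerate(opts))
    return text, opts

def score(answer_letter, opts):
    """Score 1-5 by the chosen option's position in the canonical (unreversed) list."""
    chosen = opts[ord(answer_letter) - ord("A")]
    return OPTIONS.index(chosen) + 1

text_fwd, opts_fwd = present(OPTIONS)                       # A = Very Inaccurate ... E = Very Accurate
text_rev, opts_rev = present(OPTIONS, reverse_order=True)   # A = Very Accurate ... E = Very Inaccurate

# If an LLM's responses were invariant to presentation, the same underlying option
# would be selected in both orderings; the paper finds the scores shift instead.
print(score("D", opts_fwd))  # "Moderately Accurate" -> 4
print(score("B", opts_rev))  # same option under reversed ordering -> 4
```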

Results

The experiments were conducted on ChatGPT and three Llama2 models of varying sizes. The results indicate that personality scores obtained from self-assessment tests are highly sensitive to both prompt variations and option order.

Prompt Sensitivity Findings

The paper found that semantically equivalent prompts led to statistically significant differences in personality scores for all models tested. This suggests that the measured "personality" of an LLM is heavily influenced by the specific phrasing of the questions.

Option-Order Sensitivity Findings

The results showed that reversing the order of options or the direction of the scale also led to statistically significant differences in test scores. This indicates that LLM responses are not invariant to the presentation format of the questions, unlike human responses.

Figure 1: Self-assessment personality test scores for Llama2 and ChatGPT on the IPIP-300 dataset. Prompts marked "(R)" use the reversed option order or reversed scale direction described in the option-order sensitivity experiment.

Figure 1 compares self-assessment personality test scores across the evaluated LLMs on the IPIP-300 dataset.

Statistical Analysis

The statistical significance of the observed differences was assessed using the non-parametric Mann-Whitney U test. The null hypothesis, stating that the score distributions are identical, was rejected in a large majority of cases, particularly for ChatGPT.
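For reference, here is a hedged sketch of this kind of comparison using SciPy's Mann-Whitney U implementation on made-up score samples; the paper's actual distributions come from its own experiments.

```python
# Illustrative only: comparing two trait-score distributions obtained under two
# different prompt templates with the two-sided Mann-Whitney U test.
# The sample values below are fabricated for demonstration.
from scipy.stats import mannwhitneyu

scores_prompt_1 = [3.2, 3.4, 3.1, 3.5, 3.3, 3.6, 3.2, 3.4]  # hypothetical trait scores, prompt template 1
scores_prompt_2 = [4.1, 4.0, 4.3, 3.9, 4.2, 4.1, 4.0, 4.2]  # same trait, prompt template 2

stat, p_value = mannwhitneyu(scores_prompt_1, scores_prompt_2, alternative="two-sided")

# Null hypothesis: the two score distributions are identical.
# A small p-value (e.g., < 0.05) rejects it, i.e., the prompt wording changed the scores.
print(f"U = {stat}, p = {p_value:.4g}")
```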

Figure 2: Pairwise distributional difference test results for ChatGPT on the IPIP-300 dataset. Each heatmap cell gives the p-value of the Mann-Whitney U test comparing the two score distributions obtained under the prompt templates specified on the x and y axes.

Figure 2 shows the pairwise distributional difference test results for ChatGPT on the IPIP-300 dataset, highlighting the statistical significance of the score differences across prompt templates.

Figure 3: Summary statistics of the hypothesis test results.

Figure 3 provides summary statistics of the hypothesis tests, indicating the frequency with which the null hypothesis was rejected for each model.

Figure 4: Pairwise distributional difference test results for Llama2-7B on the IPIP-300 dataset.

Figure 4 visualizes the pairwise distributional difference test results for the Llama2-7B model, showing the statistical differences in scores across prompt variations.

Figure 5: Pairwise distributional difference test results for Llama2-13B on the IPIP-300 dataset.

Figure 5 visualizes the pairwise distributional difference test results for the Llama2-13B model, showing the statistical differences in scores across prompt variations.

Figure 6: Pairwise distributional difference test results for Llama2-70B on the IPIP-300 dataset.

Figure 6 visualizes the pairwise distributional difference test results for the Llama2-70B model, showing the statistical differences in scores across prompt variations.

Implications and Conclusion

The findings strongly suggest that self-assessment tests are unreliable measures of personality in LLMs. The sensitivity to prompt variations and option order undermines the validity of using these tests to quantify LLM behavior. The paper recommends against using these instruments and encourages the research community to explore more robust measures of personality in LLMs. The authors also raise the question of whether LLMs are even capable of introspection, a prerequisite for accurately answering self-assessment questions.

Future Directions

Future research should focus on developing alternative methods for evaluating LLM personality that account for the unique characteristics of these models. Collaboration between experts in AI, psychology, and linguistics is needed to create more robust and meaningful measures of LLM behavior.
