EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models (2312.06281v2)
Abstract: We introduce EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in LLMs. We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark is able to discriminate effectively between a wide range of models. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020) (r=0.97), indicating that we may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://eqbench.com
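The reported r=0.97 is a correlation coefficient between per-model scores on two benchmarks. As a minimal sketch of how such a figure could be computed, assuming it is a Pearson correlation over paired scores, the snippet below defines a hypothetical helper `pearson_r`. The score lists are invented purely for illustration and are not results from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model scores, invented for illustration only
# (one entry per model, same model order in both lists).
eq_scores = [20.0, 35.0, 50.0, 65.0, 80.0]
mmlu_scores = [40.0, 50.0, 60.0, 70.0, 80.0]

r = pearson_r(eq_scores, mmlu_scores)
print(f"r = {r:.2f}")
```

A value near 1 means models that rank high on one benchmark also tend to rank high on the other. It does not by itself show that both benchmarks measure the same ability.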
- (2018). Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- (2021). 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861.
- (2023). ChatGPT outperforms humans in emotional awareness evaluations. Frontiers in Psychology, 14, 1199058.
- Goleman, D. (1996). Emotional intelligence. Why it can matter more than IQ. Learning, 24(6), 49–50.
- Hendrycks et al. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- (2022). The scoring challenge of emotional intelligence ability tests: A confirmatory factor analysis approach to model substantive and method effects using raw item scores. Frontiers in Psychology, 13, 812525.
- (2018). A systematic review of the pain scales in adults: Which to use? The American Journal of Emergency Medicine, 36(4), 707–714.
- (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35, 22199–22213.
- (2023). AlpacaEval: An automatic evaluator of instruction-following models. GitHub repository.
- (2023). MistralOrca: Mistral-7B model instruct-tuned on filtered OpenOrcaV1 GPT-4 dataset. https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca. HuggingFace.
- (2021). TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
- LMSYS. (2023). Chatbot Arena Leaderboard. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard (Accessed: 2023-12-06).
- (1997). What is the emotional intelligence? Implications for education. Emotional Development, Emotional Literacy, and Emotional Intelligence. New York: Basic Books.
- (2017). Genetic and environmental influences on emotion regulation: A twin study of cognitive reappraisal and expressive suppression. Emotion, 17(5), 772.
- Ogurlu, U. (2021). A meta-analytic review of emotional intelligence in gifted individuals: A multilevel analysis. Personality and Individual Differences, 171, 110503.
- OpenAI. (2023). GPT-4 Technical Report.
- (1990). Emotional intelligence. Imagination, Cognition and Personality, 9(3), 185–211.
- (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- (2003). Socioeconomic status modifies heritability of IQ in young children. Psychological Science, 14(6), 623–628.
- (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- (2008). A behavioral genetic study of trait emotional intelligence. Emotion, 8(5), 635.
- (2023). Emotional intelligence of large language models. Journal of Pacific Rim Psychology, 17, 18344909231213958.
- (2023). The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 9(1).
- (2023). Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
- (2019). HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.