Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Published 3 Jun 2024 in cs.CL and cs.AI | (2406.01382v1)

Abstract: What makes LLMs impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.

Summary

The paper introduces a systematic framework to measure how large language models align with human generalizations after performance feedback.
It uses survey data, 18,972 examples of human generalization across 79 tasks, to model and predict how people's beliefs about LLM capabilities shift.
BERT-based models outperform GPT-4 at predicting human generalizations. Separately, more capable LLMs can do worse on the instances people choose to use them for, especially when mistakes are costly.

Measuring the Alignment of LLMs with Human Expectations

The paper "Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function" presents a systematic framework for evaluating how well the capabilities of LLMs align with the way humans generalize about those capabilities. The work addresses a central challenge in deploying LLMs: ensuring they perform as expected across diverse real-world applications. The concern matters because deployment often hinges on users' beliefs about where a model will perform well.

In the authors' framework, the decision to deploy an LLM is essentially an act of human generalization. After interacting with an LLM, people form beliefs about its abilities and then deploy it on tasks they believe it can handle. Because LLMs are used for so many different purposes, aligning model performance with human expectations is central to effective deployment.

The authors collect an extensive dataset of human generalizations involving 18,972 examples across 79 tasks, sourced from the MMLU and BIG-Bench benchmarks. This dataset forms the backbone for modeling the human generalization function, allowing the authors to predict how humans update their beliefs about LLM capabilities.

Empirical results are drawn from surveys conducted via the Prolific platform. The surveys measure how human beliefs evolve after people observe an LLM's performance on certain tasks. The collected data show that human generalization is often sparse: in many cases, there is no observable change in belief about a model's capability after performance feedback is given.
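
A minimal sketch of how such belief shifts could be tabulated, assuming each survey response records a belief before and after feedback on a 0-to-1 scale. The record layout, the field names, the tolerance, and the numbers are illustrative assumptions, not the authors' schema or data.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One hypothetical survey response (field names are assumptions)."""
    prior: float           # belief that the LLM answers a target question correctly, before feedback
    posterior: float       # the same belief after seeing the LLM's result on another question
    llm_was_correct: bool  # whether the observed LLM answer was correct


def belief_shift(r: Response) -> float:
    """Signed change in belief caused by the observed feedback."""
    return r.posterior - r.prior


def sparsity(responses, tol=1e-9):
    """Fraction of responses whose belief did not move (within tol)."""
    if not responses:
        raise ValueError("need at least one response")
    unchanged = sum(1 for r in responses if abs(belief_shift(r)) <= tol)
    return unchanged / len(responses)


# Made-up responses, for illustration only.
responses = [
    Response(prior=0.5, posterior=0.5, llm_was_correct=True),     # no update
    Response(prior=0.5, posterior=0.75, llm_was_correct=True),    # belief rises
    Response(prior=0.5, posterior=0.5, llm_was_correct=False),    # no update
    Response(prior=0.75, posterior=0.25, llm_was_correct=False),  # belief falls
]

print(sparsity(responses))  # 0.5
```

A high sparsity value in this toy setup corresponds to the pattern reported in the surveys: seeing one result often leaves beliefs about other questions unchanged.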

The paper then models these generalizations with several predictive models. Notably, BERT-based models outperformed larger, more recent models such as GPT-4 at predicting how human beliefs change. This is consistent with the abstract's claim that people generalize in consistent, structured ways, and it may indicate that simpler models capture those patterns better than larger, more complex ones.

These findings matter most when evaluating LLM alignment with the human generalization function. More capable models, such as GPT-4, can do worse on the instances people choose to use them for, particularly in high-stakes settings where errors are costly. The cause is misalignment: after limited interaction, people become overconfident about where such a model will succeed and deploy it on tasks where it is likely to fail.
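
The effect described above can be illustrated with a toy calculation. The sketch assumes a person deploys a model only on instances where their belief that it will succeed meets a threshold, with a higher threshold when mistakes are costly. The function name, the threshold rule, and all numbers are hypothetical and are not taken from the paper.

```python
def deployed_accuracy(beliefs, correct, threshold):
    """Accuracy on the instances a person would choose to use the model for.

    beliefs:   human-believed probability of success, one per instance
    correct:   whether the model actually answers each instance correctly
    threshold: minimum belief required to deploy (higher when mistakes are costly)
    Returns None when no instance is deployed.
    """
    if len(beliefs) != len(correct):
        raise ValueError("beliefs and correct must have the same length")
    chosen = [c for b, c in zip(beliefs, correct) if b >= threshold]
    if not chosen:
        return None
    return sum(chosen) / len(chosen)


# Made-up toy data. Model A is weaker overall, but people's beliefs track
# exactly where it succeeds.
beliefs_a = [0.9, 0.9, 0.3, 0.3]
correct_a = [True, True, False, False]   # overall accuracy: 2 of 4

# Model B is stronger overall, but people are confident on an instance it fails.
beliefs_b = [0.9, 0.9, 0.9, 0.3]
correct_b = [True, False, True, True]    # overall accuracy: 3 of 4

high_stakes = 0.8  # hypothetical threshold for a costly-mistake setting

acc_a = deployed_accuracy(beliefs_a, correct_a, high_stakes)  # 2 of 2 deployed instances correct
acc_b = deployed_accuracy(beliefs_b, correct_b, high_stakes)  # 2 of 3 deployed instances correct
print(acc_a, acc_b)
```

In this toy setup the model with higher overall accuracy ends up with lower accuracy on the instances actually selected for deployment, which is the kind of gap the paper attributes to misalignment with the human generalization function.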

The paper's findings bear on the future development and evaluation of LLMs. By focusing on alignment with the human generalization function, developers can create models whose performance is more predictable and reliable, enhancing trust and utility in practical applications. The work also lays groundwork for interventions that might improve alignment, such as better interaction interfaces or explanatory mechanisms that help users understand LLM capabilities.

This research extends beyond the technical assessment of LLM capabilities, merging aspects of human-computer interaction and cognitive psychology. The paper highlights areas for further exploration, including examining heterogeneity in generalization functions across different user demographics and understanding how various user interfaces influence generalization and subsequent task performance.

In conclusion, the paper provides a framework for evaluating LLM alignment with human expectations and shows why that alignment matters for effective deployment. By comparing LLM capabilities with human generalization, the authors offer insights into model evaluation and deployment strategy, along with directions for future research on making AI systems more useful across diverse applications.