The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers (2404.02806v2)
Abstract: Evaluation of LLMs for code has primarily relied on static benchmarks such as HumanEval (Chen et al., 2021) or, more recently, on human preferences over LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks, or more preferred LLM responses, translate to programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface for measuring the ability of LLMs to assist programmers through either autocomplete or chat support. We conducted a user study (N = 243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Although static benchmarks do not incorporate humans in the loop, we find that improvements in benchmark performance lead to increased programmer productivity; however, gaps in benchmark versus human performance are not proportional, a trend that holds across both forms of LLM support. In contrast, we find that programmers' preferences do not correlate with their actual performance, motivating the need for better proxy signals. We open-source RealHumanEval to enable human-centric evaluation of new models, and we release the study data to facilitate efforts to improve code models.
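To make the abstract's central comparison concrete, below is a minimal Python sketch of the kind of analysis it describes: testing whether per-model benchmark scores track the productivity users achieve with those models. This is not the authors' analysis code; all model scores, the tasks-per-hour metric, and the use of Spearman correlation are illustrative assumptions.

```python
# Minimal sketch (placeholder data, not the study's results): correlate
# per-model static-benchmark scores with a user-study productivity proxy.
from scipy.stats import spearmanr

# Hypothetical pass@1 scores on a static benchmark, one entry per model.
benchmark_pass_at_1 = [0.33, 0.48, 0.57, 0.62, 0.67, 0.71, 0.76]

# Hypothetical productivity proxy from a user study (e.g., tasks
# completed per hour), aligned with the models above.
tasks_per_hour = [2.1, 2.4, 2.9, 2.8, 3.2, 3.1, 3.5]

# Rank correlation asks whether better-benchmarked models tend to yield
# higher productivity, without assuming the gains are proportional.
rho, p_value = spearmanr(benchmark_pass_at_1, tasks_per_hour)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```

Rank correlation fits the abstract's claim that benchmark gains translate to productivity gains without being proportional to them; the paper's actual statistical methodology may differ.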
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Amazon. ML-powered coding companion – Amazon CodeWhisperer, 2022. URL https://aws.amazon.com/codewhisperer/.
- Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Grounded Copilot: How programmers interact with code-generating models. arXiv preprint arXiv:2206.15000, 2022.
- Taking flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue, 20(6):35–57, 2022.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 2023.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132, 2024.
- Conversational challenges in AI-powered data science: Obstacles, needs, and design opportunities. arXiv preprint arXiv:2310.16164, 2023.
- Aligning offline metrics and human judgments of value for code generation models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8516–8528, 2023.
- Large language models of code fail at completing code with potential bugs. arXiv preprint arXiv:2306.03438, 2023.
- AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
- GitHub. GitHub Copilot – your AI pair programmer, 2022. URL https://github.com/features/copilot.
- How do analysts understand and verify AI-assisted data analyses? arXiv preprint arXiv:2309.10947, 2023.
- Sandra G. Hart. NASA-Task Load Index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 50, pages 904–908. Sage Publications, Los Angeles, CA, 2006.
- Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620, 2023.
- SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- How novices use LLM-based code generators to solve CS1 coding tasks in a self-paced learning environment. arXiv preprint arXiv:2309.14049, 2023.
- xCodeEval: A large-scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. arXiv preprint arXiv:2303.03004, 2023.
- DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
- A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets. arXiv preprint arXiv:2305.18486, 2023.
- Can GPT-4 replicate empirical software engineering research? arXiv preprint arXiv:2310.01727, 2023.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.
- CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=6lE4dQXaUcb.
- Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1384–1403, 2022.
- Reading between the lines: Modeling user behavior and costs in AI-assisted programming. arXiv preprint arXiv:2210.14306, 2022.
- Simulating iterative human-AI interaction in programming with LLMs. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
- OctoPack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023.
- CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- OpenAI. ChatGPT: Optimizing language models for dialogue, 2022a. URL https://openai.com/blog/chatgpt/.
- OpenAI. Introducing ChatGPT, 2022b. URL https://openai.com/blog/chatgpt.
- The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590, 2023.
- “It’s weird that it knows what I want”: Usability and interactions with Copilot for novice programmers. ACM Trans. Comput.-Hum. Interact., 31(1), November 2023. ISSN 1073-0516. doi: 10.1145/3617367. URL https://doi.org/10.1145/3617367.
- Replit. Meet Ghostwriter, your partner in code, 2023. URL https://replit.com/site/ghostwriter.
- The programmer’s assistant: Conversational interaction with a large language model for software development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 491–514, 2023.
- Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- An analysis of the automatic bug fixing performance of ChatGPT. arXiv preprint arXiv:2301.08653, 2023.
- Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7, 2022.
- ReCode: Robustness evaluation of code generation models. arXiv preprint arXiv:2212.10264, 2022.
- Is AI the better programming partner? Human-human pair programming vs. human-AI pair programming. arXiv preprint arXiv:2306.05153, 2023.
- DevGPT: Studying developer-ChatGPT conversations. arXiv preprint arXiv:2309.03914, 2023.
- CodeScope: An execution-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation. arXiv preprint arXiv:2311.08588, 2023.
- InterCode: Standardizing and benchmarking interactive coding with execution feedback. arXiv preprint arXiv:2306.14898, 2023.
- Large language models meet NL2Code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464, 2023.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
- XLCoST: A benchmark dataset for cross-lingual code intelligence, 2022. URL https://arxiv.org/abs/2206.08474.
- Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 21–29, 2022.