CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks (2404.00566v4)

Published 31 Mar 2024 in cs.SE and cs.CL

Abstract: To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a LLM to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the Code- SearchNet dataset. To demonstrate the solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide code and data at: https://github.com/yiqingxyq/CodeBenchGen.

Citations (4)

View on Semantic Scholar

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Tweets

https://twitter.com/YiqingXXX/status/1785033603732349160

https://twitter.com/ComputerPapers/status/1775264703025234164

https://twitter.com/ComputerPapers/status/1788566980951224633

https://twitter.com/ComputerPapers/status/1784960843584569580

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks (2404.00566v4)

Summary

Related Papers

Tweets