CodeScore: Evaluating Code Generation by Learning Code Execution

Published 22 Jan 2023 in cs.SE | (2301.09043v4)

Abstract: A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. 1. They primarily measure the surface differences between codes without considering their functional equivalence. However, functional equivalence is pivotal in evaluating the effectiveness of code generation, as different codes can perform identical operations. 2. They are predominantly designed for the Ref-only input format. However, code evaluation necessitates versatility in input formats. Aside from Ref-only, there are NL-only and Ref&NL formats, which existing match-based CEMs cannot effectively accommodate. In this paper, we propose CodeScore, an LLM-based CEM, which estimates the functional correctness of generated code on three input types. To acquire CodeScore, we present UniCE, a unified code generation learning framework, for LLMs to learn code execution (i.e., learning PassRatio and Executability of generated code) with unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore absolutely improves up to 58.87% correlation with functional correctness compared to other CEMs, achieves state-of-the-art performance, and effectively handles three input formats.

Summary

The paper presents CodeScore, an LLM-based metric that estimates the functional correctness of generated code by learning code execution instead of running the code.
It uses the UniCE framework to train models that predict PassRatio and binary Executability across three input formats (Ref-only, NL-only, and Ref&NL).
Experiments show an absolute improvement of up to 58.87% in correlation with functional correctness over other CEMs, at a much lower computational cost than execution-based evaluation.

The paper proposes CodeScore, an evaluation metric for code generation that aims to overcome two limitations of traditional match-based code evaluation metrics (CEMs): they measure surface-level differences, and they are restricted to a single input format. CodeScore uses LLMs to estimate the functional correctness of generated code across three input types, namely Ref-only, NL-only, and Ref&NL. To obtain it, the authors introduce UniCE, a unified code generation learning framework that trains LLMs to predict PassRatio and Executability, in effect learning the outcome of code execution rather than performing it.

Motivation and Challenges

The automatic evaluation of code generation is of substantial interest to both the NLP and software engineering communities. Existing match-based CEMs such as BLEU and CodeBLEU primarily measure lexical overlap and fail to account for functional equivalence, which is essential for code evaluation because different programs can perform identical operations. Furthermore, these metrics are designed for the Ref-only input format, which limits their usefulness when only a natural language description (NL) is available or when a reference and a description are given together.
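The gap between lexical overlap and functional equivalence can be illustrated with a small sketch. The `token_overlap` function below is a crude Jaccard measure used only as a stand-in for match-based metrics; it is not BLEU or CodeBLEU, and the two `total` implementations and the test inputs are hypothetical examples rather than data from the paper.

```python
import re

def tokens(code):
    """Split code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def token_overlap(candidate, reference):
    """Jaccard overlap of token sets: an assumed, simplified stand-in
    for match-based metrics (not an implementation of BLEU)."""
    a, b = set(tokens(candidate)), set(tokens(reference))
    return len(a & b) / len(a | b) if a | b else 1.0

# Two hypothetical programs that compute the same result in different ways.
reference = "def total(xs):\n    return sum(xs)\n"
candidate = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

def run_total(source, arg):
    """Define `total` from source text and call it on one input."""
    namespace = {}
    exec(source, namespace)  # trusted toy code only; sandbox anything untrusted
    return namespace["total"](arg)

inputs = [[], [1, 2, 3], [-5, 5, 10]]
same_behaviour = all(
    run_total(reference, x) == run_total(candidate, x) for x in inputs
)
overlap = token_overlap(candidate, reference)
print(f"token overlap: {overlap:.2f}, same behaviour on tests: {same_behaviour}")
```

The two programs agree on every test input, yet they share fewer than half of their distinct tokens, which is the kind of mismatch that motivates a metric based on functional correctness.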

CodeScore and UniCE Framework

CodeScore, as described in the paper, is an LLM-based metric that estimates the functional correctness of generated code rather than its surface similarity to a reference. The UniCE framework fine-tunes LLMs so that they learn code execution from a unified input that covers all three input formats. The model is trained to predict two quantities for a piece of generated code: PassRatio, the fraction of test cases that the code passes, and Executability, a binary label that distinguishes executable from non-executable code. Across the reported experiments, the approach achieves an absolute improvement of up to 58.87% in correlation with functional correctness compared to other CEMs.
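For reference, the two target quantities can be computed directly when test cases are available. The following is a minimal sketch under stated assumptions, not the authors' labeling pipeline: the `evaluate` helper, the `entry_point` convention, and the rule that code counts as executable only if it loads and runs on every test input without raising are all illustrative choices.

```python
def evaluate(source, entry_point, test_cases):
    """Return (pass_ratio, executability) for one generated program.

    Assumptions (not taken from the paper): the program exposes a function
    named `entry_point`, and it is treated as executable only if it loads
    and runs on every test input without raising an exception.
    """
    namespace = {}
    try:
        # Untrusted generated code should be run in a sandbox in practice.
        exec(compile(source, "<generated>", "exec"), namespace)
        func = namespace[entry_point]
    except Exception:
        return 0.0, 0  # syntax error, load-time error, or missing function

    passed, executable = 0, 1
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            executable = 0  # a runtime error on any input
    pass_ratio = passed / len(test_cases) if test_cases else 0.0
    return pass_ratio, executable

# Hypothetical test cases: (arguments, expected result).
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0), ((10, 5), 15)]

correct = "def add(a, b):\n    return a + b\n"
partial = "def add(a, b):\n    return abs(a) + abs(b)\n"
broken = "def add(a, b)\n    return a + b\n"  # missing colon

for name, src in [("correct", correct), ("partial", partial), ("broken", broken)]:
    print(name, evaluate(src, "add", tests))
```

A learned metric of this kind is trained to predict such values from the code and its context, so that no test cases need to be executed at evaluation time.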

Experimental Validation

Empirical results demonstrate CodeScore's efficacy on three constructed datasets: APPS-Eval, MBPP-Eval, and HE-Eval. CodeScore outperformed both match-based CEMs and LLM-based evaluation metrics (EMs), showing a stronger correlation with functional correctness and a lower mean absolute error. The paper also highlights that CodeScore handles all three input formats. Because it does not need to run test cases, its evaluation is considerably faster and cheaper than that of execution-based CEMs.
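The two kinds of measurement mentioned above, correlation with functional correctness and mean absolute error, can be sketched in plain Python. Spearman rank correlation is used here as one common choice; this summary does not state which correlation coefficients the paper reports, and all scores below are made-up illustrative values.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def mean_absolute_error(predicted, target):
    return mean(abs(p - t) for p, t in zip(predicted, target))

# Hypothetical data: true PassRatio per sample and two metrics' scores.
pass_ratio = [0.0, 0.25, 0.5, 0.75, 1.0]
metric_a = [0.1, 0.2, 0.6, 0.7, 0.9]   # tracks correctness closely
metric_b = [0.8, 0.3, 0.7, 0.2, 0.6]   # mostly unrelated to correctness

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    rho = spearman(scores, pass_ratio)
    mae = mean_absolute_error(scores, pass_ratio)
    print(f"{name}: spearman={rho:.2f}, mae={mae:.2f}")
```

A metric that ranks samples in the same order as their true PassRatio obtains a high rank correlation, while a low mean absolute error additionally requires the scores themselves to be close to the true values.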

Implications and Future Directions

The study points toward code evaluation metrics that are both more accurate and cheaper to compute. More reliable feedback on generated code could in turn support the training of code generation models and help reduce development costs. Future work might extend CodeScore to broader programming scenarios and further improve its efficiency.

In conclusion, CodeScore offers a learned approach to measuring the functional correctness of code, addressing long-standing limitations of match-based CEMs. It supports more holistic and practical code evaluation and provides a basis for further work on AI-driven coding tools.
