InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models (2404.07940v3)
Abstract: Large language models for code (code LLMs) have witnessed tremendous progress in recent years. Alongside this rapid development, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs, with a particular focus on code generation tasks. However, these benchmarks are insufficient to cover the full range of capabilities expected of code LLMs, which span beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with domain experts carefully concretizing the criteria for each question. We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source at https://infi-coder.github.io/infibench and is continuously expanding to foster more scientific and systematic practices for code LLM evaluation.
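As a concrete illustration of what a model-free automatic metric can look like, the sketch below scores a response by weighted keyword matching, one plausible style of criterion for grading freeform code QA. This is a minimal sketch, not InfiBench's actual implementation: the function `keyword_match_score`, the weight scheme, and the case-insensitive matching are illustrative assumptions; in the benchmark itself, domain experts concretize the criteria per question.

```python
import re

def keyword_match_score(response: str, keywords: dict[str, float]) -> float:
    """Weighted keyword matching: the score is the fraction of total
    keyword weight found in the response. Hypothetical helper for
    illustration; real per-question criteria are authored by experts."""
    total = sum(keywords.values())
    if total == 0:
        return 0.0
    hit = sum(
        weight for kw, weight in keywords.items()
        # escape the keyword so API names like "copy()" match literally
        if re.search(re.escape(kw), response, flags=re.IGNORECASE)
    )
    return hit / total

# Example: a rubric for a question about copying Python lists.
rubric = {"copy()": 0.5, "shallow": 0.3, "deepcopy": 0.2}
print(keyword_match_score("Use list.copy(); it makes a shallow copy.", rubric))  # 0.8
```

A metric of this shape stays model-free and cheap to run at scale, which matters when grading freeform answers from more than 100 models.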
- Athiwaratkun, B. et al. Multi-lingual evaluation of code generation models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Bo7eeXm6An8.
- Austin, J. et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Beeching, E. et al. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- Brown, T. et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Chen, M. et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- DeepSeek-AI. DeepSeek Coder: Let the code write itself. https://deepseekcoder.github.io/, 2023.
- Devlin, J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Fan, A. et al. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533, 2023.
- Gemini Team, Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- GitHub. GitHub Copilot - Your AI pair programmer. https://github.com/features/copilot, 2023.
- Hendrycks, D. et al. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
- Hou, X. et al. Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620, 2023.
- Lai, Y. et al. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp. 18319–18345. PMLR, 2023.
- Li, R. et al. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- Li, Y. et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, 2004.
- Liu, J. et al. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
- Luo, Z. et al. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568, 2023.
- Muennighoff, N. et al. OctoPack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023.
- Nijkamp, E. et al. CodeGen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
- OpenAI. GPT-4 technical report. OpenAI, 2023. URL https://cdn.openai.com/papers/gpt-4.pdf.
- Rozière, B. et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Stack Exchange. All sites - Stack Exchange, 2024. URL https://stackexchange.com/sites?view=list#users.
- Touvron, H. et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Zheng, Q. et al. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. In KDD, 2023.