ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code (2311.09835v5)

Published 16 Nov 2023 in cs.CL and cs.AI

Abstract: Despite LLMs like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses annotated 9,641 examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.

Summary

The paper introduces ML-Bench, which assesses LLMs’ ability to generate executable machine learning code leveraging open-source library functions.
It employs comprehensive evaluation metrics, including Pass@k and Parameter Hit Precision, to quantify LLM performance.
Empirical results show GPT-4 outperforms other models yet completes only 39.73% of the tasks, highlighting challenges in practical code generation.

Evaluating LLMs on Machine Learning Tasks Using Open-source Libraries

In recent years, LLMs have made substantial progress in code generation. ML-Bench, a benchmark introduced by Liu et al., targets a different skill from most code benchmarks: it assesses whether LLMs can use existing functions in open-source libraries to accomplish machine learning tasks, rather than generating code from scratch.

ML-Bench Overview

The version of ML-Bench summarized here comprises 10,040 samples spanning 130 tasks across 14 prominent machine learning GitHub repositories (the abstract above, from a later revision, reports 9,641 examples across 18 repositories). The core objective is to evaluate how effectively LLMs can generate executable code that calls existing open-source functions, a practical necessity in real-world programming that traditional code generation benchmarks often overlook.

One notable finding is that GPT-4 outperforms other models such as GPT-3.5 and Claude 2, yet completes only 39.73% of the benchmark tasks. This result leaves substantial room for model improvement and highlights the practical challenges of relying on LLMs for code completion in mixed language-code environments.

Challenges and Approach

The paper identifies two primary challenges for LLMs: comprehending long documents in which natural language and code are interleaved, and handling complex cross-file code structures. The authors address these with ML-Agent, a system built on GPT-4. The agent autonomously navigates codebases, retrieves relevant documentation, and generates executable code, improving on what a standalone LLM achieves.

The authors propose two evaluation metrics, Pass@k and Parameter Hit Precision, to quantify LLM performance in this setting. Both are needed because the generated code must not only execute successfully but also match the parameter details specified in the task instructions.
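The two metrics can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' evaluation code: `pass_at_k` uses the unbiased estimator commonly used for Pass@k (with `n` samples per task, of which `c` pass), and `parameter_hits` is a hypothetical, simplified reading of a parameter-hit check that looks for each required flag/value pair in a generated shell command. The function names, the example command, and the flag names are all made up for the example.

```python
import shlex
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: samples generated for a task, c: samples that passed, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def parameter_hits(command: str, required: dict[str, str]) -> float:
    """Fraction of required flag/value pairs present in the command.

    Hypothetical simplification: accepts "--flag value" and "--flag=value".
    """
    if not required:
        return 1.0
    tokens = shlex.split(command)
    hits = 0
    for flag, value in required.items():
        spaced = any(a == flag and b == value
                     for a, b in zip(tokens, tokens[1:]))
        joined = f"{flag}={value}" in tokens
        hits += int(spaced or joined)
    return hits / len(required)


if __name__ == "__main__":
    # One passing sample out of five, evaluated at k = 1 and k = 5.
    print(pass_at_k(n=5, c=1, k=1))
    print(pass_at_k(n=5, c=1, k=5))
    # Illustrative command and instruction-derived arguments.
    cmd = "python train.py --lr 0.01 --epochs=10"
    print(parameter_hits(cmd, {"--lr": "0.01", "--epochs": "10",
                               "--batch-size": "32"}))
```

A real harness would also have to run the generated script to decide whether a sample passes; the sketch only covers the bookkeeping once pass/fail outcomes are known.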

Empirical Findings

The paper reports detailed experimental results showing that GPT models outperform CodeLlama and other popular LLMs on this benchmark. However, frequent hallucination errors and knowledge gaps in current models point to the need for better comprehension of the supplied context. The findings also show that giving models Oracle Segments (the code snippets relevant to the task) significantly boosts performance compared with supplying the full README or segments retrieved with BM25.
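To make the BM25 retrieval setting concrete, here is a minimal sketch of Okapi BM25 scoring over README segments. It is not the paper's retrieval pipeline: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the function name `bm25_scores`, and the sample segments are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avgdl = sum(len(t) for t in tokenized) / n or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = query.lower().split()
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    # Made-up README segments standing in for a repository's documentation.
    segments = [
        "install dependencies with pip",
        "train the model with --lr and --epochs arguments",
        "license and citation",
    ]
    scores = bm25_scores("train model with --lr 0.01", segments)
    best = max(range(len(segments)), key=scores.__getitem__)
    print(best, segments[best])
```

Lexical scoring of this kind can only surface segments that share tokens with the instruction, which is one plausible reason retrieved segments would trail hand-picked Oracle Segments.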

Implications and Future Directions

This research has two implications. Practically, it can guide the development of LLMs that handle the nuances of real-world codebases more robustly. Theoretically, it steers future research towards models that, like human programmers, build on existing code rather than rewriting it. The authors suggest that closing these gaps could accelerate machine learning automation and streamline programming workflows.

Future research is likely to target more refined retrieval mechanisms and richer domain-specific knowledge bases. This work also opens the way for further exploration of agent-based LLMs that understand the full repository context and interact dynamically with complex code environments.

In conclusion, ML-Bench is a challenging yet practical benchmark for evaluating code generation, one that foregrounds the effective use of existing software libraries in LLM-driven programming tasks.