On the Evaluation of Large Language Models in Unit Test Generation

Published 26 Jun 2024 in cs.SE | (2406.18181v2)

Abstract: Unit testing is an essential activity in software development for verifying the correctness of software components. However, manually writing unit tests is challenging and time-consuming. The emergence of LLMs offers a new direction for automating unit test generation. Existing research primarily focuses on closed-source LLMs (e.g., ChatGPT and CodeX) with fixed prompting strategies, leaving the capabilities of advanced open-source LLMs with various prompting settings unexplored. Particularly, open-source LLMs offer advantages in data privacy protection and have demonstrated superior performance in some tasks. Moreover, effective prompting is crucial for maximizing LLMs' capabilities. In this paper, we conduct the first empirical study to fill this gap, based on 17 Java projects, five widely-used open-source LLMs with different structures and parameter sizes, and comprehensive evaluation metrics. Our findings highlight the significant influence of various prompt factors, show the performance of open-source LLMs compared to the commercial GPT-4 and the traditional Evosuite, and identify limitations in LLM-based unit test generation. We then derive a series of implications from our study to guide future research and practical use of LLM-based unit test generation.


Authors (11)


Summary

The paper demonstrates that prompt design significantly influences LLM performance in generating tests with improved coverage metrics.
The paper finds that larger models like PD-34B achieve higher test coverage than smaller ones, though all LLMs, including GPT-4, still trail the traditional Evosuite because many of their generated tests are syntactically invalid.
The paper reveals that in-context learning techniques, including Chain-of-Thought, do not consistently enhance defect detection, suggesting the need for refined prompt and data strategies.

An Empirical Study of Unit Test Generation with LLMs

The paper evaluates how effectively open-source LLMs generate unit tests. This focus sets it apart from previous studies, which predominantly examined closed-source models such as GPT-3.5, GPT-4, and CodeX; because these models are commercial, using them raises privacy concerns and incurs costs.

Methodology and Research Design

The study uses 17 Java projects from the Defects4J 2.0 benchmark to investigate how well five open-source LLMs, ranging from 7 billion to 34 billion parameters, generate unit tests. The models are drawn from the CodeLlama and DeepSeek-Coder families. The research is organized around four research questions: the impact of prompt design, the performance of open-source LLMs relative to the commercial GPT-4 and the traditional Evosuite tool, the effectiveness of in-context learning (ICL) methods, and the defect detection ability of the generated tests.

The authors evaluate LLM-generated tests with three kinds of metrics: syntactic validity, test coverage (both line and branch), and the number of detected defects (NDD). The experiments consumed approximately 3000 NVIDIA A100 GPU-hours, which indicates the computational scale of the study.
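To make the three metric families concrete, here is a minimal sketch of how per-test outcomes might be aggregated. The record type `TestRunResult`, the function `summarize`, and the aggregation rule (invalid tests contribute zero covered lines and branches) are illustrative assumptions, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class TestRunResult:
    """Outcome for one generated test class (hypothetical record, not from the paper)."""
    compiles: bool          # True if the test is syntactically valid and builds
    lines_covered: int
    lines_total: int
    branches_covered: int
    branches_total: int
    detected_defect: bool   # True if the test fails on the buggy version only


def _ratio(numerator, denominator):
    # Avoid division by zero when there is nothing to measure.
    return numerator / denominator if denominator else 0.0


def summarize(results):
    """Aggregate validity, coverage, and number of detected defects (NDD).

    Assumption: tests that do not compile cover nothing and detect nothing.
    """
    valid = [r for r in results if r.compiles]
    return {
        "syntactic_validity": _ratio(len(valid), len(results)),
        "line_coverage": _ratio(
            sum(r.lines_covered for r in valid),
            sum(r.lines_total for r in results),
        ),
        "branch_coverage": _ratio(
            sum(r.branches_covered for r in valid),
            sum(r.branches_total for r in results),
        ),
        "ndd": sum(1 for r in valid if r.detected_defect),
    }
```

Under this aggregation rule, a low validity rate directly depresses coverage, which is the mechanism the findings attribute to LLM-generated tests.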

Key Findings and Discussions

Prompt Design: The study reveals that prompt design significantly impacts LLM effectiveness in unit test generation. Two factors matter most: the description style of the prompt and the code features it includes (e.g., method parameters, class fields). For some models, a description style that matches the form of their training data yields better results. Prompt length also matters: a prompt that is too long leaves less room for the LLM's output, so balancing the two can increase the number of generated tests and thereby improve test coverage.
Comparative Performance: Performance varies across the open-source LLMs, with the CodeLlama and DeepSeek-Coder models differing in effectiveness. Larger models like PD-34B and DC-33B generally achieve higher test coverage than smaller ones. Nevertheless, all LLM-based approaches, including GPT-4, underperform the traditional Evosuite in coverage metrics. The cause is a high rate of syntactically invalid tests, a consequence of LLMs hallucinating during code generation.
In-Context Learning (ICL) Methods: The study shows that ICL methods, such as Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG), do not consistently enhance unit test generation effectiveness. The CoT method only improved performance in models with strong code comprehension capabilities, while the RAG method was ineffective due to significant mismatches between retrieved and LLM-generated unit tests.
Defect Detection: The defect detection ability of LLM-generated tests is limited. The main limitations are the low validity of generated tests, the absence of the specific inputs needed to trigger defects, and unsuitable assertions. The paper suggests that augmenting test inputs via mutation strategies may improve defect detection.
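The prompt factors discussed under Prompt Design can be illustrated with a small prompt builder. This is a hedged sketch only: the function `build_prompt`, the two style names (`"natural"` and `"code"`), and the prompt wording are hypothetical and do not reproduce the prompts used in the paper.

```python
def build_prompt(focal_method, class_name, style="natural",
                 parameters=None, fields=None):
    """Assemble a test-generation prompt (illustrative, not the paper's template).

    style: "natural" phrases the context as plain sentences,
           "code" phrases it as Java-style line comments.
    parameters / fields: optional code features to include in the prompt.
    """
    features = []
    if parameters:
        features.append("Method parameters: " + ", ".join(parameters))
    if fields:
        features.append("Class fields: " + ", ".join(fields))

    if style == "natural":
        header = f"Write JUnit tests for the following method of class {class_name}."
        context = "\n".join(features)
    elif style == "code":
        header = f"// Unit tests for {class_name}"
        context = "\n".join("// " + feature for feature in features)
    else:
        raise ValueError(f"unknown style: {style!r}")

    # Skip empty sections so the prompt stays as short as possible,
    # leaving more of the context window for generated tests.
    return "\n".join(part for part in (header, context, focal_method) if part)
```

For example, `build_prompt("int add(int a, int b) { return a + b; }", "Calc", style="code", parameters=["int a", "int b"])` yields a comment-style prompt, and switching `style` to `"natural"` changes only the description style while keeping the same code features.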

Implications for Future Research

This work has implications for both research and practice in automated software testing. The study emphasizes the need for further research into prompt strategies tailored to the characteristics of individual LLMs and into how LLMs comprehend code. Moreover, addressing hallucination issues through post-processing strategies could significantly enhance the utility of LLM-generated tests.
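As one illustration of what a post-processing step could look like, the sketch below repairs a generated Java test class whose output was cut off mid-method by dropping the incomplete tail and closing the class. The function `repair_truncated_test` and its brace-counting heuristic are assumptions made for illustration, not a strategy described in the paper; the heuristic ignores braces inside string literals and comments.

```python
def repair_truncated_test(source):
    """Return a brace-balanced version of a generated Java test class, or None.

    Heuristic (an assumption, not the paper's method): keep everything up to
    the last class member that was fully closed, then close the class.
    Braces inside strings or comments are not handled.
    """
    depth = 0
    last_member_end = None
    for index, char in enumerate(source):
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth < 0:
                return None  # more closing than opening braces: give up
            if depth == 1:
                last_member_end = index  # a member of the class just closed
    if depth == 0:
        return source  # already balanced, nothing to repair
    if last_member_end is None:
        return None  # no complete member to salvage
    return source[: last_member_end + 1] + "\n}\n"
```

A real pipeline would more likely confirm validity by compiling the result; this sketch only shows that some invalid outputs can be salvaged rather than discarded.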

The authors suggest that, beyond prompt refinement, enriching the training data specific to unit test generation and possibly applying task-focused supervised fine-tuning (SFT) might fundamentally boost the effectiveness of open-source LLMs. These efforts would complement the construction of high-quality retrieval databases, which are expected to make ICL methods like RAG more effective in software engineering contexts.

Overall, the paper clarifies the capabilities and limitations of existing open-source LLMs in unit test generation and shows that tailored approaches are needed to use them effectively, offering guidance for both future research and practical adoption.
