Evaluating Language Models for Generating and Judging Programming Feedback

(arXiv:2407.04873)
Published Jul 5, 2024 in cs.AI and cs.CY

Abstract

The emergence of LLMs has transformed research and practice in a wide range of domains. Within the computing education research (CER) domain, LLMs have received plenty of attention especially in the context of learning programming. Much of the work on LLMs in CER has however focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments, and in judging the quality of the programming feedback, contrasting the results against proprietary models. Our evaluations on a dataset of students' submissions to Python introductory programming exercises suggest that the state-of-the-art open-source LLMs (Meta's Llama3) are almost on-par with proprietary models (GPT-4o) in both the generation and assessment of programming feedback. We further demonstrate the efficiency of smaller LLMs in the tasks, and highlight that there are a wide range of LLMs that are accessible even for free for educators and practitioners.

Overview

  • The paper assesses the capabilities of proprietary and open-source LLMs in generating high-quality feedback for programming assignments, specifically focusing on explanations of bugs and suggested fixes.

  • It evaluates how well these models can judge the quality of feedback generated by others, comparing the performance of models like GPT-4o and open-source alternatives such as Llama3-70B.

  • The study finds that open-source models, while generally less effective than GPT-4o, show strong potential, especially in educational contexts where cost and adaptability are crucial considerations.

Evaluating Language Models for Generating and Judging Programming Feedback

The research paper titled "Evaluating Language Models for Generating and Judging Programming Feedback" by Charles Koutcheme et al. focuses on assessing the capabilities of both proprietary and open-source LLMs in generating high-quality feedback for programming assignments. Furthermore, the study evaluates the ability of these models to appraise the quality of the feedback generated by others, positioning open-source models as viable alternatives to proprietary models such as GPT-4o.

Objectives and Context

The core objectives of the paper are twofold:

  1. Generation: To evaluate and compare the quality of feedback generated by state-of-the-art open-source and proprietary LLMs, specifically focusing on explanations of bugs and suggested fixes in student programming assignments.
  2. Judging: To assess the extent to which these models can also evaluate the quality of programming feedback produced by other models in comparison to expert human judgment.

The context for this research is rooted in the computing education domain, where providing detailed and timely feedback on programming assignments is crucial but often resource-intensive. With the advent of powerful LLMs, there is a growing interest in automating not just the generation of feedback but also its assessment, thereby alleviating the cognitive load on educators.

Methodology

To address the stated objectives, the authors used a publicly available benchmark dataset containing student submissions for Python programming exercises. The dataset includes incorrect student solutions, ground truth descriptions of bugs, required fixes, and corresponding unit tests.
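
The benchmark's exact schema is not reproduced in this summary, but a record along the following lines captures the fields described above; the field names are illustrative assumptions rather than the dataset's actual column names.

```python
# Illustrative sketch only: field names are assumptions, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class SubmissionRecord:
    exercise_statement: str   # the programming exercise given to students
    incorrect_solution: str   # the buggy Python submission
    bug_description: str      # ground-truth explanation of what is wrong
    fixed_solution: str       # ground-truth repaired program
    unit_tests: str           # tests used to check functional correctness
```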

For feedback generation:

  • Five state-of-the-art open-source models were used: Gemma-2B, Phi-3-mini, Mistral-7B, Llama3-8B, and Llama3-70B.
  • Two proprietary models, GPT-3.5-turbo and GPT-4o, were evaluated for baseline comparison.

Each model was prompted to provide explanations of bugs and suggested fixes. The generated outputs were then manually annotated against criteria including completeness, selectivity, and clarity, and the correctness of the suggested repairs was also assessed.
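
As a rough illustration of the generation step (not the authors' actual prompts or inference setup), the sketch below queries an open-source model served behind an OpenAI-compatible endpoint; the server URL, model name, and prompt wording are assumptions.

```python
# Minimal sketch of the feedback-generation step, assuming an OpenAI-compatible
# endpoint (e.g. a local vLLM server hosting a Llama3 model). Model name, URL,
# and prompt wording are illustrative assumptions, not the paper's protocol.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate_feedback(
    exercise: str,
    buggy_code: str,
    model: str = "meta-llama/Meta-Llama-3-70B-Instruct",
) -> str:
    """Ask the model to explain the bugs in a student submission and suggest fixes."""
    prompt = (
        "You are a programming tutor. A student submitted an incorrect solution.\n\n"
        f"Exercise:\n{exercise}\n\n"
        f"Student submission:\n{buggy_code}\n\n"
        "Explain what is wrong with the submission and suggest how to fix it."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # near-deterministic output eases manual annotation
    )
    return response.choices[0].message.content
```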

For feedback evaluation:

The ability of LLMs to judge feedback quality was assessed in two scenarios; a minimal judging sketch follows this list.

  • Without reference answers: the judge model first generates its own description of the bugs and then evaluates another model's feedback.
  • With reference answers: the judge model is given the ground-truth bug description to guide its evaluation.
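
The sketch below illustrates how a single judging prompt could cover both scenarios; the endpoint, model name, criteria phrasing, and scoring format are assumptions, not the authors' actual rubric.

```python
# Minimal sketch of the judging step; the rubric wording and output format are
# illustrative assumptions, not the paper's evaluation protocol.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge_feedback(
    buggy_code: str,
    feedback: str,
    reference_bug_description: str | None = None,
    model: str = "meta-llama/Meta-Llama-3-70B-Instruct",
) -> str:
    """Ask a judge model whether generated feedback is complete, selective, and clear."""
    if reference_bug_description is not None:
        # "With reference answers" scenario: ground truth is shown to the judge.
        reference_part = f"Ground-truth bug description:\n{reference_bug_description}\n"
    else:
        # "Without reference answers" scenario: the judge forms its own description first.
        reference_part = "First write your own short description of the bug(s), then judge.\n"
    prompt = (
        "You are grading feedback given to a student on a buggy Python submission.\n\n"
        f"Student submission:\n{buggy_code}\n\n"
        f"{reference_part}\n"
        f"Feedback to grade:\n{feedback}\n\n"
        "Answer yes/no for each criterion: complete (mentions all actual bugs), "
        "selective (mentions only actual bugs), clear (understandable to a novice)."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```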

Results

Feedback Generation:

  • Proprietary Model Performance: GPT-4o exhibited superior performance across nearly all criteria. GPT-3.5-turbo was also strong but was surpassed by Llama3-70B on some metrics.
  • Open-Source Model Performance: Notably, Llama3-70B demonstrated performance competitive with GPT-3.5-turbo, especially in generating correct explanations and accurate fixes. In contrast, smaller models such as Gemma-2B showed weaker performance.

Whether models included comprehensible and accurate repair suggestions distinguished their strengths from their areas for improvement, while selectively identifying only the relevant issues remained a notable challenge across all models.

Feedback Judgement:

  • Judging Performance: Proprietary models, particularly GPT-4o, excelled at evaluating feedback quality. Providing ground-truth bug descriptions (the GAG, with-reference scenario) significantly improved the performance of most models, with Llama3-70B outperforming GPT-4o on the criteria of explanation completeness and fix accuracy.
  • Model Agreement: Low kappa scores in the SAG (without-reference) scenario indicated that most models tended to be overly positive, with improved but still moderate reliability when ground truth was available.
  • Ensemble Approach: Interestingly, an ensemble of models did not outperform individual strong models, suggesting judge biases and the need for further refinement of ensemble methods; a sketch of judge-human agreement and majority voting follows this list.
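
As a rough illustration of how judge-human agreement and a majority-vote ensemble could be scored (the paper's exact aggregation and label scheme are not reproduced here), the sketch below computes Cohen's kappa against expert labels for individual judges and for their majority vote; the binary labels and per-model judgements are toy assumptions.

```python
# Sketch of judge-human agreement and a simple majority-vote ensemble.
# Binary labels are assumed (1 = criterion satisfied, 0 = not). Requires scikit-learn.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

human = [1, 0, 1, 1, 0, 1, 0, 1]            # expert annotations (toy data)
judges = {                                   # per-model judgements on the same items (toy data)
    "gpt-4o":     [1, 0, 1, 1, 1, 1, 0, 1],
    "llama3-70b": [1, 1, 1, 1, 0, 1, 0, 1],
    "mistral-7b": [1, 1, 1, 1, 1, 1, 1, 1],  # an overly positive judge drags kappa down
}

# Agreement of each individual judge with the human annotator.
for name, preds in judges.items():
    print(f"{name}: kappa = {cohen_kappa_score(human, preds):.2f}")

# Majority vote across the three judges, item by item.
ensemble = [Counter(col).most_common(1)[0][0] for col in zip(*judges.values())]
print(f"ensemble: kappa = {cohen_kappa_score(human, ensemble):.2f}")
```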

Implications and Future Work

The findings underscore the potential of open-source models in educational contexts, highlighting models like Llama3-70B as strong contenders against proprietary alternatives. This has practical implications, particularly around cost, transparency, and adaptability for institutions with limited resources.

The primary limitation noted in the study is the variance in the selectivity of issues identified by the models. Addressing this through fine-tuning and reinforcement learning approaches could help improve the reliability of AI-generated feedback.

The paper outlines several directions for future research, including large-scale evaluations of open-source models across different programming languages and extending the scope to other types of feedback like next-step hints. Furthermore, the researchers plan to maintain an online leaderboard to keep track of the evolving landscape of LLM performance in educational settings.

Conclusion

This study convincingly demonstrates that open-source LLMs can be nearly as effective as proprietary models like GPT-4o in generating and assessing programming feedback. Such advancements not only democratize access to powerful AI tools but also pave the way for more scalable and efficient educational practices. The ongoing development and refinement of these models hold promise for their enhanced role in computing education and beyond.
