Evaluating Language Models for Generating and Judging Programming Feedback

(arXiv:2407.04873)
Published Jul 5, 2024 in cs.AI and cs.CY

Abstract

The emergence of LLMs has transformed research and practice in a wide range of domains. Within the computing education research (CER) domain, LLMs have received plenty of attention especially in the context of learning programming. Much of the work on LLMs in CER has however focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments, and in judging the quality of the programming feedback, contrasting the results against proprietary models. Our evaluations on a dataset of students' submissions to Python introductory programming exercises suggest that the state-of-the-art open-source LLMs (Meta's Llama3) are almost on-par with proprietary models (GPT-4o) in both the generation and assessment of programming feedback. We further demonstrate the efficiency of smaller LLMs in the tasks, and highlight that there are a wide range of LLMs that are accessible even for free for educators and practitioners.

Overview

  • The paper assesses the capabilities of proprietary and open-source LLMs in generating high-quality feedback for programming assignments, specifically focusing on explanations of bugs and suggested fixes.

  • It evaluates how well these models can judge the quality of feedback generated by others, comparing the performance of models like GPT-4o and open-source alternatives such as Llama3-70B.

  • The study finds that open-source models, while generally less effective than GPT-4o, show strong potential, especially in educational contexts where cost and adaptability are crucial considerations.

Evaluating Language Models for Generating and Judging Programming Feedback

The research paper titled "Evaluating Language Models for Generating and Judging Programming Feedback" by Charles Koutcheme et al. focuses on assessing the capabilities of both proprietary and open-source LLMs in generating high-quality feedback for programming assignments. Furthermore, the study evaluates the ability of these models to appraise the quality of the feedback generated by others, positioning open-source models as viable alternatives to proprietary models such as GPT-4o.

Objectives and Context

The core objectives of the paper are twofold:

  1. Generation: To evaluate and compare the quality of feedback generated by state-of-the-art open-source and proprietary LLMs, specifically focusing on explanations of bugs and suggested fixes in student programming assignments.
  2. Judging: To assess the extent to which these models can also evaluate the quality of programming feedback produced by other models in comparison to expert human judgment.

The context for this research is rooted in the computing education domain, where providing detailed and timely feedback on programming assignments is crucial but often resource-intensive. With the advent of powerful LLMs, there is a growing interest in automating not just the generation of feedback but also its assessment, thereby alleviating the cognitive load on educators.

Methodology

To address the stated objectives, the authors used a publicly available benchmark dataset containing student submissions for Python programming exercises. The dataset includes incorrect student solutions, ground truth descriptions of bugs, required fixes, and corresponding unit tests.
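
The benchmark's exact schema is not reproduced in this summary, but a record along the following lines captures the fields described above; the field names are illustrative assumptions rather than the dataset's actual column names.

```python
# Illustrative sketch only: field names are assumptions, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class SubmissionRecord:
    exercise_statement: str   # the programming exercise given to students
    incorrect_solution: str   # the buggy Python submission
    bug_description: str      # ground-truth explanation of what is wrong
    fixed_solution: str       # ground-truth repaired program
    unit_tests: str           # tests used to check functional correctness
```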

For feedback generation:

  • Five state-of-the-art open-source models were used: Gemma-2B, Phi-3-mini, Mistral-7B, Llama3-8B, and Llama3-70B.
  • Two proprietary models, GPT-3.5-turbo and GPT-4o, were evaluated for baseline comparison.

Each model was prompted to provide explanations of bugs and suggested fixes. The generated outputs were then manually annotated against criteria including completeness, selectivity, and clarity, and the correctness of the suggested repairs was also assessed.
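
As a rough illustration of the generation step (not the authors' actual prompts or inference setup), the sketch below queries an open-source model served behind an OpenAI-compatible endpoint; the server URL, model name, and prompt wording are assumptions.

```python
# Minimal sketch of the feedback-generation step, assuming an OpenAI-compatible
# endpoint (e.g. a local vLLM server hosting a Llama3 model). Model name, URL,
# and prompt wording are illustrative assumptions, not the paper's protocol.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate_feedback(
    exercise: str,
    buggy_code: str,
    model: str = "meta-llama/Meta-Llama-3-70B-Instruct",
) -> str:
    """Ask the model to explain the bugs in a student submission and suggest fixes."""
    prompt = (
        "You are a programming tutor. A student submitted an incorrect solution.\n\n"
        f"Exercise:\n{exercise}\n\n"
        f"Student submission:\n{buggy_code}\n\n"
        "Explain what is wrong with the submission and suggest how to fix it."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # near-deterministic output eases manual annotation
    )
    return response.choices[0].message.content
```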

For feedback evaluation:

The ability of LLMs to judge feedback quality was assessed in two scenarios; a minimal judging sketch follows this list.

  • Without reference answers: the judge model first generates its own description of the bugs and then evaluates another model's feedback.
  • With reference answers: the judge model is given the ground-truth bug description to guide its evaluation.
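
The sketch below illustrates how a single judging prompt could cover both scenarios; the endpoint, model name, criteria phrasing, and scoring format are assumptions, not the authors' actual rubric.

```python
# Minimal sketch of the judging step; the rubric wording and output format are
# illustrative assumptions, not the paper's evaluation protocol.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge_feedback(
    buggy_code: str,
    feedback: str,
    reference_bug_description: str | None = None,
    model: str = "meta-llama/Meta-Llama-3-70B-Instruct",
) -> str:
    """Ask a judge model whether generated feedback is complete, selective, and clear."""
    if reference_bug_description is not None:
        # "With reference answers" scenario: ground truth is shown to the judge.
        reference_part = f"Ground-truth bug description:\n{reference_bug_description}\n"
    else:
        # "Without reference answers" scenario: the judge forms its own description first.
        reference_part = "First write your own short description of the bug(s), then judge.\n"
    prompt = (
        "You are grading feedback given to a student on a buggy Python submission.\n\n"
        f"Student submission:\n{buggy_code}\n\n"
        f"{reference_part}\n"
        f"Feedback to grade:\n{feedback}\n\n"
        "Answer yes/no for each criterion: complete (mentions all actual bugs), "
        "selective (mentions only actual bugs), clear (understandable to a novice)."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```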

Results

Feedback Generation:

  • Proprietary Model Performance: GPT-4o exhibited superior performance across nearly all criteria. GPT-3.5-turbo was also strong but was surpassed by Llama3-70B on some metrics.
  • Open-Source Model Performance: Notably, Llama3-70B demonstrated performance competitive with GPT-3.5-turbo, especially in generating correct explanations and accurate fixes. In contrast, smaller models such as Gemma-2B showed weaker performance.

Whether models included comprehensible and accurate repair suggestions distinguished their strengths from their areas for improvement, while selectively identifying only the relevant issues remained a notable challenge across all models.

Feedback Judgement:

  • Judging Performance: Proprietary models, particularly GPT-4o, excelled at evaluating feedback quality. Providing ground-truth bug descriptions (the GAG, with-reference scenario) significantly improved the performance of most models, with Llama3-70B outperforming GPT-4o on the criteria of explanation completeness and fix accuracy.
  • Model Agreement: Low kappa scores in the SAG (without-reference) scenario indicated that most models tended to be overly positive, with improved but still moderate reliability when ground truth was available.
  • Ensemble Approach: Interestingly, an ensemble of models did not outperform individual strong models, suggesting judge biases and the need for further refinement of ensemble methods; a sketch of judge-human agreement and majority voting follows this list.
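
As a rough illustration of how judge-human agreement and a majority-vote ensemble could be scored (the paper's exact aggregation and label scheme are not reproduced here), the sketch below computes Cohen's kappa against expert labels for individual judges and for their majority vote; the binary labels and per-model judgements are toy assumptions.

```python
# Sketch of judge-human agreement and a simple majority-vote ensemble.
# Binary labels are assumed (1 = criterion satisfied, 0 = not). Requires scikit-learn.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

human = [1, 0, 1, 1, 0, 1, 0, 1]            # expert annotations (toy data)
judges = {                                   # per-model judgements on the same items (toy data)
    "gpt-4o":     [1, 0, 1, 1, 1, 1, 0, 1],
    "llama3-70b": [1, 1, 1, 1, 0, 1, 0, 1],
    "mistral-7b": [1, 1, 1, 1, 1, 1, 1, 1],  # an overly positive judge drags kappa down
}

# Agreement of each individual judge with the human annotator.
for name, preds in judges.items():
    print(f"{name}: kappa = {cohen_kappa_score(human, preds):.2f}")

# Majority vote across the three judges, item by item.
ensemble = [Counter(col).most_common(1)[0][0] for col in zip(*judges.values())]
print(f"ensemble: kappa = {cohen_kappa_score(human, ensemble):.2f}")
```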

Implications and Future Work

The findings underscore the potential of open-source models in educational contexts, highlighting models like Llama3-70B as strong contenders against proprietary alternatives. This has practical implications, particularly around cost, transparency, and adaptability for institutions with limited resources.

The primary limitation noted in the study is the variance in the selectivity of issues identified by the models. Addressing this through fine-tuning and reinforcement learning approaches could help improve the reliability of AI-generated feedback.

The paper outlines several directions for future research, including large-scale evaluations of open-source models across different programming languages and extending the scope to other types of feedback like next-step hints. Furthermore, the researchers plan to maintain an online leaderboard to keep track of the evolving landscape of LLM performance in educational settings.

Conclusion

This study convincingly demonstrates that open-source LLMs can be nearly as effective as proprietary models like GPT-4o in generating and assessing programming feedback. Such advancements not only democratize access to powerful AI tools but also pave the way for more scalable and efficient educational practices. The ongoing development and refinement of these models hold promise for their enhanced role in computing education and beyond.
