Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors (2306.17156v3)

Published 29 Jun 2023 in cs.CY, cs.AI, and cs.CL

Abstract: Generative AI and LLMs hold great promise in enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited for several reasons, as they typically consider already outdated models or only specific scenario(s). Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models for a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions on developing techniques to improve the performance of these models.


Summary

The paper presents a systematic evaluation comparing ChatGPT, GPT-4, and human tutors across six distinct programming education tasks.
It employs five introductory Python problems and custom metrics to assess capabilities in program repair, hint generation, grading feedback, pair programming, contextual explanation, and task synthesis.
Results indicate that while GPT-4 significantly improves problem-solving performance and narrows the gap with human tutors, challenges remain in complex tasks like grading feedback and task synthesis.

Overview of "Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors"

This paper presents a systematic evaluation of generative AI models, specifically OpenAI's ChatGPT (based on GPT-3.5) and GPT-4, in the context of programming education. It benchmarks these state-of-the-art models against human tutors across a diverse set of programming education scenarios. Given the rapid development of AI in educational technologies, the authors argue for a comprehensive and up-to-date benchmarking study that compares these models' capabilities with those of experienced human tutors.

Evaluation Scenarios

The authors focus on six distinct scenarios that capture the varied roles AI can play in programming education:

Program Repair: Evaluating the models' ability to fix buggy programs with minimal changes.
Hint Generation: Assessing how effectively the models can provide hints to facilitate students' problem-solving processes.
Grading Feedback: Testing the capability of models to grade students’ programs against a defined rubric.
Pair Programming: Completing incomplete student programs, mimicking a collaborative programming environment.
Contextualized Explanation: Providing detailed explanations of specific parts of a program in context.
Task Synthesis: Generating new tasks that address specific bugs in students' code.
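To make "minimal changes" in the program repair scenario concrete, one possible measure is a token-level edit distance between the buggy program and its repaired version. The sketch below is only an illustration under that assumption: the helper names (`python_tokens`, `edit_distance`, `repair_distance`) and the small Fibonacci-style example are hypothetical and are not taken from the paper or its dataset.

```python
import io
import tokenize

# Token types that carry layout or comments only; ignored in this sketch.
_SKIP = {
    tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
    tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER,
}


def python_tokens(source: str) -> list[str]:
    """Split Python source into lexical tokens, dropping layout-only tokens."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIP]


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def repair_distance(buggy: str, fixed: str) -> int:
    """Number of token edits needed to turn `buggy` into `fixed`."""
    return edit_distance(python_tokens(buggy), python_tokens(fixed))


# Hypothetical example: the buggy version returns the wrong variable.
BUGGY = """def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return b
"""

FIXED = """def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""

if __name__ == "__main__":
    print(repair_distance(BUGGY, FIXED))
```

A smaller distance means the repair stays closer to the student's original code, which is the property the scenario description emphasizes. Note that this sketch ignores indentation changes, which can matter in Python.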

Methodology

The evaluation of these models is conducted using five introductory Python programming problems, with real-world buggy programs sourced from an online platform. The scenarios are assessed using customized metrics for each setting, combining both quantitative and qualitative evaluations with the help of human experts in Python programming and education.
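As a rough illustration of how expert annotations could be rolled up into a per-scenario score, the sketch below computes the percentage of outputs that annotators marked as acceptable. The record layout, the function name `percent_acceptable`, and the sample values are all assumptions made for illustration; they do not reproduce the paper's metrics or results.

```python
from collections import defaultdict


def percent_acceptable(annotations):
    """Return {scenario: percent of outputs judged acceptable}.

    `annotations` is an iterable of (scenario, accepted) pairs, where
    `accepted` is the binary judgement of a human expert (assumed format).
    """
    counts = defaultdict(lambda: [0, 0])  # scenario -> [accepted, total]
    for scenario, accepted in annotations:
        counts[scenario][0] += int(accepted)
        counts[scenario][1] += 1
    return {name: 100.0 * ok / total for name, (ok, total) in counts.items()}


# Made-up placeholder judgements, NOT data from the paper.
SAMPLE = [
    ("hint generation", True),
    ("hint generation", False),
    ("program repair", True),
]

if __name__ == "__main__":
    for name, score in percent_acceptable(SAMPLE).items():
        print(f"{name}: {score:.1f}%")
```

The same aggregation can be run separately for each method (ChatGPT, GPT-4, human tutors) to put their scores side by side.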

Key Findings

Performance Comparison: GPT-4 improves markedly on ChatGPT and approaches human tutor performance in several scenarios. In program repair and hint generation, for instance, it outperforms ChatGPT and considerably narrows the gap with human tutors.
Strengths of GPT-4: The most notable advance is in problem solving. GPT-4 solves all of the posed programming problems correctly, whereas ChatGPT struggles, particularly on the Fibonacci problem.
Challenges and Limitations: Despite these advances, GPT-4 still lags behind human tutors in more demanding scenarios such as grading feedback and task synthesis, where nuanced understanding and detailed reasoning are crucial. In these scenarios the models tend to misjudge the difficulty and intricacies of the task, an area where human intuition and expertise still dominate.
Consistency Across Problems: The performance of models like GPT-4 is relatively consistent across the varied programming problems, though specific problems like Palindrome and Fibonacci reveal more significant performance gaps compared to human tutors.

Implications and Future Directions

The findings underscore AI's potential to augment programming education significantly, providing scalable, personalized educational support. However, they also highlight that in scenarios requiring deep contextual understanding and complex decision-making, current AI models like GPT-4 still require significant advancements.

Future research directions include exploring modifications and improvements to existing AI models to close these gaps further, assessing open-source AI alternatives, and expanding the scope to include different programming languages and multilingual models. Moreover, there's scope for large-scale studies involving diverse student demographics to validate these findings further.

This benchmarking study is an important step toward understanding the capabilities and limitations of modern AI in educational settings, and it provides a clear framework for future work on AI-driven education technologies.
