
Abstract

This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against work authored solely by students and a mixed category combining student and GPT-4 contributions, in university-level physics coding assignments written in Python. Comparing 50 student submissions to 50 AI-generated submissions across the different categories, each marked blindly by three independent markers, we amassed $n = 300$ data points. Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = $2.482 \times 10^{-10}$). Prompt engineering significantly improved scores for both GPT-4 (p = $1.661 \times 10^{-4}$) and GPT-3.5 (p = $4.967 \times 10^{-9}$). Additionally, the blinded markers were tasked with guessing the authorship of each submission on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They identified authorship accurately: 92.1% of the work categorized as 'Definitely Human' was indeed human-authored. Simplifying this to a binary 'AI' or 'Human' categorization yielded an average accuracy of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.

Figure: Percentage scores across six categories for student, GPT-4, mixed, and GPT-3.5 submissions, ranked.

Overview

  • This study investigates the performance of GPT-3.5 and GPT-4 compared to university students in coding assignments from a physics curriculum, focusing on Python programming.

  • It utilizes a blinded marking system to evaluate and compare the code quality between AI-generated and student submissions, including the impact of prompt engineering on AI performance.

  • The findings show a significant performance gap between students and AI; prompt engineering improves AI scores, while mixed human-AI submissions underperform compared with either student-only or AI-only work.

  • The research highlights the potential and current limitations of LLMs in educational contexts, suggesting the need for further exploration into AI's role in academic assessments and integrity.

A Comparative Analysis of GPT-3.5, GPT-4, and Student Performance in University-Level Coding Assessments

Introduction

This study closely examines the performance disparity between advanced LLMs, specifically GPT-3.5 and GPT-4, and university students in the context of coding assignments. These assignments form a core component of a physics curriculum at Durham University and emphasize Python programming. The inquiry explores the potential of LLMs, both raw and with prompt engineering, against student outputs and a mixed category comprising student and GPT-4 contributions. Results from this comparative analysis could illuminate the evolving capabilities of AI in educational settings, offering insights into the utility, integrity, and future direction of coding assessments.

Methodological Overview

In assessing the effectiveness of LLMs in university-level coding tasks, this study employed a blinded marking mechanism to evaluate code produced by students and by AI. The assignments' emphasis on producing clear, well-labeled plots that elucidate physics scenarios distinguishes this study from prior research focused on AI's ability to solve coding puzzles. The coursework evaluated comes from the Laboratory Skills and Electronics module, with submissions from 103 consenting students forming a comparative base against 50 AI-generated outputs across different categories. The AI submissions were produced through two pathways, a minimal one and one enhanced by prompt engineering, to discern performance variations attributable to these methods.
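
The coursework tasks themselves are not reproduced in this summary, but the kind of plot-centred exercise described above might look something like the following minimal sketch. The projectile-motion scenario and all parameters are illustrative assumptions, not taken from the actual assignments.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative physics scenario: projectile trajectories at several launch angles.
g = 9.81                # gravitational acceleration (m/s^2)
v0 = 20.0               # launch speed (m/s)
angles = [30, 45, 60]   # launch angles in degrees

for angle in angles:
    theta = np.radians(angle)
    t_flight = 2 * v0 * np.sin(theta) / g
    t = np.linspace(0, t_flight, 200)
    x = v0 * np.cos(theta) * t
    y = v0 * np.sin(theta) * t - 0.5 * g * t**2
    plt.plot(x, y, label=f"{angle}°")

# Clear, complete labelling of the kind the markers were asked to assess.
plt.xlabel("Horizontal distance (m)")
plt.ylabel("Height (m)")
plt.title("Projectile trajectories for different launch angles")
plt.legend(title="Launch angle")
plt.grid(True)
plt.show()
```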

Analysis Dimensions

The nuanced approach to comparing human and AI-generated coding submissions involved:

  • Prompt Engineering: Evaluating its influence on AI performance (see the prompt-pathway sketch after this list).
  • Authorship Identification: Analysis of markers' capability to discern the origin of submissions (AI vs. Human).
  • Score Comparison: A statistical examination of performance across various submission categories.
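
On the first point, the study's exact prompts are not given here; the sketch below only illustrates, under assumed wording, how a minimal pathway and a prompt-engineering-enhanced pathway to a chat model might differ. The system-message text is hypothetical and the assignment brief is left as a placeholder; the client usage follows the standard OpenAI chat-completions interface.

```python
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

assignment_text = "..."  # the coursework brief would be pasted here

# Minimal pathway: the assignment text is passed to the model essentially verbatim.
minimal_messages = [
    {"role": "user", "content": assignment_text},
]

# Prompt-engineered pathway (hypothetical wording): the model is given a role,
# explicit output requirements, and a reminder about plot labelling.
engineered_messages = [
    {"role": "system", "content": (
        "You are a physics undergraduate writing well-structured Python. "
        "Produce complete, runnable code with clearly labelled plots, "
        "sensible variable names, and brief comments."
    )},
    {"role": "user", "content": assignment_text},
]

for name, messages in [("minimal", minimal_messages), ("engineered", engineered_messages)]:
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    print(f"--- {name} pathway ---")
    print(response.choices[0].message.content)
```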

Key Findings

Score Disparity and the Impact of Prompt Engineering

The study unveiled a statistically significant performance gap between student submissions, which averaged 91.9%, and the highest-performing AI category (GPT-4 with prompt engineering) at 81.1%. Remarkably, prompt engineering consistently improved AI performance across both GPT-3.5 and GPT-4 models, reinforcing its significance in enhancing LLM output quality. However, mixed submissions, integrating student and GPT-4 efforts, unexpectedly underperformed in comparison to solely AI or student submissions, underscoring the complexity and challenges in merging AI with human efforts.
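
The summary does not state which statistical test produced the quoted p-values. As an illustration of how such a score comparison could be run, the sketch below applies Welch's two-sample t-test from SciPy to synthetic score distributions centred on the reported means; the data, spreads, and choice of test are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative score distributions only; not the study's data.
student_scores = rng.normal(loc=91.9, scale=3.0, size=150)   # e.g. 50 scripts x 3 markers
gpt4_pe_scores = rng.normal(loc=81.1, scale=6.0, size=150)

# Welch's t-test (no equal-variance assumption) as one plausible choice of test.
t_stat, p_value = stats.ttest_ind(student_scores, gpt4_pe_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3e}")

# Standard error of the mean, as quoted alongside the category averages.
print(f"student SE      = {stats.sem(student_scores):.2f}")
print(f"GPT-4 (PE) SE   = {stats.sem(gpt4_pe_scores):.2f}")
```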

Efficacy in Distinguishing Between Human and AI Submissions

Blind markers tasked with identifying the authorship of submissions were largely successful: 92.1% of the work they labelled 'Definitely Human' was indeed human-authored, and under a simplified binary categorization (AI vs. Human) their accuracy averaged 85.3%. This demonstrates an intriguing capability to detect the nuanced differences between AI and human outputs, particularly within the realm of coding assignments in physics.
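
As a concrete illustration of how these accuracy figures could be computed, the sketch below collapses a four-point authorship scale to a binary call and scores it against the true authorship. The intermediate scale labels ('Probably Human', 'Probably AI') and the toy data are assumptions, not the study's records.

```python
import pandas as pd

# Hypothetical marker judgements on a four-point authorship scale.
df = pd.DataFrame({
    "guess": ["Definitely Human", "Probably Human", "Probably AI", "Definitely AI",
              "Definitely Human", "Probably AI"],
    "true_author": ["Human", "Human", "AI", "AI", "AI", "Human"],
})

# Collapse the Likert scale to a binary 'Human' / 'AI' call.
df["binary_guess"] = df["guess"].apply(lambda g: "Human" if "Human" in g else "AI")

# Overall binary accuracy (the study reports 85.3% on its real data).
accuracy = (df["binary_guess"] == df["true_author"]).mean()
print(f"binary accuracy: {accuracy:.1%}")

# Of the work labelled 'Definitely Human', what fraction was actually
# human-authored? (92.1% in the study.)
definitely_human = df[df["guess"] == "Definitely Human"]
print(f"'Definitely Human' precision: {(definitely_human['true_author'] == 'Human').mean():.1%}")
```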

Implications of the Research

The discernible difference in quality between student and AI submissions, alongside the successful identification by markers, suggests that while AI can closely simulate human work quality, subtle distinctions remain detectable. These findings may have immediate applications in the academic sector, especially concerning academic integrity and the customizable implementation of AI as an educational tool. More broadly, they signal a need for continued investigation into the integration of AI in educational settings to maximize benefits while mitigating potential drawbacks.

Future Directions

As AI continues to evolve, so too will its potential impact on educational practices and assessments. This study's insights into the current limitations and capabilities of LLMs in coding assessments could guide the future development of curricula that harmonize traditional educational objectives with the innovative possibilities presented by AI. It also raises pivotal questions about the nature of learning and assessment, encouraging a reassessment of what skills and knowledge are prioritized and evaluated in our rapidly changing digital landscape.

Conclusion

In summary, this analysis offers a timely exploration into the intersections of AI capabilities with university-level education, specifically within the coding discipline of a physics degree. While AI, particularly GPT-4 with prompt engineering, demonstrates considerable prowess in approaching the task quality of university students, distinct differences, especially in creative output aspects, remain. These nuances not only highlight the current state of AI in educational settings but also chart a course for integrating these technologies in a manner that enhances learning outcomes and academic integrity. As such, the study functions as both a benchmark for current AI performance in education and a beacon for future explorations into this dynamic field.
