Applying Large Language Models and Chain-of-Thought for Automatic Scoring

(2312.03748)
Published Nov 30, 2023 in cs.CL and cs.AI

Abstract

This study investigates the application of LLMs, specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment tasks (three binomial and three trinomial) with 1,650 student responses, we employed six prompt engineering strategies to automatically score student responses. The six strategies combined zero-shot or few-shot learning with CoT, either alone or alongside item stems and scoring rubrics. Results indicated that few-shot learning (acc = .67) outperformed zero-shot learning (acc = .60), a 12.6% increase. CoT, when used without item stems and scoring rubrics, did not significantly affect scoring accuracy (acc = .60). However, CoT prompting paired with contextual item stems and rubrics proved to be a significant contributor to scoring accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). We found more balanced accuracy across different proficiency categories when CoT was used with a scoring rubric, highlighting the importance of domain-specific reasoning in enhancing the effectiveness of LLMs in scoring tasks. We also found that GPT-4 demonstrated superior performance over GPT-3.5 in various scoring tasks when combined with the single-call greedy sampling or ensemble voting nucleus sampling strategy, showing an 8.64% difference. In particular, the single-call greedy sampling strategy with GPT-4 outperformed the other approaches.

Overview

  • The paper investigates the use of LLMs like GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting for automatic scoring in education.

  • LLM performance was compared on a dataset of student responses using a novel Prompt Engineering for Automatic Scoring (PPEAS) methodology.

  • Few-shot learning with CoT prompts improved scoring accuracy when combined with contextual instructions and scoring rubrics.

  • GPT-4 outperformed GPT-3.5, and a single-call strategy with GPT-4 was more effective than ensemble voting strategies.

  • The study highlights the potential of LLMs and CoT for delivering accurate and interpretable assessments, suggesting their practical application in educational settings.

Introduction

The implementation of artificial intelligence in the education sector is transforming the ways in which teachers assess student learning. Automatic scoring systems, particularly within the field of science education, have gained traction as they provide immediate feedback to students, thereby significantly enhancing the learning environment. Though the potential of AI systems is clear, their adoption has been hindered by challenges such as accessibility, technical complexity, and a lack of transparency in how such systems reach their conclusions. Within this context, this research explores the application of LLMs, specifically GPT-3.5 and GPT-4, in conjunction with Chain-of-Thought (CoT) prompting to address these challenges.

Literature Review and Background

Automatic scoring of student responses has been largely based on traditional machine learning and natural language processing techniques. These methods demand substantial data collection and manual scoring by experts to train the assessment models. The advent of pre-trained language models like BERT and SciEdBERT brought significant advances, particularly in their natural language understanding capabilities. Leveraging these pre-trained models, researchers have explored various techniques, including prompt engineering, to minimize the need for extensive training data. However, the full potential of LLMs, particularly their ability to provide domain-specific reasoning and transparent outcomes in the context of educational scoring, remains largely unexplored.

Methodology

In a novel approach, the researchers crafted various prompt engineering strategies that combined zero-shot or few-shot learning with CoT prompts to facilitate domain-specific reasoning in LLMs. To test the efficacy of these strategies, a dataset comprising 1,650 student responses to science assessment tasks was employed. The paper introduces a systematic approach, Prompt Engineering for Automatic Scoring (PPEAS), which iteratively refines the prompt generation process, integrating expert feedback and validation. The performance of the LLMs was then compared under different conditions to determine which models and strategies yield the best scoring accuracy.
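To make the prompt structure concrete, the sketch below (a minimal illustration, not the authors' actual PPEAS prompts) shows one way a few-shot CoT scoring prompt could combine an item stem, a scoring rubric, and scored examples before a single call to GPT-4 via the OpenAI chat API. The helper names and the "Score: <label>" output convention are assumptions introduced here.

```python
# Illustrative sketch only: not the authors' actual PPEAS prompt, but a minimal
# example of combining few-shot examples, CoT instructions, an item stem, and a
# scoring rubric into a single prompt for the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_scoring_prompt(item_stem, rubric, examples, student_response):
    """Assemble a few-shot CoT scoring prompt with contextual information."""
    parts = [
        "You are scoring a student's written response to a science assessment item.",
        f"Item stem:\n{item_stem}",
        f"Scoring rubric:\n{rubric}",
    ]
    # Few-shot examples: each is a dict with 'response', 'reasoning', 'score'.
    for ex in examples:
        parts.append(
            f"Student response: {ex['response']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Score: {ex['score']}"
        )
    parts.append(
        f"Student response: {student_response}\n"
        "Reason step by step against the rubric, then give the final score "
        "on its own line as 'Score: <label>'."
    )
    return "\n\n".join(parts)


def score_response(prompt, model="gpt-4"):
    """Single call with temperature 0 for deterministic, greedy-style scoring."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content
```

Dropping the few-shot examples, item stem, or rubric from such a prompt yields the leaner zero-shot and context-free variants that the paper contrasts with this fully contextualized setup.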

Findings and Implications

The study found that few-shot learning consistently outperformed zero-shot learning, and that CoT prompting significantly improved scoring accuracy when paired with rich contextual instructions and scoring rubrics. Moreover, GPT-4 exhibited superior performance over GPT-3.5. Interestingly, a single-call strategy with GPT-4 was more effective than ensemble voting strategies, hinting at GPT-4's enhanced reasoning capacity. The research underscores how CoT, particularly when grounded in contextual cues such as item stems and rubrics, elevates the scoring precision of LLMs.
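For concreteness, the sketch below contrasts the two decoding strategies compared in the paper: a single deterministic call (temperature 0, approximating greedy sampling) versus several nucleus-sampled completions aggregated by majority vote. The temperature, top_p value, number of votes, and score-extraction convention are assumptions for illustration rather than the paper's reported settings.

```python
# Minimal sketch of the two decoding strategies: one deterministic call versus
# an ensemble of nucleus-sampled calls with majority voting. Parameter values
# and the 'Score:' output convention are illustrative assumptions.
from collections import Counter

from openai import OpenAI

client = OpenAI()


def extract_score(text):
    """Pull the label from a trailing 'Score: <label>' line (format assumed here)."""
    for line in reversed(text.strip().splitlines()):
        if line.lower().startswith("score:"):
            return line.split(":", 1)[1].strip()
    return None


def greedy_single_call(prompt, model="gpt-4"):
    """One call at temperature 0, approximating greedy decoding."""
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return extract_score(out.choices[0].message.content)


def ensemble_vote(prompt, model="gpt-4", n_votes=5):
    """n_votes nucleus-sampled completions, aggregated by majority vote."""
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # assumed sampling temperature
        top_p=0.9,        # nucleus sampling cutoff (assumed)
        n=n_votes,
    )
    votes = [extract_score(choice.message.content) for choice in out.choices]
    return Counter(votes).most_common(1)[0][0]
```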

In conclusion, the integration of LLMs and CoT within automatic scoring demonstrates the potential of these models to render precise, timely, and transparent assessments. The enhanced accuracy and the propensity of the LLMs to provide domain-specific reasoning while generating interpretable scores hold promise not only for research but also for practical applications in educational settings. Thus, the adoption of LLMs could spur significant advancements in the realm of education, rendering sophisticated AI tools both accessible and comprehensible for educators and learners alike.
