
Abstract

LLMs have been reported to outperform existing automatic evaluation metrics in some tasks, such as text summarization and machine translation. However, there has been a lack of research on LLMs as evaluators in grammatical error correction (GEC). In this study, we investigate the performance of LLMs in GEC evaluation by employing prompts designed to incorporate various evaluation criteria inspired by previous research. Our extensive experimental results demonstrate that GPT-4 achieved a Kendall rank correlation of 0.662 with human judgments, surpassing all existing methods. Furthermore, in the evaluation of recent GEC systems, we underscore the significance of LLM scale and, in particular, emphasize the importance of fluency among the evaluation criteria.

Figure: Evaluation framework employing large language models (LLMs).

Overview

  • The study investigates the use of LLMs like GPT-4 in evaluating Grammatical Error Correction (GEC), marking a novel approach in NLP.

  • GPT-4 notably outperforms existing evaluation metrics for GEC, showing a high Kendall's rank correlation with human judgments, particularly when fluency is emphasized in the evaluation criteria.

  • The research highlights the importance of LLM scale, with larger models achieving better performance in GEC evaluation, and suggests tailored prompts improve assessment accuracy.

  • It advocates for a shift towards using LLMs, especially for fluency-focused evaluations, and proposes future research directions including few-shot learning impacts and prompt engineering.

Evaluating Grammatical Error Correction Using LLMs

Introduction

The utilization of LLMs for evaluating Grammatical Error Correction (GEC) represents an emergent area of interest within NLP. While LLMs, such as GPT-4, have demonstrated remarkable performance across various tasks including text summarization and machine translation, their application in GEC evaluation has been relatively unexplored. This study presents a pioneering investigation into the efficacy of LLMs in GEC evaluation, leveraging prompts designed to capture a range of evaluation criteria. The results highlight GPT-4’s superior performance, achieving a Kendall's rank correlation of 0.662 with human judgments, thereby outperforming existing metrics. This study also draws attention to the critical role of fluency within evaluation criteria and the significance of LLM scale in performance outcomes.

Experiment Setup

Considered Metrics

Metrics for GEC evaluation can be categorized into Edit-Based Metrics (EBMs) and Sentence-Based Metrics (SBMs). EBMs focus on the edits made to correct a sentence, while SBMs assess the overall quality of the corrected sentences. This study considers various metrics within these categories, such as ERRANT and GECToR for EBMs, and GLEU and IMPARA for SBMs.
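
As a rough illustration of the edit-based family, the sketch below computes an ERRANT-style F0.5 score by comparing a system's proposed edits against a set of reference edits. The edit sets are hypothetical inputs here; real metrics such as ERRANT extract and classify the edits automatically from the source and corrected sentences.

```python
# Minimal sketch of an edit-based GEC score (ERRANT-style F0.5).
# Edits are represented as (start, end, replacement) spans over the source;
# real edit-based metrics extract these automatically from the sentences.

def f_beta(hyp_edits: set, ref_edits: set, beta: float = 0.5) -> float:
    """Precision-weighted F-score over proposed vs. reference edit sets."""
    tp = len(hyp_edits & ref_edits)           # edits matching a reference edit
    precision = tp / len(hyp_edits) if hyp_edits else 1.0
    recall = tp / len(ref_edits) if ref_edits else 1.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical example: source "He go to school yesterday ."
ref_edits = {(1, 2, "went")}                   # reference fixes "go" -> "went"
hyp_edits = {(1, 2, "went"), (4, 5, "today")}  # system also makes a spurious edit
print(f"F0.5 = {f_beta(hyp_edits, ref_edits):.3f}")
```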

LLMs and Prompts

Three LLMs were evaluated: Llama 2, GPT-3.5, and GPT-4, using prompts that address both overall sentence quality and specific edits. The prompts emphasized different evaluation criteria, allowing the study to observe their impact on evaluation performance.
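
The paper's exact prompts are not reproduced here; the sketch below shows one plausible fluency-oriented scoring prompt through the OpenAI chat API, with the prompt wording, the 1-5 scale, and the model name treated as illustrative assumptions rather than the study's actual setup.

```python
# Hedged sketch of LLM-based sentence scoring for GEC evaluation.
# Prompt wording and the 1-5 scale are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are evaluating a grammatical error correction system.\n"
    "Source sentence: {src}\n"
    "Corrected sentence: {hyp}\n"
    "Rate the correction from 1 (poor) to 5 (excellent), prioritizing fluency "
    "as well as grammaticality and meaning preservation. Reply with the number only."
)

def llm_score(src: str, hyp: str, model: str = "gpt-4") -> int:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(src=src, hyp=hyp)}],
        temperature=0,  # deterministic scoring
    )
    return int(response.choices[0].message.content.strip())

print(llm_score("He go to school yesterday .", "He went to school yesterday ."))
```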

Results

System-Level Analysis

At the system level, GPT-4 consistently demonstrated superior performance, showing high correlations with human judgments. Prompts tailored to specific evaluation criteria generally enhanced performance, suggesting that GPT-4 can extract meaningful insights from these criteria. The reduction in performance with smaller LLMs reinforced the importance of LLM scale.
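
System-level meta-evaluation typically correlates each metric's per-system scores with human judgments of the same systems. A minimal sketch using SciPy's Pearson and Spearman correlations follows; the system names and scores are made up for illustration.

```python
# System-level meta-evaluation sketch: correlate metric scores with human scores.
# System names and scores are invented for illustration.
from scipy.stats import pearsonr, spearmanr

human_scores  = {"sys_A": 0.72, "sys_B": 0.65, "sys_C": 0.58, "sys_D": 0.41}
metric_scores = {"sys_A": 61.3, "sys_B": 60.8, "sys_C": 55.2, "sys_D": 49.7}

systems = sorted(human_scores)
h = [human_scores[s] for s in systems]
m = [metric_scores[s] for s in systems]

pearson_r, _ = pearsonr(h, m)
spearman_rho, _ = spearmanr(h, m)
print("Pearson r:", pearson_r)
print("Spearman rho:", spearman_rho)
```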

Sentence-Level Analysis

At the sentence level, the analysis revealed disparities among the metrics' correlations that were not evident at the system level. GPT-4, especially with fluency-focused prompts, achieved state-of-the-art performance, underscoring the need to prioritize fluency when evaluating high-quality corrections.
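
Sentence-level meta-evaluation is commonly done by comparing, for each source sentence, the metric's ranking of the competing system outputs against the human ranking, and averaging a rank correlation such as Kendall's tau over sentences. The sketch below uses invented rankings to show the procedure.

```python
# Sentence-level meta-evaluation sketch: per-sentence Kendall's tau between
# human and metric rankings of system outputs, averaged over sentences.
# The rankings below are invented for illustration.
from scipy.stats import kendalltau

# Each entry pairs human ranks with metric ranks over the same system outputs.
per_sentence_rankings = [
    ([1, 2, 3, 4], [1, 3, 2, 4]),
    ([2, 1, 4, 3], [1, 2, 4, 3]),
    ([1, 3, 2, 4], [1, 2, 3, 4]),
]

taus = []
for human, metric in per_sentence_rankings:
    tau, _ = kendalltau(human, metric)
    taus.append(tau)

print("Mean sentence-level Kendall tau:", sum(taus) / len(taus))
```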

Further Analysis

A window analysis comparing sets of systems with similar performance revealed GPT-4's robust evaluation capabilities, particularly emphasizing the role of fluency. In contrast, conventional metrics frequently showed no correlation or even negative correlation within these windows, exposing their limitations in GEC evaluation.
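
One way to picture such a window analysis, not necessarily the paper's exact protocol, is to sort systems by human score, slide a fixed-size window over them, and check whether the metric still ranks the near-tied systems inside each window correctly. The sketch below uses hypothetical systems and scores.

```python
# Sketch of a window analysis: sort systems by human score, slide a fixed-size
# window over them, and compute the metric-human correlation inside each window.
# System names and scores are hypothetical.
from scipy.stats import kendalltau

def window_analysis(human: dict, metric: dict, window: int = 4):
    ranked = sorted(human, key=human.get, reverse=True)  # best system first
    results = []
    for i in range(len(ranked) - window + 1):
        group = ranked[i:i + window]
        tau, _ = kendalltau([human[s] for s in group],
                            [metric[s] for s in group])
        results.append((group, tau))
    return results

human  = {"A": 0.80, "B": 0.78, "C": 0.75, "D": 0.74, "E": 0.60, "F": 0.55}
metric = {"A": 62.0, "B": 63.1, "C": 60.5, "D": 61.0, "E": 54.2, "F": 50.3}

for group, tau in window_analysis(human, metric):
    print(group, f"tau = {tau:.2f}")
```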

Implications and Future Work

This study underscores the potential of LLMs, especially GPT-4, as potent evaluators for grammatical error correction, surpassing traditional metrics in correlation with human judgments. The findings advocate for prioritizing fluency among evaluation criteria and suggest that sentence-level correlations, or comparisons among systems of similar quality, are needed to overcome performance saturation in system-level meta-evaluation.

Future research avenues may explore few-shot learning impacts and refine prompt engineering for enhanced evaluation performance. Additionally, extending evaluation to document-level considerations may yield further insights, given the expansion of context windows in LLMs.

Conclusion

LLMs, particularly GPT-4, exhibit promising capabilities as evaluators in grammatical error correction, offering nuanced insights over traditional metrics. This investigation into LLM-based GEC evaluation not only highlights the paramountcy of fluency and the scale of LLMs but also sets the stage for advanced research into optimizing evaluation methods in the continually evolving landscape of NLP.
