An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment (2403.04963v4)

Published 8 Mar 2024 in cs.CL and cs.AI

Abstract: Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs' simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs' simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's and Qwen2.5-72B's struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.

Summary

The paper introduces an error-based human annotation framework to reliably assess sentence simplification quality by identifying specific error types.
The paper demonstrates that GPT-4 outperforms Control-T5 by reducing errors and better preserving the original meaning and fluency.
The paper highlights the limitations of traditional automatic metrics like BLEU and FKGL, underscoring the need for more refined evaluation methods.

Implementation and Evaluation of LLMs in Sentence Simplification

The paper "An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment" (2403.04963) evaluates how well LLMs, GPT-4 in particular, simplify sentences, using error-based human assessment to address the limitations of existing evaluation methodologies.

Introduction

The paper addresses the growing importance of sentence simplification, driven by advances in LLMs such as GPT-4, for making text more accessible to individuals with reading difficulties. Existing evaluation methods have two weaknesses: the suitability of current automatic metrics for LLM output is uncertain, and human evaluation approaches are either too superficial to give a clear picture of model performance or so detailed that annotation becomes complex and inconsistent. To mitigate these problems, the paper proposes an error-based annotation framework that emphasizes reliability and interpretability in human assessment of LLM simplification capabilities.

Methodology

Error-based Human Annotation Framework

The framework designed in this paper focuses on evaluating simplification outcomes using a set of identified error types that affect readability and comprehension. These error types include:

Lack of Simplicity (Lexical/Structural): Introducing more complex expressions or structures.
Altered Meaning (Lexical/Structural): Deviations that alter the original meaning.
Coreference Issues: Misuse of pronouns or unclear references.
Repetition and Hallucination: Unnecessary duplication or insertion of unrelated information.

This categorization is intended to align with human intuition, facilitating evaluations without requiring extensive linguistic expertise.
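As a rough illustration of how annotations under such a framework could be stored and aggregated, the following Python sketch defines the error categories listed above as an enum and computes, per system, the share of outputs flagged with at least one error. The class names, field names, and toy records are assumptions made for this example and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class ErrorType(Enum):
    # Categories mirror the list above; the identifiers are illustrative.
    LACK_OF_SIMPLICITY_LEXICAL = "lack of simplicity (lexical)"
    LACK_OF_SIMPLICITY_STRUCTURAL = "lack of simplicity (structural)"
    ALTERED_MEANING_LEXICAL = "altered meaning (lexical)"
    ALTERED_MEANING_STRUCTURAL = "altered meaning (structural)"
    COREFERENCE = "coreference"
    REPETITION = "repetition"
    HALLUCINATION = "hallucination"


@dataclass
class Annotation:
    """One annotated simplification output (hypothetical record layout)."""
    system: str
    source: str
    output: str
    errors: list = field(default_factory=list)  # list of ErrorType


def erroneous_output_rate(annotations):
    """Fraction of outputs per system that contain at least one error."""
    totals, flagged = Counter(), Counter()
    for ann in annotations:
        totals[ann.system] += 1
        if ann.errors:
            flagged[ann.system] += 1
    return {system: flagged[system] / totals[system] for system in totals}


def error_type_counts(annotations):
    """Number of outputs per (system, error type) pair."""
    counts = Counter()
    for ann in annotations:
        for err in set(ann.errors):  # count each type once per output
            counts[(ann.system, err)] += 1
    return counts


# Toy records, invented purely to exercise the functions.
toy = [
    Annotation("system_a", "src 1", "out 1"),
    Annotation("system_a", "src 2", "out 2", [ErrorType.ALTERED_MEANING_LEXICAL]),
    Annotation("system_b", "src 1", "out 1",
               [ErrorType.HALLUCINATION, ErrorType.COREFERENCE]),
    Annotation("system_b", "src 2", "out 2", [ErrorType.REPETITION]),
]
rates = erroneous_output_rate(toy)
type_counts = error_type_counts(toy)
```

Keeping the error labels as a closed enum means an annotator can only choose from the agreed categories, which is one simple way to support the consistency the framework aims for.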

Meta-Evaluation of Automatic Metrics

The paper also evaluates popular automatic metrics such as SARI, BLEU, and FKGL alongside newer metrics like BERTScore and LENS. A key aim is to ascertain their effectiveness in evaluating high-quality simplifications produced by models like GPT-4.
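To make one of these metrics concrete, here is a minimal sketch of FKGL (Flesch-Kincaid Grade Level) using the standard formula 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The tokenization and the vowel-group syllable counter are crude assumptions for illustration; established implementations use more careful rules, so exact scores will differ.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups and drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text):
    """Flesch-Kincaid Grade Level with naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


# Invented example sentences: the second drops content yet scores as "simpler".
original = "The feline reposed upon the rug."
truncated = "The cat sat."
grade_original = fkgl(original)
grade_truncated = fkgl(truncated)
```

Because the formula only looks at sentence length and syllable counts, an output that deletes or changes information can still receive a lower (better) grade, which is consistent with the limitations of FKGL reported in the paper.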

Results

Performance of GPT-4 vs Control-T5

The paper reports that GPT-4 consistently produces fewer errors than Control-T5 across multiple datasets, and is notably better at preserving the original sentence's meaning while improving fluency and simplicity. Lexical paraphrasing remains a limitation for GPT-4. The meta-evaluation of automatic metrics shows that SARI aligns well with human evaluations on certain datasets, whereas BLEU and FKGL show substantial limitations and do not provide a reliable measure of simplification quality.
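One common way to check how well a metric aligns with human judgments is to correlate its scores with the human labels. The sketch below assumes a binary human label per output (error present or not) and uses the point-biserial correlation, which is Pearson's r with one 0/1 variable. This summary does not state which statistic the paper uses, so both the choice of statistic and the toy scores are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def point_biserial(metric_scores, has_error):
    """Correlation between metric scores and binary human error labels."""
    return pearson(metric_scores, [1.0 if e else 0.0 for e in has_error])


# Invented scores and labels: a metric that tracks human judgments well
# should give lower scores to outputs that annotators marked as erroneous.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [False, False, True, True]
r = point_biserial(scores, labels)
```

A strongly negative value here means the metric penalizes the outputs that humans flagged. A value near zero would indicate that the metric cannot separate erroneous from error-free outputs.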

Metrics Sensitivity and Evaluation

The findings indicate that existing metrics have limited ability to distinguish among the high-quality outputs of advanced LLMs. They are not sensitive enough to capture fine-grained quality differences, particularly for state-of-the-art models like GPT-4, which underlines the need for more refined evaluation frameworks in future research.

Conclusion

The paper concludes that GPT-4 outperforms previous state-of-the-art systems in sentence simplification, although it still struggles with lexical paraphrasing. It also emphasizes that both traditional and more recent automatic metrics are inadequate for measuring the simplification quality of advanced LLMs. Future work could address the lexical paraphrasing issues and develop more sensitive evaluation metrics tailored to LLM capabilities, which would support the use of such models for improving text accessibility.