- The paper presents an innovative multi-dimensional scoring system for essays by combining fine-tuning and regression techniques.
- It fine-tunes models like RoBERTa and DistilBERT to provide detailed evaluations across dimensions such as vocabulary, grammar, and coherence.
- Experimental results demonstrate robust performance across diverse datasets, enhancing the reliability of automatic essay scoring.
Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression
Introduction
The paper presents a novel approach to Automated Essay Scoring (AES) that advances traditional methodologies by combining fine-tuning of pre-trained language models with multiple regression. Traditional AES systems produce a single holistic score, which does not meet the multidimensional feedback needs of second-language (L2) learners and educators. The proposed system, termed "Automatic Essay Multi-dimensional Scoring" (AEMS), instead scores essays across multiple dimensions such as vocabulary, grammar, and coherence.
Methodology
The methodology centers on fine-tuning pre-trained language models for multidimensional essay scoring, using a two-pronged approach that combines classification and regression:
- Fine-tuning Pre-trained Language Models: The authors adapted RoBERTa and DistilBERT, fine-tuning them on essay datasets to obtain domain-specific scoring classifiers.
- Multiple Regression Head: A regression head was added to the BERT-based classifiers to predict continuous scores for each essay dimension, so the architecture handles multi-class classification and regression simultaneously (a hedged sketch of such an architecture follows this list).
- Contrastive Learning: Incorporating additional context such as essay requirements and prompts improved the models' contextual understanding and scoring accuracy (a hedged sketch of one possible contrastive objective also follows this list).
- Datasets: Two datasets were used, ensuring diversity in the training data and strengthening the robustness and applicability of the AEMS models:
  - The English Language Learner Insight, Proficiency, and Skills Evaluation (ELLIPSE) corpus.
  - The International English Language Testing System (IELTS) dataset.
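The paper's implementation is not included in this summary, but the architecture described above can be illustrated with a minimal PyTorch/Hugging Face sketch: a shared RoBERTa encoder feeding per-dimension classification heads alongside a single regression head, trained jointly with cross-entropy and MSE losses. The model name, the six ELLIPSE-style dimensions, the nine score bands (1.0–5.0 in 0.5 steps), the prompt-plus-essay input encoding, and the loss weighting are all assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch (not the authors' code): a shared RoBERTa encoder with
# per-dimension classification heads plus one regression head, trained
# jointly for multi-dimensional essay scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumed dimensions (the six ELLIPSE analytic dimensions).
DIMENSIONS = ["cohesion", "syntax", "vocabulary",
              "phraseology", "grammar", "conventions"]
NUM_BANDS = 9  # assumed: scores 1.0-5.0 in 0.5 steps

class MultiDimScorer(nn.Module):
    def __init__(self, base_model="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        # One multi-class head per dimension (discrete score bands).
        self.cls_heads = nn.ModuleList(
            [nn.Linear(hidden, NUM_BANDS) for _ in DIMENSIONS])
        # One regression head predicting all continuous dimension scores.
        self.reg_head = nn.Linear(hidden, len(DIMENSIONS))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # <s>-token representation
        logits = [head(pooled) for head in self.cls_heads]
        scores = self.reg_head(pooled)
        return logits, scores

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = MultiDimScorer()

# Encoding the prompt together with the essay as a text pair is one
# plausible way to inject the "essay requirements" context mentioned above.
batch = tokenizer(["Describe a challenge you overcame."],      # hypothetical prompt
                  ["Last year I moved to a new country ..."],  # hypothetical essay
                  truncation=True, padding=True, return_tensors="pt")
logits, scores = model(batch["input_ids"], batch["attention_mask"])

# Joint loss: cross-entropy on discrete bands plus MSE on continuous
# scores (the 0.5 mixing weight is an assumption).
band_labels = torch.randint(0, NUM_BANDS, (1, len(DIMENSIONS)))  # fake gold bands
cont_labels = torch.rand(1, len(DIMENSIONS)) * 4 + 1             # fake gold 1-5
cls_loss = sum(F.cross_entropy(logits[d], band_labels[:, d])
               for d in range(len(DIMENSIONS)))
loss = cls_loss + 0.5 * F.mse_loss(scores, cont_labels)
loss.backward()
```

A single shared encoder keeps the per-dimension heads cheap, and joint training lets the discrete and continuous views of each dimension regularize one another.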
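The paper's exact contrastive formulation is not given here, so the following is only one plausible reading: an InfoNCE-style loss over a batch in which each essay embedding is pulled toward the embedding of its own prompt/requirements text and pushed away from the other prompts in the batch. The temperature and embedding size are placeholders.

```python
# Hedged sketch of a contrastive objective pairing essays with their prompts
# (InfoNCE-style; the paper does not specify its exact formulation).
import torch
import torch.nn.functional as F

def info_nce(essay_emb, prompt_emb, temperature=0.07):
    essay_emb = F.normalize(essay_emb, dim=-1)
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    sims = essay_emb @ prompt_emb.T / temperature  # batch x batch similarities
    targets = torch.arange(essay_emb.size(0))      # i-th essay matches i-th prompt
    return F.cross_entropy(sims, targets)

# Placeholder embeddings; in practice these would be pooled encoder outputs
# for a batch of four essays and their four prompts.
loss = info_nce(torch.randn(4, 768), torch.randn(4, 768))
```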
Experimental Results
The authors conducted two main studies:
- Study on the ELLIPSE Corpus: Fine-tuned RoBERTa and DistilBERT models achieved Precision, F1, and Quadratic Weighted Kappa (QWK) scores above 0.80 across all evaluated dimensions, including cohesion, syntax, and vocabulary.
- Study on the IELTS Dataset: The same fine-tuning strategy improved predictive accuracy for dimensions such as task achievement and coherence, with metrics slightly higher than on ELLIPSE, likely owing to the larger size and greater variety of the IELTS corpus. (A sketch of the QWK computation follows this list.)
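Quadratic Weighted Kappa, the agreement metric reported in both studies, penalizes disagreements between gold and predicted scores by the square of their distance on the score scale; it is available in scikit-learn. The labels below are invented for illustration.

```python
# Quadratic Weighted Kappa (QWK) between gold and predicted score bands.
from sklearn.metrics import cohen_kappa_score

gold = [3, 4, 2, 5, 3, 4]  # hypothetical gold bands
pred = [3, 4, 3, 5, 2, 4]  # hypothetical model predictions
print(cohen_kappa_score(gold, pred, weights="quadratic"))
```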
Discussion
The AEMS models not only surpass prior methods in holistic scoring (as evidenced by QWK comparisons) but also provide detailed per-dimension scores. Cross-dataset validation indicates stable, reliable performance across diverse conditions, showcasing the models' robustness. The combined use of classification and regression gives a more comprehensive view of student essays, meeting the nuanced demands of educational assessment.
Conclusion
The development of AEMS systems marks a significant advance in AES by expanding beyond a simplistic holistic approach to a rich, multi-dimensional evaluation of essay quality. Future research could further enhance these models by integrating newer datasets and refining contrastive learning techniques to increase scoring precision and reliability in varied language contexts. This research thus provides an important foundation for more nuanced and actionable essay assessments, meeting the diverse requirements of educators and learners in applied linguistic environments.