- The paper presents an innovative multi-dimensional scoring system for essays by combining fine-tuning and regression techniques.
- It fine-tunes models like RoBERTa and DistilBERT to provide detailed evaluations across dimensions such as vocabulary, grammar, and coherence.
- Experimental results demonstrate robust performance across diverse datasets, enhancing the reliability of automatic essay scoring.
Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression
Introduction
The paper presents a novel approach to Automated Essay Scoring (AES) that advances traditional methodologies by combining fine-tuning of pre-trained language models with multiple regression. Traditional AES systems produce a single holistic score, which does not meet the multidimensional feedback needs of second-language (L2) learners and educators. The proposed system, termed "Automatic Essay Multi-dimensional Scoring" (AEMS), instead scores essays across multiple dimensions such as vocabulary, grammar, and coherence.
Methodology
The methodology centers on fine-tuning pre-trained language models for multidimensional essay scoring, using a two-pronged approach that combines classification and regression:
- Fine-tuning Pre-trained Language Models: The authors adapted RoBERTa and DistilBERT, fine-tuning them on essay datasets to obtain domain-specific scoring classifiers.
- Multiple Regression Head: A regression head was added to the BERT-based classifiers to predict continuous scores for each essay dimension, so the architecture handles multi-class classification and regression simultaneously (a hedged sketch of such an architecture follows this list).
- Contrastive Learning: Incorporating additional context such as essay requirements and prompts improved the models' contextual understanding and scoring accuracy (a hedged sketch of one possible contrastive objective also follows this list).
- Datasets: Two datasets were used, ensuring diversity in the training data and strengthening the robustness and applicability of the AEMS models:
  - The English Language Learner Insight, Proficiency, and Skills Evaluation (ELLIPSE) corpus.
  - The International English Language Testing System (IELTS) dataset.
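The paper's implementation is not included in this summary, but the architecture described above can be illustrated with a minimal PyTorch/Hugging Face sketch: a shared RoBERTa encoder feeding per-dimension classification heads alongside a single regression head, trained jointly with cross-entropy and MSE losses. The model name, the six ELLIPSE-style dimensions, the nine score bands (1.0–5.0 in 0.5 steps), the prompt-plus-essay input encoding, and the loss weighting are all assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch (not the authors' code): a shared RoBERTa encoder with
# per-dimension classification heads plus one regression head, trained
# jointly for multi-dimensional essay scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumed dimensions (the six ELLIPSE analytic dimensions).
DIMENSIONS = ["cohesion", "syntax", "vocabulary",
              "phraseology", "grammar", "conventions"]
NUM_BANDS = 9  # assumed: scores 1.0-5.0 in 0.5 steps

class MultiDimScorer(nn.Module):
    def __init__(self, base_model="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        # One multi-class head per dimension (discrete score bands).
        self.cls_heads = nn.ModuleList(
            [nn.Linear(hidden, NUM_BANDS) for _ in DIMENSIONS])
        # One regression head predicting all continuous dimension scores.
        self.reg_head = nn.Linear(hidden, len(DIMENSIONS))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # <s>-token representation
        logits = [head(pooled) for head in self.cls_heads]
        scores = self.reg_head(pooled)
        return logits, scores

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = MultiDimScorer()

# Encoding the prompt together with the essay as a text pair is one
# plausible way to inject the "essay requirements" context mentioned above.
batch = tokenizer(["Describe a challenge you overcame."],      # hypothetical prompt
                  ["Last year I moved to a new country ..."],  # hypothetical essay
                  truncation=True, padding=True, return_tensors="pt")
logits, scores = model(batch["input_ids"], batch["attention_mask"])

# Joint loss: cross-entropy on discrete bands plus MSE on continuous
# scores (the 0.5 mixing weight is an assumption).
band_labels = torch.randint(0, NUM_BANDS, (1, len(DIMENSIONS)))  # fake gold bands
cont_labels = torch.rand(1, len(DIMENSIONS)) * 4 + 1             # fake gold 1-5
cls_loss = sum(F.cross_entropy(logits[d], band_labels[:, d])
               for d in range(len(DIMENSIONS)))
loss = cls_loss + 0.5 * F.mse_loss(scores, cont_labels)
loss.backward()
```

A single shared encoder keeps the per-dimension heads cheap, and joint training lets the discrete and continuous views of each dimension regularize one another.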
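The paper's exact contrastive formulation is not given here, so the following is only one plausible reading: an InfoNCE-style loss over a batch in which each essay embedding is pulled toward the embedding of its own prompt/requirements text and pushed away from the other prompts in the batch. The temperature and embedding size are placeholders.

```python
# Hedged sketch of a contrastive objective pairing essays with their prompts
# (InfoNCE-style; the paper does not specify its exact formulation).
import torch
import torch.nn.functional as F

def info_nce(essay_emb, prompt_emb, temperature=0.07):
    essay_emb = F.normalize(essay_emb, dim=-1)
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    sims = essay_emb @ prompt_emb.T / temperature  # batch x batch similarities
    targets = torch.arange(essay_emb.size(0))      # i-th essay matches i-th prompt
    return F.cross_entropy(sims, targets)

# Placeholder embeddings; in practice these would be pooled encoder outputs
# for a batch of four essays and their four prompts.
loss = info_nce(torch.randn(4, 768), torch.randn(4, 768))
```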
Experimental Results
The authors conducted two main studies:
- Study on the ELLIPSE Corpus: Fine-tuned RoBERTa and DistilBERT models achieved Precision, F1, and Quadratic Weighted Kappa (QWK) scores above 0.80 across all evaluated dimensions, including cohesion, syntax, and vocabulary.
- Study on the IELTS Dataset: The same fine-tuning strategy improved predictive accuracy for dimensions such as task achievement and coherence, with metrics slightly higher than on ELLIPSE, likely owing to the larger size and greater variety of the IELTS corpus. (A sketch of the QWK computation follows this list.)
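Quadratic Weighted Kappa, the agreement metric reported in both studies, penalizes disagreements between gold and predicted scores by the square of their distance on the score scale; it is available in scikit-learn. The labels below are invented for illustration.

```python
# Quadratic Weighted Kappa (QWK) between gold and predicted score bands.
from sklearn.metrics import cohen_kappa_score

gold = [3, 4, 2, 5, 3, 4]  # hypothetical gold bands
pred = [3, 4, 3, 5, 2, 4]  # hypothetical model predictions
print(cohen_kappa_score(gold, pred, weights="quadratic"))
```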
Discussion
The AEMS models not only surpass prior methods in holistic scoring (as evidenced by QWK comparisons) but also provide detailed per-dimension scores. Cross-dataset validation indicates stable, reliable performance across diverse conditions, showcasing the models' robustness. The combined use of classification and regression gives a more comprehensive view of student essays, meeting the nuanced demands of educational assessment.
Conclusion
The development of AEMS systems marks a significant advance in AES by expanding beyond a simplistic holistic approach to a rich, multi-dimensional evaluation of essay quality. Future research could further enhance these models by integrating newer datasets and refining contrastive learning techniques to increase scoring precision and reliability in varied language contexts. This research thus provides an important foundation for more nuanced and actionable essay assessments, meeting the diverse requirements of educators and learners in applied linguistic environments.