Abstract

While current automated essay scoring (AES) methods show high agreement with human raters, their scoring mechanisms are not fully explored. Our proposed method, using counterfactual intervention assisted by LLMs, reveals that when scoring essays, BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, and organization, indicating a more comprehensive alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions when generating feedback. Our approach improves understanding of neural AES methods and can also be applied to other domains seeking transparency in model-driven decisions. The code and data will be released on GitHub.

Figure: GPT-3.5 SFT model performance on two datasets across various training data sizes, including zero-shot learning.

Overview

  • The paper introduces a novel diagnostic approach using linguistically-informed counterfactual interventions to better understand the decision-making mechanisms of Automated Essay Scoring (AES) systems.

  • A robust methodology integrates linguistic knowledge from essay scoring rubrics with LLMs to generate counterfactual essays, revealing the models' scoring basis beyond mere agreement with human raters.

  • Experimental results show that LLMs not only align better with scoring rubrics but also provide high-quality feedback, enhancing the transparency and accountability of AES systems.

Diagnosing the Rationale Alignment in Automated Essay Scoring Methods

Overview

The paper "Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals" addresses a crucial aspect of automated essay scoring (AES) systems. While AES models have shown high alignment with human raters, their decision-making mechanisms remain inadequately explained. This work introduces a novel diagnostic approach using linguistically-informed counterfactual interventions to probe these mechanisms in both traditional NLP models and LLMs.

Key Contributions

The authors present a robust methodology that integrates linguistic knowledge from essay scoring rubrics—such as conventions, language complexity, and organization—with LLMs to generate counterfactual interventions. This approach systematically reveals the models' scoring basis beyond mere agreement with human raters.

Methodology

The methodology proceeds in three main steps:

  1. Concept Extraction: Target linguistic concepts are identified from the essay scoring rubrics of major standardized tests such as IELTS and TOEFL iBT. The focus is placed on:
  • Conventions: Adherence to standard English rules.
  • Language Complexity: Vocabulary richness and syntactic variety.
  • Organization: Logical structure and coherence.
  2. Counterfactual Generation: Using both LLMs and rule-based techniques, counterfactual essays are generated by altering specific linguistic features while preserving content and fluency (a minimal sketch follows this list).

  3. Model Evaluation: The authors fine-tune BERT, RoBERTa, and DeBERTa models on specific datasets (TOEFL11 and ELLIPSE), and compare their performance with LLMs like GPT-3.5 and GPT-4 in zero-shot and few-shot learning settings.
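
To make the counterfactual generation step concrete, here is a minimal sketch of how an LLM could be prompted to degrade a single rubric dimension while leaving the essay's content otherwise intact. The prompt wording, the `generate_counterfactual` helper, and the use of the OpenAI chat API with GPT-4 are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of LLM-based counterfactual generation (assumed prompts and
# model name; the paper's exact pipeline may differ).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INTERVENTION_PROMPTS = {
    "organization": (
        "Rewrite the essay so its ideas appear in a shuffled, less logical order "
        "and transition words are removed. Do not add or delete content."
    ),
    "conventions": (
        "Rewrite the essay introducing frequent spelling, punctuation, and "
        "agreement errors. Keep the meaning and structure unchanged."
    ),
    "language_complexity": (
        "Rewrite the essay using simpler vocabulary and shorter, repetitive "
        "sentence patterns. Keep the content and organization unchanged."
    ),
}

def generate_counterfactual(essay: str, concept: str, model: str = "gpt-4") -> str:
    """Produce a counterfactual essay that degrades one linguistic concept."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": INTERVENTION_PROMPTS[concept]},
            {"role": "user", "content": essay},
        ],
        temperature=0.0,  # deterministic edits make interventions easier to audit
    )
    return response.choices[0].message.content
```

Rule-based interventions (e.g., shuffling paragraph order or injecting spelling errors) can serve as a deterministic complement to such LLM-generated counterfactuals.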

Experiments and Results

The experimental results provide several insights:

Agreement and Alignment:

BERT-like models exhibit higher agreement with human raters but display limitations in recognizing organizational features of essays. In contrast, LLMs, particularly after few-shot learning or fine-tuning, not only align better with scoring rubrics but also achieve high score agreement.
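
The summary does not name the agreement metric, but quadratic weighted kappa (QWK) is the standard measure in AES work; a minimal computation with scikit-learn looks like this (the score arrays are placeholders):

```python
# Minimal sketch: score agreement via quadratic weighted kappa (QWK).
# QWK is assumed here as the metric; the scores are illustrative placeholders.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 3, 4]   # gold scores from human raters (placeholder)
model_scores = [3, 4, 3, 5, 2, 4]   # predicted scores from an AES model (placeholder)

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK agreement: {qwk:.3f}")
```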

Counterfactual Interventions:

The study demonstrates that traditional models respond to conventions and language complexity but fail to account for logical structure and coherence. LLMs show sensitivity to all targeted linguistic concepts, indicating a more comprehensive rationale alignment.
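
One way to operationalize this sensitivity is to compare a model's score on an essay before and after a targeted intervention: a clear drop suggests the model attends to the degraded concept, while no change suggests it ignores it. The helper below is a hypothetical sketch; `score_essay` stands in for whichever AES model is being probed.

```python
# Hypothetical diagnostic: how much does the predicted score drop when a single
# rubric dimension is degraded? `score_essay` is any AES scoring function
# (a fine-tuned BERT, a prompted LLM, etc.).
from typing import Callable

def rationale_sensitivity(
    score_essay: Callable[[str], float],
    original: str,
    counterfactual: str,
) -> float:
    """Return the score drop from the original to the counterfactual essay."""
    return score_essay(original) - score_essay(counterfactual)

# Averaging this drop over many essay pairs, per intervened concept, indicates
# which rubric dimensions a model is actually sensitive to.
```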

Feedback Generation:

LLMs are also employed to generate feedback for essays, which further supports their adherence to the scoring rubrics. Feedback quality is manually evaluated, and the feedback LLMs produce for original essays differs discernibly from that for their counterfactual counterparts.
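
As a rough illustration, feedback can be elicited with a rubric-oriented prompt and then compared across original and counterfactual versions of the same essay; the prompt wording and model name below are assumptions, not the authors' protocol.

```python
# Assumed feedback prompt; the paper's exact wording and evaluation differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEEDBACK_PROMPT = (
    "You are an essay rater. Comment on the essay's conventions, language "
    "complexity, and organization, then state its single biggest weakness."
)

def generate_feedback(essay: str, model: str = "gpt-4") -> str:
    """Return rubric-oriented feedback for one essay."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": FEEDBACK_PROMPT},
            {"role": "user", "content": essay},
        ],
    )
    return response.choices[0].message.content

# Comparing feedback for an original essay with feedback for its counterfactual
# (e.g., one with degraded organization) shows whether the model's comments
# pick up the intervened dimension.
```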

Implications and Future Work

This research underscores the importance of assessing both agreement and rationale alignment in AES systems. The findings suggest that while BERT-like models may rank higher on traditional agreement metrics, LLMs offer superior alignment with human rationale when properly fine-tuned.

The implications of this study are significant for the development and deployment of AES systems in educational settings. By ensuring that models not only agree with human raters but also follow a similar rationale, we can enhance their reliability and transparency in high-stakes testing scenarios.

Moreover, the approach can be generalized to other domains where transparency in model-driven decisions is critical. The use of LLMs for generating counterfactual samples marks a substantial advancement in the explainability and accountability of machine learning models.

Conclusion

The study "Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals" provides a significant contribution to the field of AES. By employing linguistically-informed counterfactuals, the authors reveal important distinctions in how traditional models and LLMs process and score essays. This method enhances our understanding of model reasoning, paving the way for more transparent and accountable applications of neural AES systems. Future research could extend these findings by exploring additional scoring dimensions and evaluating comprehensive feedback mechanisms further.
