Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation (2404.15845v1)

Published 24 Apr 2024 in cs.CL

Abstract: Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically-generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. LLMs have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback remains low ultimately.

Summary

  • The paper demonstrates that integrating essay scoring with feedback generation using LLMs improves scoring accuracy.
  • The methodology compares zero-shot and few-shot prompting techniques, including persona-based strategies, to optimize performance.
  • Findings indicate that while AES benefits from tackling both tasks jointly, improving feedback quality remains a key direction for future research.

Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Introduction

The paper "Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation" (2404.15845) investigates various prompting techniques for generating essay feedback using LLMs, while simultaneously assessing automated essay scoring (AES) performance. The authors aim to provide individualized feedback that facilitates student improvement in essay writing, mitigating the laborious task of manual evaluation. The study leverages LLMs' prowess in generating coherent, contextually relevant text to explore AES and feedback generation.

Methodology

The study examines several prompting strategies, distinguishing between zero-shot and few-shot prompting and drawing inspiration from Chain-of-Thought prompting. Various experimental settings are designed to test the synergy between AES and feedback generation. Prompts are built from different task instructions: some ask only for a score, some only for feedback, and others request both, optionally framed through persona-based strategies. These variations aim to optimize both scoring quality and the quality of the generated feedback.
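This summary does not reproduce the authors' exact prompt wording, so the following is only a minimal sketch of how such joint prompts might be assembled: an optional teacher persona, a task instruction that asks for a score and then for feedback (loosely mirroring the Chain-of-Thought-inspired ordering), and optional few-shot exemplars. The FewShotExample structure, the persona wording, the score range, and the build_prompt helper are illustrative assumptions, not the paper's actual templates.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FewShotExample:
    # Hypothetical container for a scored (and feedback-annotated) exemplar essay.
    essay: str
    score: int
    feedback: str

PERSONA = "You are an experienced English teacher who grades student essays."  # assumed wording

def build_prompt(essay: str,
                 examples: Optional[List[FewShotExample]] = None,
                 use_persona: bool = True,
                 joint: bool = True) -> str:
    """Assemble a zero-shot (no examples) or few-shot prompt that asks for a
    score and, if joint=True, also for feedback in the same response."""
    parts = []
    if use_persona:
        parts.append(PERSONA)
    task = "Score the following essay on a scale from 1 to 6."
    if joint:
        # Scoring first, feedback second: the Chain-of-Thought-like ordering studied in the paper.
        task += " Then give the student concrete feedback on how to improve the essay."
    parts.append(task)
    for ex in (examples or []):  # few-shot exemplars; omitted in the zero-shot setting
        parts.append(f"Essay:\n{ex.essay}\nScore: {ex.score}\nFeedback: {ex.feedback}")
    parts.append(f"Essay:\n{essay}\nScore:")
    return "\n\n".join(parts)

# Zero-shot, joint scoring and feedback:
print(build_prompt("The essay text goes here..."))

A scoring-only variant would set joint=False; the few-shot variants simply prepend one or more exemplar essays with their scores (and, in the joint setting, their feedback).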

Results

The experiments compare the prompting strategies along two axes: the AES performance achievable with prompting alone and the helpfulness of the generated feedback. The results show that coupling AES with feedback generation generally improves AES performance, while its influence on feedback quality is limited. LLMs reached competitive AES accuracy when prompted to score and give feedback jointly, suggesting potential for integrated approaches in educational settings.
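Essay-scoring performance in this line of work is conventionally reported as quadratic weighted kappa (QWK) against human reference scores. A minimal evaluation sketch, assuming the per-essay integer scores have already been parsed out of the model responses and using scikit-learn (both assumptions, since the paper's evaluation code is not shown here), could look like this:

from sklearn.metrics import cohen_kappa_score

# Human reference scores and LLM-predicted scores for the same essays (illustrative values only).
human_scores = [4, 3, 5, 2, 4, 3]
llm_scores = [4, 3, 4, 2, 5, 3]

# Quadratic weighting penalizes disagreements more strongly the further apart the two scores are.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")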

The manual evaluation indicates that the generated feedback is largely helpful for guiding students' essay revisions. However, the authors find that adding the scoring step does not noticeably improve the quality of the generated feedback, pointing to an area for methodological refinement.
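One way to make the limited impact of scoring on feedback quality concrete is to compare mean helpfulness ratings between prompting conditions with and without the scoring step. The table layout, column names, and 1-5 rating scale below are hypothetical; the sketch only illustrates the kind of aggregation such a manual evaluation supports:

import pandas as pd

# Hypothetical manual-evaluation records: one row per (essay, prompting condition),
# with an annotator's helpfulness rating on an assumed 1-5 scale.
ratings = pd.DataFrame({
    "condition": ["feedback_only", "score_then_feedback"] * 3,
    "helpfulness": [4, 4, 3, 4, 5, 4],
})

# Mean helpfulness per condition; a small gap would indicate that the scoring
# step has little effect on how helpful the feedback is judged to be.
print(ratings.groupby("condition")["helpfulness"].mean())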

Discussion

This research highlights the complementary relationship between AES and feedback generation under LLM prompting. While AES clearly benefits from being coupled with feedback generation, the gains in feedback quality are minimal. Future work might therefore target the feedback itself, for instance its contextual depth and the actionability of its suggestions, to strengthen educational applicability.

The findings also raise broader questions about LLM-based educational tools: combining automated scoring with pedagogically sound feedback could enable personalized, scalable support for students.

Conclusion

The paper outlines practical avenues for advancing automated essay evaluation through carefully designed LLM prompting, balancing scoring accuracy with meaningful feedback. The observed synergy supports the development of AI systems for educational assessment, though further refinement is needed to maximize the instructional value of the feedback. The authors encourage continued work on LLM applications that remain aligned with pedagogical needs.
