
Harnessing the Power of Prompt-based Techniques for Generating School-Level Questions using Large Language Models (2312.01032v1)

Published 2 Dec 2023 in cs.CL and cs.AI

Abstract: Designing high-quality educational questions is a challenging and time-consuming task. In this work, we propose a novel approach that utilizes prompt-based techniques to generate descriptive and reasoning-based questions. However, current question-answering (QA) datasets are inadequate for conducting our experiments on prompt-based question generation (QG) in an educational setting. Therefore, we curate a new QG dataset called EduProbe for school-level subjects by leveraging the rich content of NCERT textbooks. We carefully annotate this dataset as quadruples of 1) Context: a segment upon which the question is formed; 2) Long Prompt: a long textual cue for the question (i.e., a longer sequence of words or phrases, covering the main theme of the context); 3) Short Prompt: a short textual cue for the question (i.e., a condensed representation of the key information or focus of the context); 4) Question: a deep question that aligns with the context and is coherent with the prompts. We investigate several prompt-based QG methods by fine-tuning pre-trained transformer-based LLMs, namely PEGASUS, T5, MBART, and BART. Moreover, we explore the performance of two general-purpose pre-trained LLMs, namely Text-Davinci-003 and GPT-3.5-Turbo, without any further training. By performing automatic evaluation, we show that T5 (with long prompt) outperforms all other models, but still falls short of the human baseline. Under human evaluation criteria, Text-Davinci-003 usually shows better results than other models under various prompt settings. Under the human evaluation criteria as well, QG models mostly fall short of the human baseline. Our code and dataset are available at: https://github.com/my625/PromptQG
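
The abstract describes EduProbe quadruples (Context, Long Prompt, Short Prompt, Question) and prompt-based QG via fine-tuned text-to-text models such as T5. Below is a minimal sketch of how such a quadruple might be serialized into a single input sequence for a T5-style model using the Hugging Face transformers library. The example quadruple and the "generate question: ... context: ..." input template are illustrative assumptions, not the authors' released format; see the linked repository for the actual implementation.

```python
# Minimal sketch: prompt-based question generation with a T5-style model.
# The quadruple below and the input template are hypothetical illustrations.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

example = {
    "context": "The Indus Valley Civilisation flourished around 2500 BCE, "
               "with planned cities featuring covered drainage systems.",
    "long_prompt": "urban planning and drainage systems of the Indus Valley cities",
    "short_prompt": "Indus Valley urban planning",
    "question": "How did the drainage systems of Indus Valley cities reflect advanced urban planning?",
}

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Serialize the (long) prompt and context into one text-to-text input.
source = f"generate question: {example['long_prompt']} context: {example['context']}"
inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)

# During fine-tuning, the gold question is the target sequence.
labels = tokenizer(example["question"], return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy loss

# At inference time, decode a question from the model (here the untuned base model).
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping the long prompt for the short prompt in the same template would correspond to the short-prompt setting compared in the paper.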
