SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation (2405.09939v2)

Published 16 May 2024 in cs.CL and cs.AI

Abstract: We introduce SciQAG, a novel framework for automatically generating high-quality science question-answer pairs from a large corpus of scientific literature based on LLMs. SciQAG consists of a QA generator and a QA evaluator, which work together to extract diverse and research-level questions and answers from scientific papers. Utilizing this framework, we construct a large-scale, high-quality, open-ended science QA dataset containing 188,042 QA pairs extracted from 22,743 scientific papers across 24 scientific domains. We also introduce SciQAG-24D, a new benchmark task designed to evaluate the science question-answering ability of LLMs. Extensive experiments demonstrate that fine-tuning LLMs on the SciQAG dataset significantly improves their performance on both open-ended question answering and scientific tasks. To foster research and collaboration, we make the datasets, models, and evaluation codes publicly available, contributing to the advancement of science question answering and developing more interpretable and reasoning-capable AI systems.

Summary

  • The paper introduces the SciQAG framework, which automatically generates and evaluates scientific QA pairs from a large corpus of scholarly articles.
  • It employs GPT-4 and open-source models refined via expert feedback and iterative fine-tuning to achieve high scores on the RACAR metric.
  • Experiments show that fine-tuned models like Vicuna produce diverse, accurate QA pairs, significantly advancing the training of scientific LLMs.

SciQAG: Auto-Generating Scientific Question Answering Datasets

The paper introduces SciQAG, a framework designed for the automatic generation and evaluation of scientific Question-Answer (QA) pairs derived from scientific literature. The framework addresses the challenges posed by the increasing volume and complexity of scientific publications, providing a means to efficiently extract and assess knowledge using LLMs. By generating high-quality QA pairs at scale, SciQAG facilitates the training and evaluation of LLMs in scientific domains.

Framework Components and Functionality

SciQAG integrates three primary components: Seed QA, QA Generator, and QA Evaluator (Figure 1). The Seed QA component leverages GPT-4 to generate initial QA pairs from a subset of scientific papers, refined through domain expert feedback to optimize prompt effectiveness. The QA Generator then uses these refined prompts to fine-tune an open-source generative model, enabling the creation of QA pairs from a large corpus of scientific articles. Finally, the QA Evaluator employs another LLM to assess the generated pairs across five key dimensions.

Figure 1: The SciQAG framework links Seed QA, QA Generator, and QA Evaluator steps to generate and evaluate a scientific QA dataset from scientific literature. The dashed line represents optional fine-tuning.

The framework supports several optional configurations, enhancing its adaptability and performance. The seed QA pairs can be used to fine-tune the generator, or the generator step can prompt an LLM directly without fine-tuning, and the QA Evaluator can filter generated data by RACAR score for iterative improvement. This flexibility allows implementations to be tailored to specific research needs.
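For concreteness, below is a minimal sketch of how the three stages could be wired together. The callable wrappers, prompt strings, and score threshold are illustrative assumptions, not the paper's actual prompts or code; in the paper, the generator is an open-source model fine-tuned on GPT-4-produced seed data, and the evaluator role is played by GPT-4.

```python
# Minimal sketch of the SciQAG flow: seed QA pairs bootstrap a generator,
# the generator produces QA pairs over the full corpus, and an evaluator
# filters them by RACAR score. The LLMFn callables, prompts, and threshold
# are hypothetical placeholders.
from typing import Callable

LLMFn = Callable[[str], str]  # text in, text out

def make_seed_qa(papers: list[str], strong_llm: LLMFn) -> list[str]:
    """Stage 1: a strong model (GPT-4 in the paper) drafts seed QA pairs,
    which can then be used to fine-tune an open-source generator."""
    return [strong_llm(f"Write 10 research-level QA pairs for this paper:\n{p}")
            for p in papers]

def generate_qa(corpus: list[str], generator: LLMFn) -> list[tuple[str, str]]:
    """Stage 2: the (optionally fine-tuned) generator scales QA generation
    to the full corpus; returns (paper, qa_text) pairs."""
    return [(p, generator(f"Write 10 QA pairs grounded in this paper:\n{p}"))
            for p in corpus]

def filter_by_racar(items: list[tuple[str, str]], evaluator: LLMFn,
                    score_fn: Callable[[str], float],
                    threshold: float = 2.5) -> list[str]:
    """Stage 3: the evaluator scores each QA set on the five RACAR dimensions;
    `score_fn` turns its reply into a mean score used for filtering."""
    kept = []
    for paper, qa in items:
        reply = evaluator(
            "Score this QA set on Relevance, Agnosticism, Completeness, "
            "Accuracy and Reasonableness (1-3 each).\n"
            f"Paper:\n{paper}\nQA:\n{qa}")
        if score_fn(reply) >= threshold:
            kept.append(qa)
    return kept
```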

Dataset Creation and Characteristics

The authors curated a dataset of over 6 million scientific papers from the Web of Science (WoS) Core Collection, focusing on physical science disciplines such as materials science, chemistry, physics, and energy. To ensure balanced representation, they selected the 4,000 most cited papers from each of 24 WoS categories, resulting in a dataset of 96,000 papers (Figure 2). The TopicRank algorithm was then applied to extract 20 keywords per article, facilitating guided prompting for QA generation. The final output consists of 960,000 QA pairs, providing a substantial resource for training and benchmarking scientific LLMs.

Figure 2: Distribution of 6M papers from the WoS Core Collection across 24 WoS categories selected from Chemistry, Physics, Materials Science, and Energy. To ensure data balance, the 4,000 most cited papers were obtained from each category, forming a dataset of 24 × 4,000 = 96,000 papers.
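The keyword-extraction step above can be approximated with the open-source pke toolkit's TopicRank implementation. The snippet below is a sketch assuming pke and a spaCy English model are installed; the 20-keyword budget comes from the description above, while everything else is illustrative.

```python
# Sketch: extract 20 TopicRank keyphrases from a paper's text, mirroring the
# guided-prompting setup described above. Assumes `pip install pke` plus a
# spaCy English model (e.g. en_core_web_sm) is available.
import pke

def top_keywords(text: str, n: int = 20) -> list[str]:
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language="en")
    extractor.candidate_selection()   # select noun-phrase candidates
    extractor.candidate_weighting()   # rank topics on the candidate graph
    return [phrase for phrase, _score in extractor.get_n_best(n=n)]

# Usage (paper_text would hold the full text of one article):
# keywords = top_keywords(paper_text)
```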

Evaluation Metrics: The RACAR Framework

A key contribution of this work is the introduction of the RACAR metric, a five-dimensional evaluation framework designed to assess the quality of generated QA pairs. The dimensions include:

  • Relevance: Measures the alignment of QA pairs with the information in the source article.
  • Agnosticism: Assesses the context-independence of questions, ensuring they do not rely on specific elements like figures or tables.
  • Completeness: Evaluates whether answers comprehensively address all relevant aspects of the question.
  • Accuracy: Verifies the factual correctness of answers based on evidence from the paper.
  • Reasonableness: Checks the internal logical consistency of answers, ensuring they are free from contradictions.

GPT-4 was employed to assign scores on a scale of 1 to 3 for each dimension, with human expert evaluations used to validate the reliability of the automated scoring. Additionally, the authors analyzed the diversity of questions, coverage rate of answers, and source validation of numeric values to provide a holistic assessment of the dataset's quality.
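As a rough illustration of this scoring-and-validation loop (not the paper's actual prompt format or parsing code), the evaluator's reply can be treated as five integers on the 1-3 scale and compared against expert annotations with rank correlations:

```python
# Sketch: parse five 1-3 RACAR scores from an evaluator reply and measure
# agreement between automated and expert scores. The reply format and variable
# names are hypothetical; only the five dimensions and the 1-3 scale come
# from the description above.
import re
from scipy.stats import pearsonr, spearmanr

DIMENSIONS = ["relevance", "agnosticism", "completeness", "accuracy", "reasonableness"]

def parse_racar(reply: str) -> dict[str, int]:
    """Expects lines such as 'relevance: 3'; raises if a dimension is missing."""
    scores = {}
    for dim in DIMENSIONS:
        match = re.search(rf"{dim}\s*:\s*([1-3])", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"no score found for {dim}")
        scores[dim] = int(match.group(1))
    return scores

def agreement(llm_scores: list[int], expert_scores: list[int]) -> tuple[float, float]:
    """Spearman and Pearson correlations between LLM-assigned and expert scores,
    analogous to the validation shown in Figure 3."""
    return (spearmanr(llm_scores, expert_scores)[0],
            pearsonr(llm_scores, expert_scores)[0])
```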

Experimental Results and Analysis

The authors evaluated the performance of various LLMs, including GPT-3.5, Vicuna, and LongChat, in generating QA pairs, using the RACAR metric to compare their outputs. The results indicated that a fine-tuned Vicuna model outperformed other open-source models, although GPT-3.5 achieved higher scores across all dimensions (Table 1). Spearman and Pearson correlations were computed to compare GPT-4 assigned scores and expert-annotated scores (Figure 3).

Figure 3: Spearman and Pearson correlations between GPT-4 assigned scores and expert-annotated scores.

Analysis of question diversity revealed that most question pairs had low similarity scores, with an average similarity of 0.31, indicating substantial diversity in the generated questions. Coverage rate analysis showed an average coverage of 68% across the evaluation set, demonstrating that answers effectively sourced information from various parts of the original papers. Furthermore, source validation of numeric values indicated that 96.7% of numerical data in the answers were present in the source text, highlighting the generator's accuracy.
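The diversity and numeric-grounding checks can be sketched as follows; the embedding model and the simple regex over numbers are assumptions for illustration rather than the paper's exact procedure.

```python
# Sketch of two dataset checks: average pairwise cosine similarity between
# question embeddings (diversity) and the fraction of numeric values in an
# answer that also appear in the source paper (grounding). The embedding
# model choice is an assumption, not taken from the paper.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def avg_pairwise_similarity(questions: list[str]) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(questions, normalize_embeddings=True)
    sims = emb @ emb.T                      # cosine similarities
    mask = ~np.eye(len(questions), dtype=bool)  # drop self-similarities
    return float(sims[mask].mean())

def numeric_source_rate(answer: str, source: str) -> float:
    nums = re.findall(r"\d+(?:\.\d+)?", answer)
    if not nums:
        return 1.0
    return sum(n in source for n in nums) / len(nums)
```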

Practical Implications and Future Directions

The SciQAG framework offers a cost-effective solution for generating large volumes of high-quality scientific QA data. The generated dataset can be used to train and evaluate LLMs for scientific tasks, reducing the need for manual annotation and enabling the development of more knowledgeable and accurate models. The broad and deep scope of questions generated by SciQAG, along with the detailed and informative answers, makes it a valuable tool for enhancing the accessibility and understanding of complex scientific information.

Future research could focus on expanding the training dataset, incorporating Retrieval-Augmented Generation (RAG) techniques to further reduce hallucinations, and exploring additional evaluation metrics to capture nuanced aspects of QA quality. The SciQAG framework represents a significant step toward automating knowledge extraction from scientific literature, with potential applications in various domains, including scientific discovery, education, and information retrieval.

Conclusion

SciQAG provides a robust, open-source framework for generating and evaluating scientific QA pairs. By fine-tuning an open-source LLM and employing GPT-4 for quality assessment, the framework achieves high scores on the RACAR metric and demonstrates superior performance compared to other generative models. The resulting dataset and evaluation methods offer valuable resources for advancing scientific LLMs and promoting knowledge discovery.
