
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis (2403.01976v5)

Published 4 Mar 2024 in cs.CL

Abstract: Recent breakthroughs in LLMs have revolutionized scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios that require higher-level abilities beyond mere memorization or that involve multimodal data. In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. It assesses the efficacy of LLMs by evaluating their capabilities at three levels: Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, materials science, and medicine. To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, covering accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, highlighting their strengths and areas for improvement. We hope this evaluation supports the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are available at https://github.com/sci-assess/SciAssess.

