Evalverse: Unified and Accessible Library for Large Language Model Evaluation (2404.00943v2)
Published 1 Apr 2024 in cs.CL and cs.AI
Abstract: This paper introduces Evalverse, a novel library that streamlines the evaluation of LLMs by unifying disparate evaluation tools into a single, user-friendly framework. Evalverse enables individuals with limited knowledge of artificial intelligence to easily request LLM evaluations and receive detailed reports, facilitated by an integration with communication platforms like Slack. Thus, Evalverse serves as a powerful tool for the comprehensive assessment of LLMs, offering both researchers and practitioners a centralized and easily accessible evaluation framework. Finally, we also provide a two-minute demo video for Evalverse, showcasing its capabilities and implementation.
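To make the "unified evaluation request" idea concrete, here is a minimal Python sketch of what submitting a benchmark run through a library like Evalverse might look like. The module name `evalverse`, the `Evaluator` class, the `run` signature, and the `h6_en` benchmark identifier are assumptions inferred from the paper's description, not a verbatim reproduction of the library's API.

```python
# Hypothetical usage sketch; names below are assumptions, not a
# confirmed Evalverse API.
import evalverse as ev  # assumed module name

# A single evaluator object fronts the disparate evaluation backends
# (harness-style benchmarks, judge-based evals, etc.) behind one interface.
evaluator = ev.Evaluator()

# Request a benchmark run for a Hugging Face model. "h6_en" is a
# placeholder for an Open-LLM-Leaderboard-style six-benchmark suite.
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    benchmark="h6_en",
)
```

In the workflow the paper describes, the same request could also be issued from Slack, with the resulting report delivered back to the channel, so non-experts never have to touch the Python layer directly.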