INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (2306.04757v3)
Abstract: Instruction-tuned LLMs have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned LLMs. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment. We are encouraged by the rapid development of models by the open-source community, but we also highlight the need for rigorous evaluation to support claims made about these models. Through INSTRUCTEVAL, we aim to foster a deeper understanding of instruction-tuned models and advancements in their capabilities. INSTRUCTEVAL is publicly available at https://github.com/declare-lab/instruct-eval.
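To make the evaluation setup concrete, below is a minimal sketch of how a multiple-choice problem-solving benchmark (in the spirit of the MMLU-style tasks the suite covers) could be scored for an instruction-tuned model. The checkpoint name, prompt template, and example items are illustrative assumptions, not the paper's exact harness; the official INSTRUCTEVAL code at https://github.com/declare-lab/instruct-eval defines the real setup.

```python
# Sketch of exact-match scoring for a multiple-choice benchmark.
# All names below (checkpoint, prompt format, data) are illustrative assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-small"  # assumption: any small instruction-tuned checkpoint

# Hypothetical evaluation items: a question, lettered options, and a gold letter.
EXAMPLES = [
    {
        "question": "Which planet is known as the Red Planet?",
        "options": ["Venus", "Mars", "Jupiter", "Saturn"],
        "answer": "B",
    },
]

def build_prompt(item):
    """Format one question as an instruction with lettered answer options."""
    letters = "ABCD"
    lines = [f"Question: {item['question']}"]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, item["options"])]
    lines.append("Answer with the letter of the correct option.")
    lines.append("Answer:")
    return "\n".join(lines)

@torch.no_grad()
def evaluate(model, tokenizer, examples):
    """Return exact-match accuracy of the first generated answer letter."""
    correct = 0
    for item in examples:
        inputs = tokenizer(build_prompt(item), return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=4)
        prediction = tokenizer.decode(output[0], skip_special_tokens=True).strip()
        correct += int(prediction[:1].upper() == item["answer"])
    return correct / len(examples)

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    print(f"accuracy = {evaluate(model, tokenizer, EXAMPLES):.2f}")
```

Generation-based exact match is only one plausible scoring choice; likelihood-based option ranking is an equally common alternative for multiple-choice evaluation, and the writing-quality and alignment assessments described in the paper require separate protocols.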