INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Abstract

Instruction-tuned LLMs have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned LLMs. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment. We are encouraged by the rapid development of models by the open-source community, but we also highlight the need for rigorous evaluation to support claims made about these models. Through INSTRUCTEVAL, we aim to foster a deeper understanding of instruction-tuned models and advancements in their capabilities. INSTRUCTEVAL is publicly available at https://github.com/declare-lab/instruct-eval.
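To make the "problem-solving" part of the evaluation concrete, the sketch below shows a common way such benchmarks are scored: the model's log-likelihood of each answer option is compared and the highest-scoring option is taken as the prediction, with accuracy aggregated over questions. This is a minimal illustration, not the INSTRUCTEVAL code from the linked repository; the model name, prompt format, and example question are placeholders chosen for brevity.

```python
# Hypothetical sketch of likelihood-based multiple-choice scoring, the standard
# recipe behind MMLU-style problem-solving benchmarks. Not the INSTRUCTEVAL code;
# model name, prompt format, and the example question are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM from the Hugging Face Hub works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def option_logprob(prompt: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the logits predicts token i+1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prompt_ids.shape[1] - 1  # first position that predicts an option token
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()


question = (
    "Which planet is closest to the sun?\n"
    "A. Venus\nB. Mercury\nC. Mars\nD. Earth\nAnswer:"
)
options = [" A", " B", " C", " D"]
scores = [option_logprob(question, o) for o in options]
prediction = options[scores.index(max(scores))].strip()
print(f"Predicted answer: {prediction}")  # benchmark accuracy = fraction correct
```

Writing ability and alignment to human values are harder to reduce to a single likelihood comparison and are typically judged with rubric-based or preference-based scoring, so the snippet above should be read only as an example of the automated, accuracy-style portion of the suite.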

