Orca: Progressive Learning from Complex Explanation Traces of GPT-4

(arXiv:2306.02707)
Published Jun 5, 2023 in cs.CL and cs.LG

Abstract

Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models: limited imitation signals from shallow LFM outputs; small-scale, homogeneous training data; and, most notably, a lack of rigorous evaluation, which results in overestimating the small model's capability, as such models tend to learn to imitate the style, but not the reasoning process, of LFMs. To address these challenges, we develop Orca (we are working with our legal team to publicly release a diff of the model weights in accordance with LLaMA's release policy, to be published at https://aka.ms/orca-lm), a 13-billion-parameter model that learns to imitate the reasoning process of LFMs. Orca learns from rich signals from GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions, guided by teacher assistance from ChatGPT. To promote this progressive learning, we tap into large-scale and diverse imitation data with judicious sampling and selection. Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% on complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and by 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (a 4-point gap with an optimized system message) on professional and academic examinations such as the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT, while trailing behind GPT-4. Our research indicates that learning from step-by-step explanations, whether these are generated by humans or more advanced AI models, is a promising direction for improving model capabilities and skills.

Overview

  • Introduces Orca, a language model using Explanation Tuning to mimic GPT-4's reasoning.

  • Explanation traces from GPT-4 aid Orca's learning, enhancing its reasoning capabilities.

  • Benchmarks show Orca's competitive performance, especially in reasoning tasks.

  • Orca retains limitations, including inherited biases and challenges with tasks outside its training data, such as multi-turn conversation and in-context learning.

  • Points to the potential of smaller models and the need for continued research in instruction following.

Introduction

The field of AI research has seen remarkable strides in the development and applications of language models. Large Foundation Models (LFMs), such as GPT-4, stand at the forefront of this advancement, exhibiting a breadth of capabilities across various tasks. However, scaling down LFMs while retaining their capabilities has remained challenging. Smaller models often lack diversity in training data, suffer from evaluation protocols that overestimate their abilities, and struggle to exhibit the comprehensive reasoning skills found in their larger counterparts. This paper introduces Orca, a language model that employs a novel training methodology known as Explanation Tuning to align more closely with the reasoning process of GPT-4.

Explanation Tuning

Orca's training diverges from traditional instruction tuning methods by incorporating full explanation traces during its learning phase. These explanations were generated by querying GPT-4 with complex prompts designed to elicit step-by-step reasoning. By learning from these detailed explanation traces, Orca aims to align its outputs more closely with the thought processes manifested in GPT-4's responses. This approach allows Orca to tap into a richer signal for learning, driving its capability to mimic GPT-4's reasoning and comprehension.
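
To make the data format concrete, here is a minimal sketch of how a single Explanation Tuning example might be assembled: a system message that elicits step-by-step reasoning, a user query, and the teacher's explanation trace as the training target. The system message wording, the query_teacher helper, and the TrainingExample structure are illustrative assumptions, not the paper's actual pipeline or any particular API.

    # Minimal sketch of assembling one Explanation Tuning example.
    # Assumptions: query_teacher stands in for a call to the teacher LFM
    # (ChatGPT or GPT-4); the system message is illustrative, not the
    # paper's exact wording.
    from dataclasses import dataclass

    @dataclass
    class TrainingExample:
        system_message: str    # steers the teacher toward explanations
        user_query: str        # instruction (plus optional input) from the source task
        teacher_response: str  # explanation trace the student learns to reproduce

    SYSTEM_MESSAGE = (
        "You are a helpful assistant. Think step by step and justify "
        "your answer before stating it."
    )

    def query_teacher(system_message: str, user_query: str) -> str:
        """Placeholder for a call to the teacher model's chat API."""
        raise NotImplementedError("wire up your preferred chat API here")

    def build_example(user_query: str) -> TrainingExample:
        response = query_teacher(SYSTEM_MESSAGE, user_query)
        return TrainingExample(SYSTEM_MESSAGE, user_query, response)

Per the abstract, ChatGPT additionally provides teacher assistance, with Orca learning progressively from both teachers rather than from GPT-4 alone.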

Experimentation and Evaluation

Orca underwent comprehensive experimentation to assess its generative and reasoning capabilities across various benchmarks. On open-ended generation tasks, Orca performs comparably to state-of-the-art instruction-tuned models and reaches parity with ChatGPT in specific instances, though it still lags behind GPT-4, highlighting the persistent challenge of closing the gap with more advanced LFMs. Orca also surpasses open-source models on reasoning tasks, showing significant improvement over Vicuna-13B on benchmarks like AGIEval, which consists of professional and academic exams. Orca benefits considerably from its alignment with GPT-4, as well as from content moderation tools, which contribute to its more nuanced output.
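
To illustrate the zero-shot protocol used in these comparisons, the sketch below scores a multiple-choice benchmark in the spirit of AGIEval: the model sees only the question, with no demonstrations and no CoT prompt, and the first option letter in its completion is matched against the gold answer. The model.generate interface and the answer-extraction regex are illustrative assumptions, not the paper's actual evaluation harness.

    # Minimal sketch of zero-shot multiple-choice scoring
    # (illustrative; not the paper's evaluation harness).
    import re

    def extract_choice(text: str) -> str | None:
        """Pull the first standalone option letter (A-E) out of a completion."""
        match = re.search(r"\b([A-E])\b", text)
        return match.group(1) if match else None

    def evaluate(model, examples: list) -> float:
        """model is assumed to expose generate(prompt) -> str;
        each example is a dict with 'question' and 'answer' keys."""
        correct = 0
        for ex in examples:
            # Zero-shot: the bare question, no few-shot demonstrations,
            # no chain-of-thought instruction.
            completion = model.generate(ex["question"])
            if extract_choice(completion) == ex["answer"]:
                correct += 1
        return correct / len(examples) if examples else 0.0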

Limitations and Conclusion

Despite Orca's success in aligning closely with the capabilities of LFMs like GPT-4, there are limitations worth considering. Orca's performance is susceptible to biases in its training data, and it may inherit the limitations of the LFM it imitates. The model also struggles with scenarios not covered in its training data, such as multi-turn conversations and tasks requiring in-context learning. Additionally, the safety of Orca's generated content remains an area for future improvement.

Orca represents a step forward in machine learning research, particularly in language understanding and instruction following, and showcases the potential for smaller models to serve effectively in settings where full-scale LFMs are impractical. The study acknowledges, however, that continued research is needed to refine training methods, strengthen evaluation protocols, and better leverage the wisdom of larger models in teaching their smaller counterparts.
