Abstract

There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what exactly counts as high quality? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such high-quality examples, either via manual curation or by using GPT-3.5-Turbo as a quality scorer. We show that the extremely simple baseline of selecting the 1,000 instructions with the longest responses from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the OpenLLM benchmarks that test factual knowledge. We demonstrate this for several state-of-the-art LLMs (Llama-2-7B, Llama-2-13B, and Mistral-7B) and datasets (Alpaca-52k and Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0 while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to ensure that their enhanced performance is not simply due to GPT-4's preference for longer responses, thus ruling out any artificial improvement. In conclusion, our findings suggest that fine-tuning on the longest responses should be the default baseline for any research on instruction fine-tuning.

Overview

  • The paper examines the efficacy of using the longest instruction-response pairs for instruction fine-tuning (IFT) of LLMs, promoting a simpler approach over more complex methods.

  • Fine-tuning on the longest responses is shown to often outperform advanced selection techniques such as LIMA and AlpaGasus at improving task performance and adherence to instructions.

  • A comprehensive evaluation across datasets (Alpaca-52k, Evol-Instruct-70k) and LLMs (Llama-2-7B, Llama-2-13B, and Mistral-7B) reinforces the effectiveness of the proposed baseline methodology.

  • The study finds that IFT not only improves alignment but can also enhance factual accuracy, suggesting a nuanced relationship between IFT dataset characteristics and model abilities.

Introduction

Instruction fine-tuning (IFT) of LLMs is a critical process that shapes these models to better adhere to human directives, enhancing their conversational capabilities and task performance. While advanced techniques such as LIMA and AlpaGasus leverage carefully curated high-quality examples to guide this process, the paper in question challenges the notion that IFT necessitates complex example-selection mechanisms.

Baseline Methodology

Prior research has emphasized the careful selection of high-quality IFT examples, but this study argues that selecting examples by response length, a straightforward and cost-effective criterion, can not only rival but outstrip more nuanced strategies. The authors extract the 1,000 instruction-response pairs with the longest responses from standard datasets such as Alpaca-52k and Evol-Instruct-70k and demonstrate that models fine-tuned on these selections consistently outpace those trained with sophisticated selection methods, such as LIMA and AlpaGasus, in head-to-head evaluations. These findings hold across different LLM judges, including GPT-4 and PaLM-2.
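
For concreteness, here is a minimal sketch of this selection step, assuming an Alpaca-style JSON file in which each record stores the response under an "output" field. The file paths are hypothetical, and measuring length in tokens with the Llama-2 tokenizer is an assumption; the paper's exact length metric may differ.

```python
# Minimal sketch: keep the 1,000 examples with the longest responses.
# Assumes Alpaca-style records with an "output" field holding the response.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

with open("alpaca_data.json") as f:  # hypothetical path to Alpaca-52k
    data = json.load(f)

def response_length(example):
    # Response length in tokens; a character count would be a cheaper proxy.
    return len(tokenizer(example["output"])["input_ids"])

longest_1k = sorted(data, key=response_length, reverse=True)[:1000]

with open("alpaca_longest_1k.json", "w") as f:
    json.dump(longest_1k, f, indent=2)
```

The resulting 1,000-example file can then be fed to any standard supervised fine-tuning pipeline.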

Comprehensive Evaluation

A rigorous assessment across multiple datasets and LLMs (including Llama-2-7B, Llama-2-13B, and Mistral-7B) confirms the efficacy of the simple baseline: fine-tuning on the 1,000 longest responses often led to significantly better performance than more complex selection methods. Furthermore, the authors apply a lightweight, introspection-style refinement to the selected long examples and show that fine-tuning on the refined data improves performance further, affirming the approach as a strong default baseline for IFT research.
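
The head-to-head comparisons behind these results use an LLM as a judge. Below is a simplified sketch of such a pairwise evaluation with GPT-4 via the OpenAI chat API; the judge prompt and the exact-match answer parsing are illustrative assumptions rather than the paper's actual template, and querying both response orderings to mitigate position bias follows common practice.

```python
# Sketch of a pairwise head-to-head evaluation with GPT-4 as judge.
# The prompt and answer parsing are illustrative, not the paper's template.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Which response to the instruction below is better?
Instruction: {instruction}
Response A: {a}
Response B: {b}
Reply with exactly one letter: A or B."""

def judge(instruction: str, a: str, b: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(instruction=instruction, a=a, b=b)}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()

def win_rate(pairs):
    # pairs: (instruction, our_response, baseline_response) triples.
    wins = 0
    for instruction, ours, baseline in pairs:
        # Query both orderings to mitigate the judge's position bias.
        wins += judge(instruction, ours, baseline) == "A"
        wins += judge(instruction, baseline, ours) == "B"
    return wins / (2 * len(pairs))
```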

Implications and Analysis

Notably, the findings suggest that fine-tuning on lengthy responses is beneficial beyond alignment alone. When tested on the OpenLLM benchmarks of factual knowledge, the fine-tuned models generally maintained or improved factual accuracy, indicating that IFT can enhance factuality if the training dataset is sensibly selected. These results point to an intricate relationship between the characteristics of the IFT dataset and the resulting model's abilities.
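
Such benchmark checks can be run with EleutherAI's lm-evaluation-harness, the backend of the Open LLM leaderboard. The sketch below assumes a hypothetical fine-tuned checkpoint path, and the task list and few-shot settings are illustrative rather than the paper's exact configuration.

```python
# Sketch: measure factual-knowledge retention of a fine-tuned checkpoint
# with lm-evaluation-harness (pip install lm-eval).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./llama2-7b-longest-1k",  # hypothetical checkpoint
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "mmlu"],
    num_fewshot=0,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```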

In essence, the paper challenges prior assumptions about IFT dataset construction, advocating simple heuristics such as response length as a standard baseline. The implications are considerable and may prompt a re-evaluation of current data-selection methods in the development of future language models.
