Abstract

There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what exactly counts as high quality? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such high-quality examples, either via manual curation or by using GPT-3.5-Turbo as a quality scorer. We show that the extremely simple baseline of selecting the 1,000 instructions with the longest responses from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the OpenLLM benchmarks that test factual knowledge. We demonstrate this for several state-of-the-art LLMs (Llama-2-7B, Llama-2-13B, and Mistral-7B) and datasets (Alpaca-52k and Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0 while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to ensure that their enhanced performance is not simply due to GPT-4's preference for longer responses, thus ruling out any artificial improvement. In conclusion, our findings suggest that fine-tuning on the longest responses should be the default baseline for any research on instruction fine-tuning.

Overview

  • The paper examines the efficacy of using the longest instruction-response pairs for instruction fine-tuning (IFT) of LLMs, promoting a simpler approach over more complex methods.

  • Fine-tuning on the longest responses is shown to often outperform advanced selection techniques such as LIMA and AlpaGasus at improving task performance and adherence to instructions.

  • A comprehensive evaluation across datasets (Alpaca-52k, Evol-Instruct-70k) and LLMs (Llama-2-7B, Llama-2-13B, and Mistral-7B) reinforces the effectiveness of the proposed baseline methodology.

  • The study finds that IFT not only improves alignment but can also enhance factual accuracy, suggesting a nuanced relationship between IFT dataset characteristics and model abilities.

Introduction

Instruction fine-tuning (IFT) of LLMs is a critical process that shapes these models to better adhere to human directives, enhancing their conversational capabilities and task performance. While advanced techniques such as LIMA and AlpaGasus leverage carefully curated high-quality examples to guide this process, the paper in question challenges the notion that IFT necessitates complex example-selection mechanisms.

Baseline Methodology

Prior research has emphasized the careful selection of high-quality IFT examples, but this study argues that selecting examples by response length, a straightforward and cost-effective criterion, can not only rival but outstrip more nuanced strategies. The authors extract the 1,000 instruction-response pairs with the longest responses from standard datasets such as Alpaca-52k and Evol-Instruct-70k and demonstrate that models fine-tuned on these selections consistently outpace those trained with sophisticated selection methods, such as LIMA and AlpaGasus, in head-to-head evaluations. These findings hold across different LLM judges, including GPT-4 and PaLM-2.
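
For concreteness, here is a minimal sketch of this selection step, assuming an Alpaca-style JSON file in which each record stores the response under an "output" field. The file paths are hypothetical, and measuring length in tokens with the Llama-2 tokenizer is an assumption; the paper's exact length metric may differ.

```python
# Minimal sketch: keep the 1,000 examples with the longest responses.
# Assumes Alpaca-style records with an "output" field holding the response.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

with open("alpaca_data.json") as f:  # hypothetical path to Alpaca-52k
    data = json.load(f)

def response_length(example):
    # Response length in tokens; a character count would be a cheaper proxy.
    return len(tokenizer(example["output"])["input_ids"])

longest_1k = sorted(data, key=response_length, reverse=True)[:1000]

with open("alpaca_longest_1k.json", "w") as f:
    json.dump(longest_1k, f, indent=2)
```

The resulting 1,000-example file can then be fed to any standard supervised fine-tuning pipeline.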

Comprehensive Evaluation

A rigorous assessment across multiple datasets and LLMs (including Llama-2-7B, Llama-2-13B, and Mistral-7B) confirms the efficacy of the simple baseline: fine-tuning on the 1,000 longest responses often led to significantly better performance than more complex selection methods. Furthermore, the authors apply a lightweight, introspection-style refinement to the selected long examples and show that fine-tuning on the refined data improves performance further, affirming the approach as a strong default baseline for IFT research.
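
The head-to-head comparisons behind these results use an LLM as a judge. Below is a simplified sketch of such a pairwise evaluation with GPT-4 via the OpenAI chat API; the judge prompt and the exact-match answer parsing are illustrative assumptions rather than the paper's actual template, and querying both response orderings to mitigate position bias follows common practice.

```python
# Sketch of a pairwise head-to-head evaluation with GPT-4 as judge.
# The prompt and answer parsing are illustrative, not the paper's template.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Which response to the instruction below is better?
Instruction: {instruction}
Response A: {a}
Response B: {b}
Reply with exactly one letter: A or B."""

def judge(instruction: str, a: str, b: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(instruction=instruction, a=a, b=b)}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()

def win_rate(pairs):
    # pairs: (instruction, our_response, baseline_response) triples.
    wins = 0
    for instruction, ours, baseline in pairs:
        # Query both orderings to mitigate the judge's position bias.
        wins += judge(instruction, ours, baseline) == "A"
        wins += judge(instruction, baseline, ours) == "B"
    return wins / (2 * len(pairs))
```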

Implications and Analysis

Notably, the findings suggest that fine-tuning on lengthy responses is beneficial beyond alignment alone. When tested on the OpenLLM benchmarks of factual knowledge, the fine-tuned models generally maintained or improved factual accuracy, indicating that IFT can enhance factuality if the training dataset is sensibly selected. These results point to an intricate relationship between the characteristics of the IFT dataset and the resulting model's abilities.
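
Such benchmark checks can be run with EleutherAI's lm-evaluation-harness, the backend of the Open LLM leaderboard. The sketch below assumes a hypothetical fine-tuned checkpoint path, and the task list and few-shot settings are illustrative rather than the paper's exact configuration.

```python
# Sketch: measure factual-knowledge retention of a fine-tuned checkpoint
# with lm-evaluation-harness (pip install lm-eval).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./llama2-7b-longest-1k",  # hypothetical checkpoint
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "mmlu"],
    num_fewshot=0,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```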

In essence, the paper challenges prior assumptions about IFT dataset construction, advocating simple heuristics such as response length as a standard baseline. The implications are considerable and may prompt a re-evaluation of current data-selection methods in the development of future language models.
