FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models

Published 16 Nov 2023 in cs.CL | (2311.09829v1)

Abstract: The effective assessment of the instruction-following ability of LLMs is of paramount importance. A model that cannot adhere to human instructions might not be able to provide reliable and helpful responses. In pursuit of this goal, various benchmarks have been constructed to evaluate the instruction-following capacity of these models. However, these benchmarks are limited to a single language and are constructed using automated approaches, which restricts their applicability and the quality of the test examples they contain. To bridge this gap, we introduce the FollowEval benchmark in this paper. This benchmark is composed of instances in both English and Chinese, and all test examples are crafted by human experts. Furthermore, the FollowEval benchmark is designed to assess LLMs across five critical dimensions of instruction following: string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints. To enhance the complexity and present a sufficient challenge, each test example is designed to evaluate more than one dimension. We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans. This highlights the considerable room for improvement in the instruction-following ability of these models.

Summary

  • The paper introduces a bilingual, human-curated benchmark that evaluates LLMs on string manipulation, commonsense, logical, spatial reasoning, and response constraints.
  • It demonstrates a significant performance gap, with GPT-4 achieving around 77.5% accuracy compared to perfect human performance.
  • The methodology integrates expert verification and regex-based evaluation, offering robust insights for advancing instruction tuning and model alignment.

FollowEval: Evaluating Instruction-Following in LLMs

Introduction

The paper "FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of LLMs" (2311.09829) presents FollowEval, a novel evaluation framework explicitly designed to measure the proficiency of LLMs in following instructions across multiple dimensions. Unlike prior benchmarks, FollowEval is bilingual (English and Chinese) and fully human-curated to ensure rigor and representativeness. The benchmark aims to expose weaknesses in LLM instruction adherence, covering essential domains such as string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints. The authors demonstrate that current leading LLMs lag significantly behind human performance, emphasizing the need for advances in instruction tuning and model alignment.

Benchmark Design and Construction

FollowEval comprises 200 examples that target five core instruction-following dimensions:

  • String Manipulation: Position identification, character insertion, deletion, and replacement tasks probe text processing capacity.
  • Commonsense Reasoning: Instances require world knowledge and inference beyond explicit instructions.
  • Logical Reasoning: Mathematical logic and character counting are integrated to test formal reasoning.
  • Spatial Reasoning: Two- and three-dimensional conceptualization tasks (including character rotation and transformations) evaluate robustness.
  • Response Constraints: Length, formality, and character constraints establish boundaries for output fidelity.

The benchmark construction involves three stages: instruction drafting (six authors), expert verification (two separate individuals), and regex-based rule design for automatic evaluation (two specialists). Each test example is crafted to challenge more than one dimension simultaneously, increasing overall difficulty.
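
As a rough illustration of the regex-based rule design, the sketch below shows how a single test example might be checked automatically; the example instruction, field names, and pattern are hypothetical, not drawn from the released benchmark.

```python
import re

# Hypothetical FollowEval-style item: the instruction combines a string
# manipulation step with a response constraint, and the verification rule
# is a hand-written regular expression.
example = {
    "instruction": "Reverse the string 'cat' and answer with that single word only.",
    "pattern": r"^\s*tac\s*$",  # accepts only the single word 'tac'
}

def passes(example: dict, response: str) -> bool:
    """Return True if the model response satisfies the example's regex rule."""
    return re.match(example["pattern"], response) is not None

print(passes(example, "tac"))                        # True
print(passes(example, "The reversed word is tac."))  # False
```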

Comparative Analysis with Existing Benchmarks

The authors position FollowEval against earlier benchmarks, which are generally:

  • Monolingual (English or Chinese only)
  • Automatically generated, which can introduce bias and reduce sample quality
  • Limited in scope with respect to compositional and multi-dimensional instruction following

By employing human curation and incorporating both English and Chinese samples, FollowEval addresses the generalizability, linguistic diversity, and task complexity shortcomings of prior benchmarks (Chen et al., 2022, Yao et al., 2023, He et al., 2023, Mu et al., 2023, Sun et al., 2023, Jiang et al., 2023). The benchmark's regex-driven evaluation methodology enables efficient, consistent, and scalable assessment while mitigating the carbon footprint and financial cost of LLM-based evaluation systems (Wang et al., 2023, Zheng et al., 2023).
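
To make the scalability point concrete, a full run could in principle be scored with a short script that applies each example's regex to the corresponding model response; the JSONL layout and field names below are assumptions for illustration, not the authors' released format.

```python
import json
import re

def score_benchmark(examples_path: str, responses_path: str) -> float:
    """Score model responses against per-example regex rules.

    Assumes (hypothetically) one JSON object per line: benchmark lines carry
    `id` and `pattern`, response lines carry `id` and `response`.
    """
    with open(examples_path, encoding="utf-8") as f:
        rules = {ex["id"]: ex["pattern"] for ex in map(json.loads, f)}
    with open(responses_path, encoding="utf-8") as f:
        responses = {r["id"]: r["response"] for r in map(json.loads, f)}

    passed = sum(
        1
        for ex_id, pattern in rules.items()
        if re.match(pattern, responses.get(ex_id, "")) is not None
    )
    return passed / len(rules)

# Usage with hypothetical file names:
# print(f"accuracy = {score_benchmark('followeval.jsonl', 'model_outputs.jsonl'):.1%}")
```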

Experimental Setup and Results

A representative set of both proprietary (GPT-4, GPT-3.5-Turbo) and open-source (Qwen-Chat, Baichuan, ChatGLM, InternLM, LLaMA-2, AquilaChat2) LLMs is evaluated on FollowEval. Sampling-based decoding strategies (top-k, nucleus sampling) are used, and each model is evaluated three times to mitigate decoding stochasticity.
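
The exact decoding hyperparameters are not restated here, so the following is only an assumed setup using the Hugging Face transformers API: sampled generation with top-k and nucleus decoding, repeated three times per prompt so the regex checks can be averaged over runs. The model name and parameter values are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder open-source chat model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def sample_responses(prompt: str, n_runs: int = 3) -> list[str]:
    """Generate n_runs sampled completions (top-k + nucleus sampling)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    completions = []
    for _ in range(n_runs):  # repeated runs mitigate sampling stochasticity
        output = model.generate(
            **inputs,
            do_sample=True,
            top_k=50,            # assumed top-k value
            top_p=0.9,           # assumed nucleus threshold
            max_new_tokens=256,
        )
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        completions.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    return completions
```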

The main finding is a pronounced performance gap between LLMs and humans:

  • Human accuracy: 100%
  • Best LLM (GPT-4): ~77.5% average accuracy
  • Open-source models: Markedly lower, with accuracies declining as parameter count decreases

Parameter scaling correlates with improved performance within a model series, but even the highest-performing LLM does not approach the human baseline. Proprietary models outperform open-source ones, highlighting both the value and the current limitations of non-public instruction tuning methodologies.
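
For concreteness, turning the three stochastic runs into a single reported score and a gap to the human baseline is simple averaging; the per-run numbers below are invented placeholders chosen only to average near the ~77.5% GPT-4 figure cited above.

```python
# Per-run accuracies (placeholders); human accuracy is 100% by construction.
run_accuracies = {
    "gpt-4": [0.780, 0.775, 0.770],
    "open-source-7b": [0.46, 0.44, 0.45],
}
HUMAN_BASELINE = 1.0

for model_name, runs in run_accuracies.items():
    mean_acc = sum(runs) / len(runs)
    print(f"{model_name}: mean accuracy {mean_acc:.1%}, "
          f"gap to human {HUMAN_BASELINE - mean_acc:.1%}")
```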

Implications and Future Directions

FollowEval's results reveal persistent limitations in LLM compliance with complex, multi-constraint instructions. Immediate practical implications include:

  • Model Selection: The benchmark offers a tool for discriminative selection of LLMs for high-stakes deployment.
  • Alignment Research: Results underscore the imperative for improved RLHF protocols and broader instruction tuning (including multi-dimensional objectives).
  • Low-Resource Languages: The authors note the future expansion potential of FollowEval to additional languages, encouraging wider applicability.

On the theoretical front, systematic multi-dimensional benchmarks like FollowEval may inform the development of universally robust instruction-following architectures, guiding improvements in compositional reasoning, constraint satisfaction, and natural language understanding. The benchmark also prompts a re-evaluation of automated benchmark generation and validation protocols, given the high sensitivity of instruction-following tasks to subtle linguistic and logical nuances.

Conclusion

FollowEval introduces a high-complexity, multi-dimensional benchmark for measuring LLM adherence to instructions in both English and Chinese. Human annotator involvement and regex-based evaluation ensure task rigor, linguistic diversity, and scalable assessment. Empirical results demonstrate significant deficits in LLM instruction-following relative to humans, particularly under compositional constraints and multi-domain tasks. FollowEval constitutes a critical tool for advancing instruction tuning, model alignment, and robust LLM evaluation, and its expansion to additional linguistic domains is recommended to further support research in effective human-AI interaction.
