- The paper introduces a bilingual, human-curated benchmark that evaluates LLMs on string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints.
- It demonstrates a significant performance gap, with GPT-4 achieving around 77.5% accuracy compared to perfect human performance.
- The methodology integrates expert verification and regex-based evaluation, offering robust insights for advancing instruction tuning and model alignment.
FollowEval: Evaluating Instruction-Following in LLMs
Introduction
The paper "FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of LLMs" (2311.09829) presents FollowEval, a novel evaluation framework explicitly designed to measure the proficiency of LLMs in following instructions across multiple dimensions. Unlike prior benchmarks, FollowEval is bilingual (English and Chinese) and fully human-curated to ensure rigor and representativeness. The benchmark aims to expose weaknesses in LLM instruction adherence, covering essential domains such as string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints. The authors demonstrate that current leading LLMs lag significantly behind human performance, emphasizing the need for advances in instruction tuning and model alignment.
Benchmark Design and Construction
FollowEval comprises 200 examples that target five core instruction-following dimensions:
- String Manipulation: Position identification, character insertion, deletion, and replacement tasks probe text processing capacity.
- Commonsense Reasoning: Instances require world knowledge and inference beyond explicit instructions.
- Logical Reasoning: Mathematical logic and character counting are integrated to test formal reasoning.
- Spatial Reasoning: Tasks requiring two- and three-dimensional conceptualization (including character rotation and transformations) probe spatial understanding.
- Response Constraints: Constraints on length, formality, and specific characters test whether outputs stay within explicitly stated bounds.
The benchmark is constructed in three stages: instruction drafting (six authors), expert verification (two independent reviewers), and regex-based rule design for automatic evaluation (two specialists). Each test example is crafted to exercise more than one essential element at once, raising overall difficulty.
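The paper's item format is not reproduced here; the following minimal Python sketch illustrates how a human-written instruction could be paired with a regex rule for automatic checking. The class name, fields, and the example item are hypothetical illustrations of the multi-element design described above, not items taken from the benchmark.

```python
import re
from dataclasses import dataclass

@dataclass
class FollowEvalItem:
    """Hypothetical representation of a single FollowEval-style test case."""
    instruction: str      # natural-language instruction shown to the model
    language: str         # "en" or "zh"
    dimensions: list      # e.g. ["string_manipulation", "response_constraints"]
    answer_pattern: str   # regex that a correct response must match

    def is_followed(self, response: str) -> bool:
        # A response counts as correct only if it satisfies the regex rule.
        return re.search(self.answer_pattern, response.strip()) is not None

# Illustrative item (not from the benchmark): a string-manipulation task combined
# with a response constraint, mirroring the multi-element design described above.
item = FollowEvalItem(
    instruction="Reverse the word 'benchmark' and answer with that single word only.",
    language="en",
    dimensions=["string_manipulation", "response_constraints"],
    answer_pattern=r"^kramhcneb$",
)
print(item.is_followed("kramhcneb"))                        # True
print(item.is_followed("The reversed word is kramhcneb."))  # False: constraint violated
```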
Comparative Analysis with Existing Benchmarks
The authors position FollowEval against earlier benchmarks, which are generally:
- Monolingual (English or Chinese only)
- Automatically generated, which can introduce bias and reduce sample quality
- Limited in scope with respect to compositional and multi-dimensional instruction following
By employing human curation and incorporating both English and Chinese samples, FollowEval addresses the generalizability, linguistic-diversity, and task-complexity shortcomings of prior benchmarks (Chen et al., 2022; Yao et al., 2023; He et al., 2023; Mu et al., 2023; Sun et al., 2023; Jiang et al., 2023). Its regex-driven evaluation enables efficient, consistent, and scalable assessment while avoiding the carbon and financial overhead of LLM-based evaluation systems (Wang et al., 2023; Zheng et al., 2023).
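To illustrate how such regex-driven scoring scales without an LLM judge, the sketch below aggregates overall and per-dimension accuracy over items shaped like the hypothetical FollowEvalItem above. The evaluate function and its return format are assumptions for illustration, not the authors' released harness.

```python
from collections import defaultdict

def evaluate(items, generate):
    """Score a model over FollowEval-style items using their regex rules.

    `generate` is any callable mapping an instruction string to a model response;
    `items` are objects like the hypothetical FollowEvalItem sketched earlier.
    Returns overall accuracy and accuracy broken down by dimension.
    """
    correct = 0
    per_dim = defaultdict(lambda: [0, 0])  # dimension -> [hits, total]
    for item in items:
        ok = item.is_followed(generate(item.instruction))
        correct += ok
        for dim in item.dimensions:
            per_dim[dim][0] += ok
            per_dim[dim][1] += 1
    overall = correct / len(items)
    by_dimension = {d: hits / total for d, (hits, total) in per_dim.items()}
    return overall, by_dimension
```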
Experimental Setup and Results
A representative set of both proprietary (GPT-4, GPT-3.5-Turbo) and open-source (Qwen-Chat, Baichuan, ChatGLM, InternLM, LLaMA-2, AquilaChat2) LLMs is evaluated on FollowEval. Sampling-based decoding strategies (top-k and nucleus sampling) are used, and each model is evaluated three times to mitigate stochasticity.
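A minimal sketch of that repeated-run protocol follows, assuming a generate callable that samples with top-k or nucleus decoding internally and reusing the hypothetical evaluate function above; the paper's exact decoding hyperparameters and aggregation details are not specified here.

```python
import statistics

def repeated_accuracy(items, generate, n_runs=3):
    """Run the sampling-based evaluation several times and average accuracy,
    mirroring the three-run protocol used to smooth out decoding stochasticity.
    """
    scores = []
    for _ in range(n_runs):
        overall, _ = evaluate(items, generate)  # evaluate() from the sketch above
        scores.append(overall)
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread
```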
The main finding is a pronounced performance gap between LLMs and humans:
- Human accuracy: 100%
- Best LLM (GPT-4): ~77.5% average accuracy
- Open-source models: Markedly lower, with accuracies declining as parameter count decreases
Within each model series, performance improves with parameter count, yet even the best-performing LLM falls well short of the human baseline. Proprietary models outperform open-source ones, highlighting both the value and the current limitations of non-public instruction-tuning methodologies.
Implications and Future Directions
FollowEval's results reveal persistent limitations in LLM compliance with complex, multi-constraint instructions. Immediate practical implications include:
- Model Selection: The benchmark offers a tool for discriminative selection of LLMs for high-stakes deployment.
- Alignment Research: Results underscore the imperative for improved RLHF protocols and broader instruction tuning (including multi-dimensional objectives).
- Low-Resource Languages: The authors note that FollowEval could be extended to additional languages, broadening its applicability.
On the theoretical front, systematic multi-dimensional benchmarks like FollowEval may inform the development of robust instruction-following architectures, guiding improvements in compositional reasoning, constraint satisfaction, and natural language understanding. The results also prompt re-evaluation of automated benchmark generation and validation protocols, given how sensitive instruction-following tasks are to subtle linguistic and logical nuances.
Conclusion
FollowEval introduces a high-complexity, multi-dimensional benchmark for measuring LLM adherence to instructions in both English and Chinese. Human annotator involvement and regex-based evaluation ensure task rigor, linguistic diversity, and scalable assessment. Empirical results demonstrate significant deficits in LLM instruction-following relative to humans, particularly under compositional constraints and multi-domain tasks. FollowEval constitutes a critical tool for advancing instruction tuning, model alignment, and robust LLM evaluation, and its expansion to additional linguistic domains is recommended to further support research in effective human-AI interaction.