- The paper introduces IFEval, a benchmark that uses 25 verifiable instruction types to objectively evaluate LLM instruction-following.
- Prompts are constructed by combining few-shot prompting with manual curation, and compliance is scored using both strict and loose accuracy metrics.
- Evaluation results reveal that GPT-4 outperforms PaLM 2, highlighting both current challenges and future directions in LLM evaluation.
Instruction-Following Evaluation for LLMs
The paper "Instruction-Following Evaluation for LLMs" (2311.07911) introduces IFEval, a benchmark evaluating the instruction-following ability of LLMs. It utilizes a set of "verifiable instructions" designed to allow objective verification. This paper provides insights into the challenges of assessing how well LLMs adhere to given instructions and proposes a novel approach based on verifiable and reproducible metrics. Here, we explore the methodology, evaluation outcomes, and considerations for future developments.
Methodology and Implementation
Verifiable Instructions
The authors devised IFEval to evaluate models with objectively verifiable instructions. They defined 25 instruction types, such as "write at least 400 words" or "mention the keyword AI at least three times." The evaluation embeds these instructions in around 500 prompts and checks programmatically whether a model's response complies with each directive, removing much of the subjectivity from evaluating LLM outputs.
Figure 1: Instructions such as "write at least 25 sentences" can be automatically and objectively verified. We build a set of prompts with verifiable instructions for evaluating the instruction-following ability of LLMs.
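To make "objectively verifiable" concrete, here is a minimal Python sketch of checkers for the two example instructions above. This is an illustration, not the paper's released evaluation code; the function names and the whitespace-based word count are assumptions.

```python
import re


def check_min_words(response: str, min_words: int = 400) -> bool:
    """Check a "write at least N words" instruction by counting whitespace-separated tokens."""
    return len(response.split()) >= min_words


def check_keyword_frequency(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """Check a "mention the keyword ... at least N times" instruction (whole-word matches)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response)) >= min_count


if __name__ == "__main__":
    response = "AI is everywhere. AI helps. AI also raises questions."
    print(check_min_words(response, min_words=5))   # True: 9 words
    print(check_keyword_frequency(response))        # True: "AI" appears 3 times
```

Because every check reduces to a deterministic function of the response text, two evaluators running the same prompts will always reach the same verdicts.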
Prompt Synthesis and Verification
The authors employed a rigorous synthesis method, combining few-shot prompting with manual curation so that prompts are both logically coherent and diverse. Compliance is then verified with both strict-accuracy and loose-accuracy metrics to accommodate nuances and edge cases in instruction adherence.
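The loose-accuracy criterion re-checks relaxed variants of a response so that superficial formatting does not produce false negatives. The sketch below captures that idea; the specific transformations (stripping markdown emphasis markers, dropping a leading or trailing line) mirror the kinds of relaxations the paper describes but are illustrative rather than the paper's exact list.

```python
from typing import Callable


def loose_pass(response: str, check: Callable[[str], bool]) -> bool:
    """Return True if the raw response or any relaxed variant passes the checker."""
    lines = response.splitlines()
    variants = [
        response,                      # strict: the raw response
        response.replace("*", ""),     # strip markdown emphasis markers
        "\n".join(lines[1:]),          # drop a leading preamble line
        "\n".join(lines[:-1]),         # drop a trailing sign-off line
        "\n".join(lines[1:-1]),        # drop both
    ]
    return any(check(v) for v in variants)


# Example: a response fails strictly because of a conversational preamble,
# but passes loosely once that line is removed.
check = lambda text: text.startswith("Yes")
print(check("Sure, here you go:\nYes, the claim holds."))              # False (strict)
print(loose_pass("Sure, here you go:\nYes, the claim holds.", check))  # True (loose)
```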
IFEval Metrics
The evaluation metrics include (a short aggregation sketch follows this list):
- Prompt-level Accuracy: the fraction of prompts for which every embedded instruction is followed.
- Instruction-level Accuracy: the fraction of individual instructions that are followed, counted independently of the prompt they appear in.
- Strict vs. Loose Accuracy: strict-accuracy checks the raw response against each instruction, while loose-accuracy also accepts relaxed variants of the response (for example, with markdown formatting stripped) to reduce false negatives.
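Once each instruction has a pass/fail result, the two aggregation levels are straightforward to compute. The sketch below assumes a simple dictionary mapping each prompt to its per-instruction results; this layout is hypothetical, not the paper's data format.

```python
from typing import Dict, List

# Hypothetical layout: each prompt maps to one boolean per embedded instruction.
results: Dict[str, List[bool]] = {
    "prompt_1": [True, True],          # both instructions followed
    "prompt_2": [True, False, True],   # one of three instructions missed
}

# Prompt-level accuracy: a prompt counts only if all of its instructions are followed.
prompt_level = sum(all(flags) for flags in results.values()) / len(results)

# Instruction-level accuracy: every instruction is scored independently.
flat = [flag for flags in results.values() for flag in flags]
instruction_level = sum(flat) / len(flat)

print(f"prompt-level: {prompt_level:.2f}")            # 0.50
print(f"instruction-level: {instruction_level:.2f}")  # 0.80
```

Prompt-level accuracy is the stricter of the two, since a single missed instruction causes the whole prompt to count as a failure.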
Evaluation Results
Benchmarking LLMs
The authors benchmarked GPT-4 and PaLM 2 on IFEval. Instruction adherence varied considerably across instruction categories, with GPT-4 generally outperforming PaLM 2.
Figure 2: Instruction-level strict-accuracy of each model, separated by each instruction category.
Figure 3: Instruction following accuracy per detailed category.
The results, summarized in the paper's accuracy tables, reflect the current state of LLM capabilities and highlight areas for improvement in understanding and following complex instructions.
Discussion and Future Directions
Challenges and Limitations
While IFEval offers a robust framework, it primarily focuses on instructions that are easily verifiable. The authors acknowledge that real-world applications often involve more complex, less objectively verifiable instructions. Extending IFEval to include such nuances poses a future challenge.
Expanding the Framework
Suggested future work involves:
- Diversity Enhancement: Expanding the range of verifiable instructions and incorporating complex real-world scenarios.
- Multi-modal Extensions: Incorporating multi-modal evaluation capabilities, like generating captions for images or video.
By evolving the benchmark, IFEval can provide a comprehensive tool for future LLM development and evaluation.
Conclusion
The introduction of IFEval marks an important step toward standardized evaluation of instruction following by LLMs. Through its verifiable methodology, the benchmark addresses key challenges and offers a foundation for more comprehensive, automatic, and objective evaluation processes. The paper’s contributions pave the way for enhancing LLM instruction-following capabilities, crucial for their deployment in sensitive domains where precision and reliability are paramount.