Instruction-Following Evaluation for Large Language Models (2311.07911v1)

Published 14 Nov 2023 in cs.CL, cs.AI, and cs.LG

Abstract: One core capability of LLMs is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for LLMs. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval

Citations (102)

Summary

  • The paper introduces IFEval, a benchmark that uses 25 verifiable instruction types to objectively evaluate LLM instruction-following.
  • Prompts are synthesized through few-shot prompting and manual curation, and compliance is scored with strict and loose accuracy metrics.
  • Evaluation results reveal that GPT-4 outperforms PaLM 2, highlighting both current challenges and future directions in LLM evaluation.

Instruction-Following Evaluation for LLMs

The paper "Instruction-Following Evaluation for LLMs" (2311.07911) introduces IFEval, a benchmark evaluating the instruction-following ability of LLMs. It utilizes a set of "verifiable instructions" designed to allow objective verification. This paper provides insights into the challenges of assessing how well LLMs adhere to given instructions and proposes a novel approach based on verifiable and reproducible metrics. Here, we explore the methodology, evaluation outcomes, and considerations for future developments.

Methodology and Implementation

Verifiable Instructions

The authors designed IFEval around objectively verifiable instructions. They defined 25 instruction types, such as "write at least 400 words" or "mention the keyword 'AI' at least three times," and constructed around 500 prompts, each embedding one or more of these instructions. Whether a response complies with each instruction is then checked programmatically, which removes the subjectivity of human or LLM-based judging.

Figure 1: Instructions such as "write at least 25 sentences" can be automatically and objectively verified. We build a set of prompts with verifiable instructions, for evaluating the instruction-following ability of LLMs.
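
To make this concrete, the snippet below sketches verifiers for two instruction types of the kind IFEval uses ("write at least N words", "mention a keyword at least N times"). It is a minimal illustration, not the checking code released with the paper; the function names and thresholds are hypothetical.

```python
import re


def check_min_word_count(response: str, min_words: int = 400) -> bool:
    """Check 'write at least N words' by counting whitespace-separated tokens."""
    return len(response.split()) >= min_words


def check_keyword_frequency(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """Check 'mention the keyword X at least N times' via a case-insensitive whole-word match."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


# A prompt may embed several verifiable instructions; the response must pass all of them.
response = "AI assistants are improving quickly. Many teams now ship AI features built on AI models."
results = [
    check_min_word_count(response, min_words=10),
    check_keyword_frequency(response, keyword="AI", min_count=3),
]
print(results)  # one boolean per instruction, e.g. [True, True]
```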

Prompt Synthesis and Verification

Prompts were synthesized by combining few-shot prompting with manual curation so that they remain logically coherent and diverse. Compliance is then verified with both strict-accuracy and loose-accuracy metrics to accommodate nuances and edge cases in instruction adherence. A rough sketch of the synthesis step appears below.
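
The sketch below illustrates such a synthesis pipeline under stated assumptions: it takes an LLM generation callable (generate_fn) and a pool of seed prompts, applies a cheap automatic filter, and leaves manual curation as a follow-up step. The helper names and the length filter are hypothetical stand-ins, not the paper's actual generation code.

```python
import random


def build_few_shot_prompt(seed_prompts, k=3):
    """Assemble a few-shot prompt asking an LLM to propose a new evaluation prompt.
    Illustrative reconstruction only; not the authors' released code."""
    examples = "\n".join(f"- {p}" for p in random.sample(seed_prompts, k))
    return (
        "Each prompt below contains one or more verifiable instructions:\n"
        f"{examples}\n"
        "Write one new prompt in the same style."
    )


def synthesize_candidates(generate_fn, seed_prompts, n_candidates=100, min_length=40):
    """Generate candidate prompts with an LLM callable and apply a cheap automatic
    filter; the paper's manual curation (removing illogical or redundant prompts)
    would follow on the returned list."""
    candidates = [generate_fn(build_few_shot_prompt(seed_prompts)) for _ in range(n_candidates)]
    return sorted({c.strip() for c in candidates if len(c.strip()) >= min_length})
```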

IFEval Metrics

The evaluation metrics include:

  • Prompt-level Accuracy: Accuracy based on whether all instructions in a prompt are followed.
  • Instruction-level Accuracy: Measures adherence on a per-instruction basis.
  • Strict vs. Loose Accuracy: Strict accuracy requires the response, exactly as generated, to satisfy each instruction; loose accuracy re-checks lightly transformed variants of the response (e.g., with markdown modifiers or a boilerplate first or last line removed) to reduce false negatives. A sketch of how these metrics can be computed follows this list.
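
The sketch below shows one way these four numbers could be computed from per-instruction verifiers (callables returning True or False). The strip_variants transformation set is only illustrative of the kind of loosening the paper describes; it is not the official implementation, and the function names are hypothetical.

```python
def strip_variants(response: str):
    """Lightly transformed copies of a response used for the loose check.
    The exact transformation set here is illustrative, not the paper's official list."""
    lines = response.splitlines()
    variants = {response, response.replace("*", "")}
    if len(lines) > 1:
        variants.add("\n".join(lines[1:]))   # drop a leading "Sure, here is ..." line
        variants.add("\n".join(lines[:-1]))  # drop a trailing sign-off line
    return variants


def evaluate(examples):
    """examples: list of (response, verifiers) pairs, where each verifier is a
    callable str -> bool for one verifiable instruction in the prompt."""
    prompt_strict = prompt_loose = 0
    inst_strict = inst_loose = inst_total = 0
    for response, verifiers in examples:
        strict = [v(response) for v in verifiers]
        loose = [any(v(r) for r in strip_variants(response)) for v in verifiers]
        prompt_strict += all(strict)
        prompt_loose += all(loose)
        inst_strict += sum(strict)
        inst_loose += sum(loose)
        inst_total += len(verifiers)
    n = len(examples)
    return {
        "prompt_level_strict": prompt_strict / n,
        "prompt_level_loose": prompt_loose / n,
        "instruction_level_strict": inst_strict / inst_total,
        "instruction_level_loose": inst_loose / inst_total,
    }
```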

Evaluation Results

Benchmarking LLMs

The paper evaluates GPT-4 and PaLM 2 on IFEval. Instruction adherence varies considerably across instruction categories, with GPT-4 generally outperforming PaLM 2.

Figure 2: Instruction-level strict-accuracy of each model, separated by each instruction category.

Figure 3: Instruction following accuracy per detailed category.

The results, summarized in the paper's accuracy tables, reflect the current state of LLM instruction-following capabilities and highlight areas for improvement in understanding and following complex instructions.

Discussion and Future Directions

Challenges and Limitations

While IFEval offers a robust framework, it primarily focuses on instructions that are easily verifiable. The authors acknowledge that real-world applications often involve more complex, less objectively verifiable instructions. Extending IFEval to include such nuances poses a future challenge.

Expanding the Framework

Suggested future work involves:

  1. Diversity Enhancement: Expanding the range of verifiable instructions and incorporating complex real-world scenarios.
  2. Multi-modal Extensions: Incorporating multi-modal evaluation capabilities, like generating captions for images or video.

By evolving the benchmark, IFEval can provide a comprehensive tool for future LLM development and evaluation.

Conclusion

The introduction of IFEval marks an important step toward standardized evaluation of instruction following by LLMs. Through its verifiable methodology, the benchmark addresses key challenges and offers a foundation for more comprehensive, automatic, and objective evaluation processes. The paper’s contributions pave the way for enhancing LLM instruction-following capabilities, crucial for their deployment in sensitive domains where precision and reliability are paramount.
