FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets (2307.10928v4)

Published 20 Jul 2023 in cs.CL and cs.AI

Abstract: Evaluation of LLMs is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction. However, previous studies have mainly focused on coarse-grained evaluation (i.e. overall preference-based evaluation), which limits interpretability since it does not consider the nature of user instructions that require instance-wise skill composition. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation which decomposes coarse-level scoring to a skill set-level scoring for each instruction. We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations. We publicly release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.

Citations (74)

Summary

  • The paper introduces FLASK, a method that evaluates LLMs using 12 fine-grained skills across four core alignment abilities.
  • It empirically shows that granular evaluation correlates strongly with human judgments while revealing key performance gaps in logical reasoning and factual knowledge.
  • The approach offers actionable insights for researchers and practitioners to improve model selection and fine-tune LLMs for better task-specific performance and user alignment.

Fine-grained LLM Evaluation based on Alignment Skill Sets

The paper introduces FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a nuanced protocol for evaluating LLMs. The ability of LLMs to follow human instructions has advanced significantly, driven by techniques such as reinforcement learning from human feedback (RLHF), yet evaluation often remains confined to coarse-grained metrics that offer limited insight into model capabilities. FLASK proposes a more granular approach, assessing models on the specific skill sets required to address diverse user instructions.

FLASK organizes model evaluation into four primary abilities: Logical Thinking, Background Knowledge, Problem Handling, and User Alignment, which are together subdivided into 12 fine-grained skills, as sketched below. This detailed taxonomy allows for a comprehensive assessment of an LLM's performance across different instructional contexts, providing insights that overall preference scoring does not surface.
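
To make the taxonomy concrete, the sketch below encodes the four abilities and their constituent skills as a Python mapping and averages per-skill scores up to the ability level. The skill names follow the paper's taxonomy and the 1-5 scale matches its scoring rubric, but the `ability_scores` helper is an illustrative assumption about aggregation, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative sketch of the FLASK skill taxonomy (skill names from the paper).
# How the raw scores are produced (human annotator or evaluator LLM) is
# abstracted away here.

FLASK_SKILLS = {
    "Logical Thinking": ["Logical Robustness", "Logical Correctness", "Logical Efficiency"],
    "Background Knowledge": ["Factuality", "Commonsense Understanding"],
    "Problem Handling": ["Comprehension", "Insightfulness", "Completeness", "Metacognition"],
    "User Alignment": ["Conciseness", "Readability", "Harmlessness"],
}

def ability_scores(skill_scores: dict[str, int]) -> dict[str, float]:
    """Average per-skill scores (1-5) up to the four-ability level.

    Only skills annotated as relevant for a given instruction appear in
    `skill_scores`, mirroring FLASK's instance-wise skill composition.
    """
    averages = {}
    for ability, skills in FLASK_SKILLS.items():
        rated = [skill_scores[s] for s in skills if s in skill_scores]
        if rated:
            averages[ability] = sum(rated) / len(rated)
    return averages

# Example: an instruction annotated with three relevant skills.
print(ability_scores({"Factuality": 4, "Logical Correctness": 3, "Completeness": 5}))
# {'Logical Thinking': 3.0, 'Background Knowledge': 4.0, 'Problem Handling': 5.0}
```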

The paper underscores the importance of fine-grained evaluation through empirical validation. The authors demonstrate that this specificity increases the reliability and interpretability of evaluations: fine-grained scores correlate strongly with human-based assessment and mitigate biases inherent in model-based evaluation. FLASK also reveals that current open-source LLMs, despite their advancements, lag behind proprietary models such as GPT-3.5 and Bard in the Logical Thinking and Background Knowledge abilities. The analysis indicates that models like Vicuna and WizardLM perform comparably to proprietary models in Problem Handling and User Alignment, but significant gaps remain in areas that demand logical reasoning and factual knowledge.
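
As a concrete illustration of the reliability check, the snippet below computes a rank correlation between model-based and human-based per-skill scores. The score vectors are invented placeholders, and Spearman's rho is one reasonable choice of agreement statistic for this kind of comparison, not necessarily the exact one reported in the paper.

```python
# Sketch: measuring agreement between evaluator-LLM scores and human scores.
# All numbers are placeholders for illustration.
from scipy.stats import spearmanr

model_scores = [4.2, 3.1, 4.8, 2.5, 3.9, 4.0]  # per-item scores from an evaluator LLM
human_scores = [4.0, 3.4, 4.6, 2.8, 3.7, 4.3]  # per-item scores from human annotators

rho, pvalue = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3g})")
```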

One of the notable contributions of FLASK is its ability to reveal the limitations of state-of-the-art models on complex tasks through its FLASK-Hard subset. Even leading LLMs such as GPT-4 exhibit up to a 50% performance degradation on these challenging instances, highlighting areas for further research and development.
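
To pin down what a 50% degradation means on a per-skill scale, the relative drop can be computed as below; the formula and the example scores are an interpretive sketch, not figures taken from the paper.

```python
def relative_degradation(full_score: float, hard_score: float) -> float:
    """Relative drop from the full evaluation set to the FLASK-Hard subset."""
    return (full_score - hard_score) / full_score

# Hypothetical skill averages on the 1-5 scale (numbers invented):
print(f"{relative_degradation(4.0, 2.0):.0%}")  # 50%
```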

The implications of FLASK are substantial for both academic and practical applications of AI. For researchers, FLASK provides a robust framework to dissect the strengths and weaknesses of LLMs, facilitating targeted improvements. Developers and practitioners can utilize FLASK to select or design models tailored to specific tasks, ensuring better alignment with end-user needs. Moreover, the insights derived from fine-grained evaluations could inform the development of future LLMs, guiding them towards more nuanced task handling and refined user alignment.

In conclusion, FLASK sets the stage for more nuanced AI evaluations, urging the community to look beyond aggregated scores and delve into task-specific capabilities. This approach promises to enhance the fidelity and applicability of LLM evaluations, steering the development of more sophisticated and human-aligned AI systems. Future work may explore extending FLASK's methodologies to multi-turn interactions, multimodal tasks, and non-English evaluations, expanding its applicability in the diverse landscape of AI challenges.