"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models (2402.14499v2)

Published 22 Feb 2024 in cs.CL

Abstract: The open-ended nature of language generation makes the evaluation of autoregressive LLMs challenging. One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to model's diverse response styles such as starting with "Sure" or refusing to answer. Consequently, MCQ evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.


Summary

  • The paper identifies that first-token log probabilities often diverge from complete text responses, with mismatch rates exceeding 60% in some models.
  • It demonstrates that stricter instruction constraints and larger model sizes reduce mismatches, though safety-induced refusals still impact evaluation accuracy.
  • The study critiques first-token evaluation methods and advocates for comprehensive techniques that consider full text outputs for reliable LLM assessment.

First-Token Probabilities and Instruction-Tuned Models

This essay explores the findings of the research paper "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models (2402.14499). The paper investigates the effectiveness of evaluating LLMs, particularly instruction-tuned models, with first-token probabilities in multiple-choice question (MCQ) settings, and identifies significant misalignments with the models' text outputs.

Evaluation of MCQ Accuracy

The evaluation of autoregressive LLMs often uses MCQs, ranking the response options by their first-token log probabilities. This approach assumes that the highest-probability first token reflects the model's intended answer. Yet the diverse response styles shaped by instruction tuning introduce variability in the model's outputs, resulting in frequent mismatches between first-token predictions and complete text outputs. The paper reports mismatch rates exceeding 60% in some models, such as Llama2-7b-Chat (Figure 1).

Figure 1: Example of an LLM's mismatch between the first-token probability prediction ("C") and the text output ("A").

For a comprehensive evaluation, the paper emphasizes moving beyond first-token probability by considering final text outputs. Experiments reveal a consistent divergence between these evaluation methods, especially in models fine-tuned for conversational contexts or safety.
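
To make the first-token procedure concrete, the sketch below shows how option letters are typically ranked by next-token log probability using the Hugging Face transformers API. This is an illustrative reconstruction, not the authors' code: the model name is a placeholder, and the assumption that each option letter maps to a single token is a simplification.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper evaluates Llama2, Mistral-Instruct, and Mixtral.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def first_token_choice(prompt: str, options=("A", "B", "C", "D")) -> str:
    """Rank MCQ options by the log probability each letter receives as the
    first generated token -- the evaluation style the paper critiques."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    # Assumes each letter encodes to a single token; leading-space variants
    # ("A" vs " A") can change this and need care with real tokenizers.
    option_ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
    scores = {o: log_probs[i].item() for o, i in zip(options, option_ids)}
    return max(scores, key=scores.get)
```

The text-output evaluation, by contrast, generates a full response and parses it for the chosen option; the mismatches discussed above arise from comparing these two routes.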

Experimental Setup and Results

The researchers used the OpinionQA dataset, a survey-derived collection whose topics are sensitive enough to provoke refusals from models. They tested six instruction-tuned LLMs: Llama2 (7b, 13b, 70b), Mistral-Instruct (v0.1, v0.2), and Mixtral-8x7b. Each model was prompted at several constraint levels, from low-constraint to high-constraint instructions.

Mismatch Rates and Refusal Rates:

Refusal to answer, a central concern in the paper, is distinguished into two types: explicit selection of a "Refused" option and implicit refusal triggered by sensitive content (Figure 2).


Figure 2: (a) Mismatch rate and (b) Refusal under the instruction of different constraint levels. The light color in the mismatch rate indicates the portion of mismatch due to refusal. Results are averaged across 10 runs.
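
The explicit/implicit refusal distinction can be operationalised roughly as follows. This is a heuristic sketch: the regex patterns and the letter of the "Refused" option are placeholders, and the paper's actual answer-extraction and refusal-detection rules are not reproduced here.

```python
import re

# Placeholder patterns for safety-style replies (implicit refusal).
REFUSAL_PATTERNS = (
    r"\bI (?:cannot|can't|won't) (?:answer|help|assist)",
    r"\bAs an AI\b",
    r"\bprefer not to answer\b",
)

def parse_text_answer(response, refused_option="D"):
    """Map a free-text response to an option letter or the label 'REFUSED'."""
    if any(re.search(p, response, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS):
        return "REFUSED"                       # implicit refusal
    match = re.search(r"\b([A-D])\b", response)
    if match is None:
        return None                            # unparseable response
    letter = match.group(1)
    return "REFUSED" if letter == refused_option else letter  # explicit refusal option

def mismatch_rate(first_token_choices, text_responses):
    """Fraction of items where the first-token pick disagrees with the parsed text answer."""
    parsed = [parse_text_answer(r) for r in text_responses]
    return sum(ft != txt for ft, txt in zip(first_token_choices, parsed)) / len(parsed)
```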

The experiments showed that larger models such as Llama2-70b had lower mismatch rates, which decreased further as the constraint level increased. Even so, the results remained significantly influenced by refusals, often driven by safety fine-tuning. The paper also observed a non-trivial selection bias in the first-token methodology, exacerbated by example templates that mimic specific answer patterns (Figure 3).


Figure 3: Result distribution of first token and text output based on example template with (a) "Answer: C" and (b) "Answer: A/B/C".
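
One simple way to quantify the selection bias visible in Figure 3 is to compare the empirical distribution of picks produced by each evaluation route. The helper below is illustrative and not taken from the paper.

```python
from collections import Counter

def choice_distribution(choices):
    """Fraction of responses that land on each option label."""
    counts = Counter(choices)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# A template ending in "Answer: C" can skew first-token picks toward "C",
# even when the text answers are more evenly spread.
print(choice_distribution(["C", "C", "A", "C", "B", "C"]))
```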

Impact of Decoding Temperature:

Adjusting the decoding temperature also affected response consistency: higher temperatures favor answer diversity, which in turn shifts both mismatch and refusal rates (Figure 4).

Figure 4: Impact of decoding temperature. (a) Consistency. (b) Refusal and Mismatch rate.
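
The temperature effect can be probed by re-sampling the text answer several times per question and measuring how often repeated samples agree. The sketch below assumes the tokenizer and model loaded in the first-token example; generation arguments follow the standard transformers API, and the answer parsing is the same heuristic as before.

```python
import re
from collections import Counter

def sample_answers(prompt, temperature=0.7, n_samples=10):
    """Sample several free-text answers at a given decoding temperature (> 0)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=64,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and decode only the newly generated continuation.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return [tokenizer.decode(t, skip_special_tokens=True) for t in new_tokens]

def consistency(answers):
    """Share of samples agreeing with the most frequent parsed option letter."""
    letters = []
    for a in answers:
        m = re.search(r"\b([A-D])\b", a)
        letters.append(m.group(1) if m else None)
    _, count = Counter(letters).most_common(1)[0]
    return count / len(letters)
```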

Implications and Future Work

The paper raises critical questions about the reliability of first-token evaluation, particularly for instruction-tuned LLMs in sensitive domains or when refusals are likely. The findings advocate evaluation frameworks that align more closely with the natural text output, so that they reflect LLM behavior in practical settings.

Because first-token evaluation can mask selection biases and behaves unpredictably on subjective questions, the paper cautions against relying on it alone for LLM appraisal and instead advocates thorough, nuanced analysis of the text output.

Conclusion

This research provides a rigorous examination of MCQ-based LLM evaluation and underscores significant misalignments that arise when relying solely on first-token probabilities. The implications highlight the need for more transparent and comprehensive evaluation strategies. Future work should explore other probabilistic techniques, such as scoring full candidate sequences, and their alignment with real-world LLM outputs to refine model assessment; a rough sketch of that alternative follows.
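
As an illustration only (reusing the tokenizer and model from the first-token sketch, and glossing over token-boundary effects of concatenating prompt and candidate), sequence-level scoring could look like this:

```python
import torch

def sequence_log_prob(prompt, candidate):
    """Sum the log probabilities of the candidate's tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits            # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # token predicted at each position
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = prompt_ids.shape[1] - 1                # first candidate token in the shifted view
    return token_lp[0, start:].sum().item()

# Candidates (e.g. the full answer strings for options A-D) can then be ranked
# by this score instead of by the first token alone.
```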
