"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models (2402.14499v2)

Published 22 Feb 2024 in cs.CL

Abstract: The open-ended nature of language generation makes the evaluation of autoregressive LLMs challenging. One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to model's diverse response styles such as starting with "Sure" or refusing to answer. Consequently, MCQ evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.


Summary

  • The paper identifies that first-token log probabilities often diverge from complete text responses, with mismatch rates exceeding 60% in some models.
  • It demonstrates that stricter instruction constraints and larger model sizes reduce mismatches, though safety-induced refusals still impact evaluation accuracy.
  • The study critiques first-token evaluation methods and advocates for comprehensive techniques that consider full text outputs for reliable LLM assessment.

First-Token Probabilities and Instruction-Tuned Models

This essay explores the findings of the research paper "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models (2402.14499). The paper investigates the effectiveness of evaluating LLMs, particularly instruction-tuned models, with first-token probabilities in multiple-choice question (MCQ) settings, and identifies significant misalignments with the models' text outputs.

Evaluation of MCQ Accuracy

The evaluation of autoregressive LLMs often uses MCQs, ranking the response options by their first-token log probabilities. This approach assumes that the highest-probability first token reflects the model's intended answer. Yet the diverse response styles shaped by instruction tuning introduce variability in the model's outputs, resulting in frequent mismatches between first-token predictions and complete text outputs. The paper reports mismatch rates exceeding 60% in some models, such as Llama2-7b-Chat (Figure 1).

Figure 1: Example of an LLM's mismatch between the first-token probability prediction ("C") and the text output ("A").

For a comprehensive evaluation, the paper emphasizes moving beyond first-token probability by considering final text outputs. Experiments reveal a consistent divergence between these evaluation methods, especially in models fine-tuned for conversational contexts or safety.
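
To make the first-token procedure concrete, the sketch below shows how option letters are typically ranked by next-token log probability using the Hugging Face transformers API. This is an illustrative reconstruction, not the authors' code: the model name is a placeholder, and the assumption that each option letter maps to a single token is a simplification.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper evaluates Llama2, Mistral-Instruct, and Mixtral.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def first_token_choice(prompt: str, options=("A", "B", "C", "D")) -> str:
    """Rank MCQ options by the log probability each letter receives as the
    first generated token -- the evaluation style the paper critiques."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    # Assumes each letter encodes to a single token; leading-space variants
    # ("A" vs " A") can change this and need care with real tokenizers.
    option_ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
    scores = {o: log_probs[i].item() for o, i in zip(options, option_ids)}
    return max(scores, key=scores.get)
```

The text-output evaluation, by contrast, generates a full response and parses it for the chosen option; the mismatches discussed above arise from comparing these two routes.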

Experimental Setup and Results

The researchers used the OpinionQA dataset, a survey-derived collection whose topics are sensitive enough to provoke refusals from models. They tested six instruction-tuned LLMs: Llama2 (7b, 13b, 70b), Mistral-Instruct (v0.1, v0.2), and Mixtral-8x7b. Each model was prompted at several constraint levels, from low-constraint to high-constraint instructions.

Mismatch Rates and Refusal Rates:

Refusal to answer, a central concern in the paper, is distinguished into two types: explicit selection of a "Refused" option and implicit refusal triggered by sensitive content (Figure 2).


Figure 2: (a) Mismatch rate and (b) Refusal under the instruction of different constraint levels. The light color in the mismatch rate indicates the portion of mismatch due to refusal. Results are averaged across 10 runs.
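
The explicit/implicit refusal distinction can be operationalised roughly as follows. This is a heuristic sketch: the regex patterns and the letter of the "Refused" option are placeholders, and the paper's actual answer-extraction and refusal-detection rules are not reproduced here.

```python
import re

# Placeholder patterns for safety-style replies (implicit refusal).
REFUSAL_PATTERNS = (
    r"\bI (?:cannot|can't|won't) (?:answer|help|assist)",
    r"\bAs an AI\b",
    r"\bprefer not to answer\b",
)

def parse_text_answer(response, refused_option="D"):
    """Map a free-text response to an option letter or the label 'REFUSED'."""
    if any(re.search(p, response, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS):
        return "REFUSED"                       # implicit refusal
    match = re.search(r"\b([A-D])\b", response)
    if match is None:
        return None                            # unparseable response
    letter = match.group(1)
    return "REFUSED" if letter == refused_option else letter  # explicit refusal option

def mismatch_rate(first_token_choices, text_responses):
    """Fraction of items where the first-token pick disagrees with the parsed text answer."""
    parsed = [parse_text_answer(r) for r in text_responses]
    return sum(ft != txt for ft, txt in zip(first_token_choices, parsed)) / len(parsed)
```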

The experiments showed that larger models such as Llama2-70b had lower mismatch rates, which decreased further as the constraint level increased. Even so, the results remained significantly influenced by refusals, often driven by safety fine-tuning. The paper also observed a non-trivial selection bias in the first-token methodology, exacerbated by example templates that mimic specific answer patterns (Figure 3).


Figure 3: Result distribution of first token and text output based on example template with (a) "Answer: C" and (b) "Answer: A/B/C".
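
One simple way to quantify the selection bias visible in Figure 3 is to compare the empirical distribution of picks produced by each evaluation route. The helper below is illustrative and not taken from the paper.

```python
from collections import Counter

def choice_distribution(choices):
    """Fraction of responses that land on each option label."""
    counts = Counter(choices)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# A template ending in "Answer: C" can skew first-token picks toward "C",
# even when the text answers are more evenly spread.
print(choice_distribution(["C", "C", "A", "C", "B", "C"]))
```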

Impact of Decoding Temperature:

Adjusting the decoding temperature also affected response consistency: higher temperatures favor answer diversity, which in turn shifts both mismatch and refusal rates (Figure 4).

Figure 4: Impact of decoding temperature. (a) Consistency. (b) Refusal and Mismatch rate.
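
The temperature effect can be probed by re-sampling the text answer several times per question and measuring how often repeated samples agree. The sketch below assumes the tokenizer and model loaded in the first-token example; generation arguments follow the standard transformers API, and the answer parsing is the same heuristic as before.

```python
import re
from collections import Counter

def sample_answers(prompt, temperature=0.7, n_samples=10):
    """Sample several free-text answers at a given decoding temperature (> 0)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=64,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and decode only the newly generated continuation.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return [tokenizer.decode(t, skip_special_tokens=True) for t in new_tokens]

def consistency(answers):
    """Share of samples agreeing with the most frequent parsed option letter."""
    letters = []
    for a in answers:
        m = re.search(r"\b([A-D])\b", a)
        letters.append(m.group(1) if m else None)
    _, count = Counter(letters).most_common(1)[0]
    return count / len(letters)
```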

Implications and Future Work

The paper raises critical questions about the reliability of first-token evaluation, particularly for instruction-tuned LLMs in sensitive domains or when refusals are likely. The findings advocate evaluation frameworks that align more closely with the natural text output, so that they reflect LLM behavior in practical settings.

Because first-token evaluation can mask selection biases and behaves unpredictably on subjective questions, the paper cautions against relying on it alone for LLM appraisal and instead advocates thorough, nuanced analysis of the text output.

Conclusion

This research provides a rigorous examination of MCQ-based LLM evaluation and underscores significant misalignments that arise when relying solely on first-token probabilities. The implications highlight the need for more transparent and comprehensive evaluation strategies. Future work should explore other probabilistic techniques, such as scoring full candidate sequences, and their alignment with real-world LLM outputs to refine model assessment; a rough sketch of that alternative follows.
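
As an illustration only (reusing the tokenizer and model from the first-token sketch, and glossing over token-boundary effects of concatenating prompt and candidate), sequence-level scoring could look like this:

```python
import torch

def sequence_log_prob(prompt, candidate):
    """Sum the log probabilities of the candidate's tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits            # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # token predicted at each position
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = prompt_ids.shape[1] - 1                # first candidate token in the shifted view
    return token_lp[0, start:].sum().item()

# Candidates (e.g. the full answer strings for options A-D) can then be ranked
# by this score instead of by the first token alone.
```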
