Fine-Grained Self-Endorsement Improves Factuality and Reasoning (2402.15631v1)

Published 23 Feb 2024 in cs.CL and cs.AI

Abstract: This work studies improving LLM generations at inference time by mitigating fact-conflicting hallucinations. In particular, we propose a self-endorsement framework that leverages fine-grained, fact-level comparisons across multiple sampled responses. Compared with prior ensemble methods (Wang et al., 2022; Chen et al., 2023) that perform response-level selection, our approach can better alleviate hallucinations, especially for long-form generation tasks. Our approach can broadly benefit smaller and open-source LLMs as it mainly conducts simple content-based comparisons. Experiments on Biographies show that our method can effectively improve the factuality of generations with simple and intuitive prompts across different scales of LLMs. In addition, comprehensive analyses on TriviaQA and GSM8K demonstrate the potential of self-endorsement for broader application.
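The abstract describes the method only at a high level: sample several responses, break a candidate response into individual facts, and keep the facts that the other samples also support. The following minimal, self-contained Python sketch illustrates that fact-level comparison idea. It is not the authors' implementation: the sentence-based fact splitting, the word-overlap support check, and the `threshold` and `min_votes` parameters are simplifying assumptions made here for illustration (the paper itself reports performing these comparisons with simple prompts).

```python
# Illustrative sketch of fact-level self-endorsement (not the authors' code).
# Assumptions: responses are already sampled, each sentence is treated as one
# atomic fact, and "support" is approximated by word overlap instead of an
# LLM-based comparison prompt.
from typing import List


def split_into_facts(response: str) -> List[str]:
    """Naively treat each sentence as one atomic fact."""
    return [s.strip() for s in response.split(".") if s.strip()]


def is_supported(fact: str, other_response: str, threshold: float = 0.6) -> bool:
    """Crude proxy for fact-level comparison: a fact counts as supported by
    another sampled response if most of its words also appear there."""
    words = set(fact.lower().split())
    other = set(other_response.lower().split())
    return len(words & other) / max(len(words), 1) >= threshold


def self_endorse(responses: List[str], min_votes: int) -> List[str]:
    """Keep facts from the first sampled response that are endorsed by at
    least `min_votes` of the other samples; drop the rest as likely
    hallucinations."""
    candidate, others = responses[0], responses[1:]
    endorsed = []
    for fact in split_into_facts(candidate):
        votes = sum(is_supported(fact, r) for r in others)
        if votes >= min_votes:
            endorsed.append(fact)
    return endorsed


if __name__ == "__main__":
    samples = [
        "Marie Curie was born in Warsaw. She won two Nobel Prizes. She invented the telephone",
        "Marie Curie was born in Warsaw. She won two Nobel Prizes",
        "Marie Curie won two Nobel Prizes. She was born in Warsaw in 1867",
    ]
    # Keep only facts that at least one other sample also states.
    print(self_endorse(samples, min_votes=1))
```

Running the example keeps the two facts repeated across samples and drops the unsupported claim about the telephone, mirroring the intuition that facts stated consistently across samples are less likely to be hallucinated, unlike response-level selection, which would keep or discard each response wholesale.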

References (34)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311.
  3. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.
  4. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  5. Chain-of-verification reduces hallucination in large language models.
  6. Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726.
  7. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910.
  8. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR.
  9. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  10. Large language models can self-improve. arXiv preprint arXiv:2210.11610.
  11. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713.
  12. Mistral 7B. arXiv preprint arXiv:2310.06825.
  13. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
  14. Factuality enhanced language models for open-ended text generation. Advances in Neural Information Processing Systems, 35:34586–34599.
  15. Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341.
  16. Let's verify step by step. arXiv preprint arXiv:2305.20050.
  17. Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848.
  18. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
  19. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.
  20. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
  21. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
  22. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852.
  23. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813.
  24. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
  25. Fine-tuning language models for factuality. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
  26. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  27. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. CoRR, abs/2312.08935.
  28. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  29. Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs. arXiv preprint arXiv:2308.09954.
  30. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs.
  31. Large language models as optimizers. arXiv preprint arXiv:2309.03409.
  32. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  33. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
  34. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.