FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

arXiv:2305.14251
Published May 23, 2023 in cs.CL, cs.AI, and cs.LG

Abstract

Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FActScore, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FActScores of biographies of people generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FActScore using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs, an evaluation that would have cost $26K if done by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FActScore is available for public use via pip install factscore.
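
The metric itself is easy to state: break a generation into atomic facts, judge each fact against the knowledge source (Wikipedia in the paper), and report the per-generation fraction of supported facts, averaged over generations. The Python sketch below illustrates only that aggregation step; split_into_atomic_facts and is_supported are hypothetical stand-ins for the paper's LM-based fact splitter and its retrieval-plus-LM verifier, not the API of the released factscore package.

    from typing import Callable, List, Optional

    def fact_score(
        generations: List[str],
        split_into_atomic_facts: Callable[[str], List[str]],  # hypothetical: LM-based atomic-fact splitter
        is_supported: Callable[[str], bool],                   # hypothetical: retrieval + LM check against the knowledge source
    ) -> Optional[float]:
        """Average, over generations, of the fraction of atomic facts judged supported."""
        per_gen = []
        for text in generations:
            facts = split_into_atomic_facts(text)
            if facts:
                per_gen.append(sum(is_supported(f) for f in facts) / len(facts))
        return sum(per_gen) / len(per_gen) if per_gen else None

    # Toy usage with stub components (illustration only).
    if __name__ == "__main__":
        gens = ["Marie Curie was born in Warsaw. She won two Nobel Prizes. She was born in 1850."]
        splitter = lambda t: [s.strip() for s in t.split(".") if s.strip()]
        judge = lambda f: "1850" not in f  # stub: treat the incorrect birth-year claim as unsupported
        print(fact_score(gens, splitter, judge))  # 2 of 3 facts supported -> ~0.67

In the toy usage, the stub judge marks the incorrect birth-year claim as unsupported, so two of three atomic facts are supported and the score is about 0.67. The released package (pip install factscore) performs the fact splitting, retrieval against Wikipedia, and LM-based judgments end to end, none of which are shown here.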

