FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (2305.14251v2)
Abstract: Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
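The metric itself is simple to state: decompose a generation into atomic facts, verify each fact against the knowledge source, and report the supported fraction. Below is a minimal sketch in Python of that computation; `extract_atomic_facts` and `is_supported` are hypothetical stand-ins (not the `factscore` package's actual API) for the LM-based decomposition and retrieval-backed verification the abstract describes.

```python
from typing import Callable, List

def factscore(
    generation: str,
    extract_atomic_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Return the fraction of atomic facts in `generation` that the
    knowledge source supports (1.0 = all supported, 0.0 = none)."""
    facts = extract_atomic_facts(generation)  # e.g., LM-based decomposition into short claims
    if not facts:
        return 0.0
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)

# Toy usage with stand-in verifiers; a real estimator would plug in a
# Wikipedia retriever plus an LM judge, as the paper's automated metric does.
demo_facts = {
    "The subject was born in 1960.": True,
    "The subject won a Nobel Prize.": False,
}
score = factscore(
    generation="(biography text)",
    extract_atomic_facts=lambda _text: list(demo_facts),
    is_supported=lambda fact: demo_facts[fact],
)
print(f"FActScore: {score:.2f}")  # -> FActScore: 0.50
```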
- GPT4All: Training an assistant-style chatbot with large scale data distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Re-evaluating evaluation in text summarization. In Proceedings of Empirical Methods in Natural Language Processing.
- Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.
- Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
- Language models are few-shot learners. In Proceedings of Advances in Neural Information Processing Systems.
- Reading Wikipedia to answer open-domain questions. In Proceedings of the Association for Computational Linguistics.
- Generating literal and implied subquestions to fact-check complex claims. In Proceedings of Empirical Methods in Natural Language Processing.
- Seeing things from a different angle: discovering diverse perspectives about claims. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Towards question-answering as an automatic metric for evaluating the content quality of a summary. Transactions of the Association for Computational Linguistics.
- Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.
- Is GPT-3 a good data annotator? arXiv preprint arXiv:2212.10450.
- QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Generating fact checking briefs. In Proceedings of Empirical Methods in Natural Language Processing.
- Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726.
- Enabling large language models to generate text with citations.
- How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
- Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics.
- An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems.
- Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- WiCE: Real-world entailment for claims in Wikipedia. arXiv preprint arXiv:2303.01432.
- Large language models struggle to learn long-tail knowledge. arXiv preprint arXiv:2211.08411.
- LongEval: Guidelines for human evaluation of faithfulness in long-form summarization. In Proceedings of the European Chapter of the Association for Computational Linguistics.
- Evaluating the factual consistency of abstractive text summarization. In Proceedings of Empirical Methods in Natural Language Processing.
- SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics.
- Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems.
- Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine.
- Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.
- Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848.
- GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
- Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. arXiv preprint arXiv:2212.07981.
- Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation.
- ExpertQA: Expert-curated questions and attributed answers. arXiv preprint arXiv:2309.07852.
- When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the Association for Computational Linguistics.
- SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
- SemEval-2019 task 8: Fact checking in community question answering forums. In Proceedings of the 13th International Workshop on Semantic Evaluation.
- Nonparametric masked language modeling. In Findings of the Association for Computational Linguistics: ACL.
- Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In Experimental IR Meets Multilinguality, Multimodality, and Interaction.
- Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Large dual encoders are generalizable retrievers. In Proceedings of Empirical Methods in Natural Language Processing.
- Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
- OpenAI. 2022. ChatGPT blog post. https://openai.com/blog/chatgpt.
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Training language models to follow instructions with human feedback. In Proceedings of Advances in Neural Information Processing Systems.
- Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Conference of the North American Chapter of the Association for Computational Linguistics.
- KILT: a benchmark for knowledge intensive language tasks. In Conference of the North American Chapter of the Association for Computational Linguistics.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.
- Measuring attribution in natural language generation models. arXiv preprint arXiv:2112.12870.
- The role of context in detecting previously fact-checked claims. In Findings of the Association for Computational Linguistics: NAACL 2022.
- Crowdsourcing lightweight pyramids for manual summary evaluation. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021.
- Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF. arXiv preprint arXiv:2309.09055.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- Brian Thompson and Matt Post. 2020. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In Proceedings of Empirical Methods in Natural Language Processing.
- FEVER: a large-scale dataset for fact extraction and VERification. In Conference of the North American Chapter of the Association for Computational Linguistics.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Fact or fiction: Verifying scientific claims. In Proceedings of Empirical Methods in Natural Language Processing.
- SciFact-open: Towards open-domain scientific claim verification. In Findings of the Association for Computational Linguistics: EMNLP.
- Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the Association for Computational Linguistics.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of Empirical Methods in Natural Language Processing.
- Paraphrastic representations at scale. In Proceedings of Empirical Methods in Natural Language Processing: System Demonstrations.
- Generating scientific claims for zero-shot scientific fact checking. In Proceedings of the Association for Computational Linguistics.
- A critical evaluation of evaluations for long-form question answering. In Proceedings of the Association for Computational Linguistics.
- FLASK: Fine-grained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928.
- Automatic evaluation of attribution by large language models. arXiv preprint arXiv:2305.06311.
- Shiyue Zhang and Mohit Bansal. 2021. Finding a balanced degree of automation for summary evaluation. In Proceedings of Empirical Methods in Natural Language Processing.
- BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations.