FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (2403.02270v3)
Abstract: Recent advancements in text summarization, particularly with the advent of LLMs, have led to remarkable performance. However, a notable challenge persists: a substantial number of automatically generated summaries exhibit factual inconsistencies, such as hallucinations. In response, various approaches for evaluating the factual consistency of summaries have emerged. Yet these newly introduced metrics face several limitations, including a lack of interpretability, a focus on short-document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process on long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.
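The abstract's core idea — aligning each extracted claim to the source via NLI and aggregating the alignments into a score — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the toy word-overlap "NLI" stands in for a real entailment model (the paper uses a trained NLI system), and the aggregation (max entailment over source sentences, averaged over claims) is an assumption about the simplest form such an alignment could take.

```python
# Hedged sketch of a FENICE-style factuality score (illustrative only).
# Each summary claim is aligned to the source by taking the maximum
# entailment probability over document sentences; the final score is
# the average of these per-claim alignments.
from typing import Callable, List


def fenice_style_score(
    doc_sentences: List[str],
    claims: List[str],
    nli_entail_prob: Callable[[str, str], float],
) -> float:
    """Average, over claims, of the best entailment score against any
    source sentence. `nli_entail_prob(premise, hypothesis)` is a
    stand-in for a real NLI model."""
    if not claims:
        return 0.0
    per_claim = [
        max(nli_entail_prob(sent, claim) for sent in doc_sentences)
        for claim in claims
    ]
    return sum(per_claim) / len(per_claim)


# Toy stand-in "NLI": entailment = 1.0 if every word of the claim
# appears in the premise, else 0.0. A real metric would replace this
# with a trained entailment classifier.
def toy_nli(premise: str, hypothesis: str) -> float:
    return 1.0 if set(hypothesis.lower().split()) <= set(premise.lower().split()) else 0.0


doc = ["the cat sat on the mat", "it was a sunny day"]
claims = ["the cat sat", "the dog barked"]
print(fenice_style_score(doc, claims, toy_nli))  # 0.5: one claim supported, one not
```

The max-then-average aggregation also yields interpretability for free: each claim's best-matching source sentence serves as evidence for (or against) that claim.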