
Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency (2407.21443v1)

Published 31 Jul 2024 in cs.CL and cs.AI

Abstract: Although LLMs have demonstrated impressive performance on various tasks, they still suffer from the factual inconsistency problem known as hallucination. For instance, LLMs occasionally generate content that diverges from the source article, and they tend to extract information that appears at the beginning and end of the context, especially in long document summarization. Inspired by these findings, we propose to improve the faithfulness of LLMs in summarization by impelling them to process the entire article more fairly and faithfully. We present a novel summary generation strategy, namely SliSum, which exploits the ideas of sliding windows and self-consistency. Specifically, SliSum divides the source article into overlapping windows and utilizes the LLM to generate local summaries for the content in each window. Finally, SliSum aggregates all local summaries using clustering and a majority voting algorithm to produce a more faithful summary of the entire article. Extensive experiments demonstrate that SliSum significantly improves the faithfulness of diverse LLMs, including LLaMA-2, Claude-2 and GPT-3.5, in both short and long text summarization, while maintaining their fluency and informativeness and without requiring additional fine-tuning or resources. We further conduct qualitative and quantitative studies to investigate why SliSum works and the impact of its hyperparameters on performance.


Summary

  • The paper presents the SliSum approach, which employs a sliding window and self-consistency to generate more faithful summaries.
  • It divides documents into overlapping windows, uses lexical clustering to filter out infrequently supported content, and resolves contradictions among the remaining statements by majority vote.
  • Experiments across datasets show improved factual accuracy and computational efficiency compared to traditional LLM summarization methods.

Improving Faithfulness of LLMs in Summarization via Sliding Generation and Self-Consistency

Introduction

The paper focuses on enhancing the faithfulness of LLMs in summarization tasks, addressing the prevalent issue of hallucination, in which generated text deviates from the source content. The key contribution is the SliSum approach, which combines a sliding-window technique with self-consistency to generate more faithful summaries. The method divides the source document into overlapping windows, generates a local summary for each window, filters out infrequently supported statements via clustering, and resolves contradictions by majority voting.

Methodology

SliSum Architecture

The SliSum framework consists of three main components (a sketch of the full pipeline appears after this list):

  1. Sliding Generation:
    • Articles are divided into overlapping windows.
    • Each window is summarized independently using an LLM.
    • This generates local summaries that can vary in fidelity (Figure 1).

      Figure 1: The pipeline and example of our proposed SliSum approach.

  2. Filtration:
    • Lexical clustering groups similar statements from the local summaries and filters out irrelevant or inaccurate content.
    • Only frequently mentioned statements are retained, minimizing noise and promoting self-consistency.
  3. Aggregation:
    • Contradiction detection identifies semantically conflicting statements about the same topic.
    • A majority vote then selects the most consistent statement, enhancing the faithfulness of the final summary.
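The following is a minimal sketch of how these three stages could fit together. It is not the authors' implementation: the helper callables `summarize`, `cluster`, and `contradicts` (an LLM prompt wrapper, a lexical clustering routine such as DBSCAN over a similarity matrix, and a contradiction detector, respectively), as well as the default window, step, and support values, are assumptions made purely for illustration.

```python
from collections import Counter

def split_into_windows(sentences, window_size, step_size):
    """Sliding Generation: cut the article into overlapping windows of sentences."""
    last_start = max(len(sentences) - window_size, 0)
    return [sentences[s:s + window_size] for s in range(0, last_start + 1, step_size)]

def slisum(sentences, summarize, cluster, contradicts,
           window_size=40, step_size=20, min_support=2):
    # 1) Sliding Generation: summarize each overlapping window independently.
    statements = []
    for window in split_into_windows(sentences, window_size, step_size):
        statements.extend(summarize(" ".join(window)))  # list of statement strings

    # 2) Filtration: cluster lexically similar statements and keep only clusters
    #    with enough support, i.e. content the LLM produced consistently.
    clusters = [c for c in cluster(statements) if len(c) >= min_support]

    # 3) Aggregation: within each cluster, group mutually consistent statements
    #    and keep the one backed by the most local summaries (majority vote).
    summary = []
    for c in clusters:
        votes = Counter()
        for stmt in c:
            key = next((k for k in votes if not contradicts(k, stmt)), stmt)
            votes[key] += 1
        summary.append(votes.most_common(1)[0][0])
    return " ".join(summary)
```

In this sketch, `summarize` would wrap an LLM call that returns a list of statements, and `cluster` could be any lexical clustering method, mirroring the clustering-plus-voting aggregation described above.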

Experiments and Results

The SliSum method was tested across several datasets, including CNN/DM, XSum, arXiv, and PubMed. These evaluations demonstrated that SliSum significantly improves the faithfulness of summaries without compromising fluency or informativeness. Notably, the use of overlapping windows effectively reduced the impact of the LLMs' positional biases (Figure 2).

Figure 2: The performance of GPT-3.5 evaluated on samples of different lengths.

Hyperparameter Analysis

The impact of SliSum's hyperparameters was analyzed in detail, including the window size and the ratio of window size to step size. Results indicate that tuning this ratio improves factual consistency, while excessively long windows degrade summary quality (Figure 3).

Figure 3: Impact of the ratio L_w / L_s (left) and window size (right) on the faithfulness of GPT-3.5.
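For intuition about these two hyperparameters, the short calculation below (with assumed, illustrative values rather than the paper's actual settings) shows how the ratio of window size to step size controls both the number of windows and how many times each sentence gets re-summarized.

```python
def window_stats(doc_len, window_size, step_size):
    """Number of overlapping windows and approximate passes over each sentence."""
    n_windows = max(doc_len - window_size, 0) // step_size + 1
    coverage = window_size / step_size  # ~times each sentence appears in a window
    return n_windows, coverage

# Illustrative example: a 200-sentence article with a 40-sentence window.
for ratio in (1, 2, 4):
    n, cov = window_stats(doc_len=200, window_size=40, step_size=40 // ratio)
    print(f"L_w/L_s = {ratio}: {n} windows, ~{cov:.0f} passes per sentence")
```

A larger ratio gives the voting stage more independent local summaries per statement, at the cost of more LLM calls per article.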

Complexity and Implementation Considerations

Theoretical analysis and empirical tests show that SliSum's cost scales linearly with document length, in contrast to standard single-pass LLM summarization, whose attention cost grows quadratically with input length. Moreover, because each window is summarized independently, the local generation step can be parallelized for additional speedups.
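A back-of-the-envelope version of this argument, assuming the per-window cost is dominated by quadratic self-attention over a fixed window of length L_w with step L_s (notation as in Figure 3), runs as follows:

```latex
% Rough cost model: number of windows times cost per window,
% with L_w and L_s held constant as the article length L grows.
\[
  \underbrace{\left(\frac{L - L_w}{L_s} + 1\right)}_{\text{number of windows}}
  \cdot
  \underbrace{O\!\left(L_w^{2}\right)}_{\text{cost per window}}
  \;=\;
  O\!\left(\frac{L_w^{2}}{L_s}\, L\right)
  \;=\;
  O(L),
  \qquad \text{vs. } O(L^{2}) \text{ for a single full-context pass.}
\]
```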

Conclusion

The SliSum approach offers a practical and effective way to improve the faithfulness of LLM-generated summaries. By combining sliding-window generation with self-consistency mechanisms, SliSum improves both short and long text summarization. Because it reduces hallucination without additional fine-tuning or external resources, it can be integrated into existing LLM pipelines. Future research could extend these techniques to real-time applications and other text generation tasks.
