Atomic Self-Consistency for Better Long Form Generations (2405.13131v1)

Published 21 May 2024 in cs.CL

Abstract: Recent work has aimed to improve LLM generations by filtering out hallucinations, thereby improving the precision of the information in responses. Correctness of a long-form response, however, also depends on the recall of multiple pieces of information relevant to the question. In this paper, we introduce Atomic Self-Consistency (ASC), a technique for improving the recall of relevant information in an LLM response. ASC follows recent work, Universal Self-Consistency (USC), in using multiple stochastic samples from an LLM to improve the long-form response. Unlike USC, which focuses only on selecting the best single generation, ASC picks authentic subparts from the samples and merges them into a superior composite answer. Through extensive experiments and ablations, we show that merging relevant subparts of multiple samples performs significantly better than picking a single sample. ASC demonstrates significant gains over USC on multiple factoid and open-ended QA datasets - ASQA, QAMPARI, QUEST, and ELI5 - with ChatGPT and Llama2. Our analysis also reveals untapped potential for enhancing long-form generations using the approach of merging multiple samples.
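The merging idea described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's actual algorithm: it assumes atomic facts can be approximated by sentence splits, and it stands in for the paper's learned similarity scoring with simple token-overlap matching. A fact is kept only if it recurs (is "consistent") across enough of the sampled responses.

```python
def atomic_facts(response):
    # Naive sentence split as a stand-in for a real atomic decomposition.
    return [s.strip() for s in response.split(".") if s.strip()]

def similar(a, b, threshold=0.6):
    # Jaccard overlap of word sets as a crude stand-in for embedding similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb)) >= threshold

def atomic_self_consistency(samples, min_support=2):
    # Cluster near-duplicate atomic facts across samples and count support.
    clusters = []  # list of [representative_fact, support_count]
    for resp in samples:
        for fact in atomic_facts(resp):
            for cluster in clusters:
                if similar(fact, cluster[0]):
                    cluster[1] += 1
                    break
            else:
                clusters.append([fact, 1])
    # Keep only facts supported by at least `min_support` samples, then merge.
    kept = [rep for rep, count in clusters if count >= min_support]
    return ". ".join(kept) + "." if kept else ""

samples = [
    "Paris is the capital of France. The Eiffel Tower is in Paris",
    "Paris is the capital of France. France is in Europe",
    "France is in Europe. The Seine flows through Paris",
]
merged = atomic_self_consistency(samples)
```

In this toy run, facts asserted by two samples survive into the merged answer, while facts appearing only once are dropped — mirroring the intuition that merging consistent subparts can recover more correct information than selecting any single sample.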

References (26)
  1. Do language models know when they’re hallucinating references? arXiv preprint arXiv:2305.18248.
  2. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
  3. A benchmark dataset of check-worthy factual claims. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 821–829.
  4. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
  5. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  6. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311.
  7. Chain-of-verification reduces hallucination in large language models.
  8. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv preprint arXiv:2308.11764.
  9. Eli5: Long form question answering. arXiv preprint arXiv:1907.09190.
  10. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  11. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627.
  12. Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848.
  13. Quest: A retrieval dataset of entity-seeking queries with implicit set operations. arXiv preprint arXiv:2305.11694.
  14. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
  15. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
  16. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
  17. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830.
  18. Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563.
  19. Self-evaluation improves selective generation in large language models. arXiv preprint arXiv:2312.09300.
  20. Qampari: An open-domain question answering benchmark for questions with many answers from multiple paragraphs. arXiv preprint arXiv:2205.12665.
  21. Natural language to code translation with execution. arXiv preprint arXiv:2204.11454.
  22. Asqa: Factoid questions meet long-form answers. arXiv preprint arXiv:2204.06092.
  23. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  24. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  25. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  26. Self-consistent reasoning for solving math word problems. arXiv preprint arXiv:2210.15373.