Training Language Models to Generate Text with Citations via Fine-grained Rewards (2402.04315v3)
Abstract: While recent LLMs have proven useful in answering user queries, they are prone to hallucination, and their responses often lack credibility due to missing references to reliable sources. An intuitive solution to these issues would be to include in-text citations referring to external documents as evidence. While previous works have directly prompted LLMs to generate in-text citations, their performance is far from satisfactory, especially for smaller LLMs. In this work, we propose an effective training framework that uses fine-grained rewards to teach LLMs to generate highly supportive and relevant citations while ensuring the correctness of their responses. We also conduct a systematic analysis of applying these fine-grained rewards to common LLM training strategies, demonstrating their advantage over conventional practices. We conduct extensive experiments on Question Answering (QA) datasets taken from the ALCE benchmark and validate the model's generalizability using EXPERTQA. On LLaMA-2-7B, incorporating fine-grained rewards achieves the best performance among the baselines, even surpassing that of GPT-3.5-turbo.
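To make the idea of fine-grained rewards concrete, the sketch below shows one way per-sentence signals for answer correctness and citation quality could be combined into localized rewards. The score definitions, function names, and weights here are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal, illustrative sketch of combining fine-grained (per-sentence)
# reward signals for citation quality and answer correctness.
# All names and weights are hypothetical placeholders, not the paper's method.
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceScores:
    correctness: float          # e.g., overlap with a gold answer, in [0, 1]
    citation_recall: float      # fraction of claims supported by cited passages
    citation_precision: float   # fraction of citations that are actually relevant


def fine_grained_reward(
    sentences: List[SentenceScores],
    w_correct: float = 1.0,
    w_recall: float = 1.0,
    w_precision: float = 1.0,
) -> List[float]:
    """Return one reward per generated sentence rather than a single
    sequence-level scalar, so the policy receives localized credit/blame."""
    return [
        w_correct * s.correctness
        + w_recall * s.citation_recall
        + w_precision * s.citation_precision
        for s in sentences
    ]


# Example: two sentences, the second with weak citation support.
scores = [
    SentenceScores(correctness=1.0, citation_recall=1.0, citation_precision=1.0),
    SentenceScores(correctness=0.5, citation_recall=0.0, citation_precision=0.5),
]
print(fine_grained_reward(scores))  # [3.0, 1.0]
```

In a training pipeline, such per-sentence rewards would feed into a common strategy such as rejection sampling or PPO, rewarding well-cited, correct sentences individually instead of scoring the whole response with one number.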
- Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems.
- Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
- Eli5: Long form question answering. In Association for Computational Linguistics (ACL), pages 3558–3567.
- Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508.
- Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488.
- Realm: Retrieval-augmented language model pre-training. In Proceedings of the International Conference on Machine Learning (ICML), 2020, pages 3929–3938.
- Rethinking with retrieval: Faithful large language model inference. arXiv preprint arXiv:2301.00303, 2022.
- True: Re-evaluating factual consistency evaluation. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 3905–3920.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.
- Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
- Active retrieval augmented generation. arXiv preprint arXiv:2305.06983, 2023.
- Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
- Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems.
- Towards verifiable generation: A benchmark for knowledge-aware language model attribution. arXiv preprint arXiv:2310.05634, 2023.
- Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023.
- Sail: Search-augmented instruction learning. arXiv preprint arXiv:2305.15225, 2023.
- Expertqa: Expert-curated questions and attributed answers. arXiv preprint arXiv:2309.07852, 2023.
- Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022.
- Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251, 2023.
- Webgpt: Browser-assisted question answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844–9855.
- Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.
- The web is your oyster - knowledge-intensive nlp against a very large web corpus. arXiv preprint arXiv:2112.09924, 2021.
- Mauve: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research (JMLR), 21(140), 2020.
- In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 2023.
- Qampari: An open-domain question answering benchmark for questions with many answers from multiple paragraphs. arXiv preprint arXiv:2205.12665, 2022.
- Efficient rlhf: Reducing the memory usage of ppo. arXiv preprint arXiv:2309.00754, 2023.
- Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Asqa: Factoid questions meet long-form answers. arXiv preprint arXiv:2204.06092, 2022.
- Conditionalqa: A complex reading comprehension dataset with conditional answers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3627–3637.
- Towards verifiable text generation with evolving memory and self-reflection. arXiv preprint arXiv:2312.09075, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Fine-grained human feedback gives better rewards for language model training. In Advances in Neural Information Processing Systems.
- Effective large language model adaptation for improved grounding. arXiv preprint arXiv:2311.09533, 2023.
- Making retrieval-augmented language models robust to irrelevant context. arXiv preprint arXiv:2310.01558, 2023.
- Automatic evaluation of attribution by large language models. arXiv preprint arXiv:2305.06311, 2023.
- Training language models with memory augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5657–5673.
- Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023.