RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs (2407.02485v1)
Abstract: LLMs typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose RankRAG, a novel instruction fine-tuning framework that tunes a single LLM for the dual purposes of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs rank contexts surprisingly well once a small fraction of ranking data is added to the training blend, outperforming existing expert ranking models, including the same LLM fine-tuned exclusively on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-04-09, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks. Our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and the GPT-4 models on nine knowledge-intensive benchmarks. In addition, it performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating strong generalization to new domains.
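To make the described retrieve-rerank-generate flow concrete, here is a minimal sketch of how a single instruction-tuned LLM can serve both roles at inference time: it first scores the retrieved candidates for relevance and then answers from the top-ranked ones. This is an illustration under stated assumptions, not the paper's implementation; the `llm` and `retrieve` callables and the prompt wording are hypothetical placeholders.

```python
from typing import Callable, List


def rerank_with_llm(
    llm: Callable[[str], str],
    question: str,
    contexts: List[str],
) -> List[str]:
    """Score each retrieved context with the same LLM and sort by relevance."""
    scored = []
    for ctx in contexts:
        # Hypothetical relevance prompt: ask the model whether the passage
        # helps answer the question and parse a True/False verdict.
        prompt = (
            f"Question: {question}\n"
            f"Passage: {ctx}\n"
            "Does the passage contain information useful for answering the "
            "question? Answer True or False."
        )
        verdict = llm(prompt).strip().lower()
        score = 1.0 if verdict.startswith("true") else 0.0
        scored.append((score, ctx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ctx for _, ctx in scored]


def rankrag_answer(
    llm: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    question: str,
    n_retrieve: int = 100,
    k_keep: int = 5,
) -> str:
    """Retrieve, rerank with the LLM, then generate from the top-k contexts."""
    candidates = retrieve(question, n_retrieve)  # retriever's top-N passages
    top_contexts = rerank_with_llm(llm, question, candidates)[:k_keep]
    context_block = "\n\n".join(top_contexts)
    answer_prompt = (
        "Use the following passages to answer the question.\n\n"
        f"{context_block}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(answer_prompt)
```

In the paper, the point is that both capabilities come from one instruction-tuning blend: the same model weights handle the ranking prompt and the generation prompt, rather than relying on a separate expert reranker.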