RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs (2407.02485v1)
Abstract: LLMs typically use the top-k contexts returned by a retriever in retrieval-augmented generation (RAG). In this work, we propose RankRAG, a novel instruction fine-tuning framework that tunes a single LLM for the dual purposes of context ranking and answer generation in RAG. In particular, the instruction-tuned LLM works surprisingly well when a small fraction of ranking data is added to the training blend, outperforming existing expert ranking models, including the same LLM fine-tuned exclusively on a large amount of ranking data. For generation, we compare our model against many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-04-09, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks. Our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and the GPT-4 models on nine knowledge-intensive benchmarks. It also performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating strong generalization to new domains.
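To make the dual-purpose design concrete, below is a minimal sketch of a RankRAG-style inference pipeline: one instruction-tuned LLM first reranks a pool of retrieved contexts, then answers from the top-k survivors. The `retrieve` and `llm` callables, the prompt wording, and the hard True/False scoring are illustrative assumptions, not the paper's exact templates.

```python
from typing import Callable, List, Tuple

def rank_then_generate(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical retriever: (query, N) -> passages
    llm: Callable[[str], str],                  # the single instruction-tuned LLM
    n_retrieve: int = 20,
    k_keep: int = 5,
) -> str:
    """Rerank retrieved contexts with the same LLM, then generate an answer."""
    # Step 1: recall-oriented retrieval of a large candidate pool.
    candidates = retrieve(question, n_retrieve)

    # Step 2: the *same* LLM judges each candidate's relevance.
    # A hard True/False answer is used here for simplicity; a soft score
    # (e.g., the model's probability of emitting "True") would break ties.
    scored: List[Tuple[float, str]] = []
    for passage in candidates:
        prompt = (
            f"Question: {question}\n"
            f"Passage: {passage}\n"
            "Is this passage relevant to answering the question? "
            "Answer True or False."
        )
        score = 1.0 if llm(prompt).strip().lower().startswith("true") else 0.0
        scored.append((score, passage))

    # Step 3: keep the top-k passages; Python's stable sort preserves the
    # original retriever ordering among equal scores.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_k = [passage for _, passage in scored[:k_keep]]

    # Step 4: generate the final answer grounded in the reranked contexts.
    context_block = "\n\n".join(top_k)
    return llm(f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:")
```

Note that a single model serves both steps; at inference time only the prompt changes between the ranking and generation calls.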
- TopiOCQA: Open-domain conversational question answering with topic switching. TACL, 2022.
- PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Anthropic. Model card and evaluations for Claude models. 2023.
- Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In ICLR, 2024a.
- Reliable, adaptable, and attributable language models with retrieval. arXiv preprint arXiv:2403.03187, 2024b.
- MS MARCO: A human-generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
- Semantic parsing on Freebase from question-answer pairs. In EMNLP, 2013.
- Improving language models by retrieving from trillions of tokens. In ICML. PMLR, 2022.
- BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023a.
- Meditron-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023b.
- Scaling instruction-finetuned language models. JMLR, 25(70), 2024.
- Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023.
- Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In EMNLP, 2019.
- DeepSeek. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.
- GLaM: Efficient scaling of language models with mixture-of-experts. In ICML, 2022.
- DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL, 2019.
- ELI5: Long form question answering. In ACL, 2019.
- doc2dial: A goal-oriented document-grounded dialogue dataset. In EMNLP, 2020.
- Re2G: Retrieve, rerank, generate. In NAACL, 2022.
- Retrieval augmented language model pre-training. In ICML, 2020.
- Measuring massive multitask language understanding. In ICLR, 2021.
- Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING, 2020.
- Unnatural instructions: Tuning language models with (almost) no human labor. In ACL, 2023.
- RAVEN: In-context learning with retrieval-augmented encoder-decoder language models. arXiv preprint arXiv:2308.07922, 2023.
- Leveraging passage retrieval with generative models for open domain question answering. In EACL, 2021.
- Unsupervised dense information retrieval with contrastive learning. TMLR, 2022.
- Atlas: Few-shot learning with retrieval augmented language models. JMLR, 24(251):1–43, 2023.
- Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. In NAACL, 2024.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Active retrieval augmented generation. In EMNLP, 2023.
- What disease does this patient have? A large-scale open-domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- PubMedQA: A dataset for biomedical research question answering. In EMNLP, 2019.
- MedCPT: Contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics, 39(11), 2023.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017.
- Dense passage retrieval for open-domain question answering. In EMNLP, 2020.
- Realtime QA: What’s the answer right now? In NeurIPS, 2023.
- Few-shot reranking for multi-hop QA via language model prompting. In ACL, 2023.
- Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024, 2022.
- SODA: Million-scale dialogue distillation with social commonsense contextualization. In EMNLP, 2023.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- The NarrativeQA reading comprehension challenge. TACL, 2018.
- Natural Questions: A benchmark for question answering research. TACL, 2019.
- OpenAssistant Conversations: Democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
- Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022.
- NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024.
- Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 33, 2020.
- Reasoning over paragraph effects in situations. In Workshop on Machine Reading for Question Answering, 2019.
- How to train your DRAGON: Diverse augmentation towards generalizable dense retrieval. In Findings of EMNLP, 2023.
- RA-DIT: Retrieval-augmented dual instruction tuning. In ICLR, 2024.
- ChatQA: Surpassing GPT-4 on conversational QA and RAG. arXiv preprint arXiv:2401.10225, 2024.
- The flan collection: Designing data and methods for effective instruction tuning. In ICML, 2023.
- Sparse, dense, and attentional representations for text retrieval. TACL, 2021.
- SAIL: Search-augmented instruction learning. arXiv preprint arXiv:2305.15225, 2023.
- Fine-tuning LLaMA for multi-stage text retrieval. arXiv preprint arXiv:2310.08319, 2023.
- When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In ACL, 2023.
- In defense of dual-encoders for neural ranking. In ICML, 2022.
- Meta-AI. Llama 3 model card. 2024.
- Mistral. Mixtral 8x22B. 2024. URL https://mistral.ai/news/mixtral-8x22b/.
- An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval, 2018.
- Generative representational instruction tuning. arXiv preprint arXiv:2402.09906, 2024.
- Document ranking with a pretrained sequence-to-sequence model. In Findings of EMNLP, 2020.
- OpenAI. Introducing ChatGPT, 2022.
- OpenAI. GPT-4, 2023.
- Proving test set contamination in black-box language models. In ICLR, 2024.
- Training language models to follow instructions with human feedback. NeurIPS, 35, 2022.
- MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In CHIL, 2022.
- KILT: A benchmark for knowledge-intensive language tasks. In NAACL, 2021.
- Large language models are effective text rankers with pairwise ranking prompting. In Findings of NAACL, 2024.
- SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
- In-context retrieval-augmented language models. TACL, 2023.
- Simple BM25 extension to multiple weighted fields. In CIKM, 2004.
- End-to-end training of multi-document reader and retriever for open-domain question answering. In NeurIPS, 2021.
- Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of EMNLP, 2023.
- REPLUG: Retrieval-augmented black-box language models. In NAACL, 2024.
- Is ChatGPT good at search? Investigating large language models as re-ranking agents. In EMNLP, 2023.
- BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In NeurIPS, 2021.
- FEVER: A large-scale dataset for fact extraction and verification. In NAACL, 2018.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- NewsQA: A machine comprehension dataset. In RepL4NLP Workshop at ACL, 2017.
- Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In ACL, 2023.
- An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 2015.
- InstructRetro: Instruction tuning post retrieval-augmented pretraining. In ICML, 2024.
- Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
- Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368, 2023a.
- Self-instruct: Aligning language models with self-generated instructions. In ACL, 2023b.
- Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377, 2023c.
- Finetuned language models are zero-shot learners. In ICLR, 2022.
- PMC-LLaMA: Toward building open-source language models for medicine. JAMIA, 2024.
- INSCIT: Information-seeking conversations with mixed-initiative interactions. TACL, 2023.
- Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.
- RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In ICLR, 2024a.
- Retrieval meets long context large language models. In ICLR, 2024b.
- HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018.
- Making retrieval-augmented language models robust to irrelevant context. In ICLR, 2024.
- Generate rather than retrieve: Large language models are strong context generators. In ICLR, 2023a.
- Chain-of-Note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023b.
- Improving language models via plug-and-play retrieval feedback, 2024.
- COCO-DR: Combating distribution shift in zero-shot dense retrieval with contrastive and distributionally robust learning. In EMNLP, 2022.
- RAFT: Adapting language model to domain-specific RAG. arXiv preprint arXiv:2403.10131, 2024.
- TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In ACL, 2021.
- INTERS: Unlocking the power of large language models in search with instruction tuning. In ACL, 2024.