FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions (2403.15246v3)
Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions -- also known as narratives -- developed for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. Through this process, we can measure how well IR models follow instructions via a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, treating them as basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model shows significant improvements after fine-tuning on our training set.
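The abstract mentions a "pairwise evaluation framework" without defining it. As a rough, hypothetical sketch only, the Python snippet below illustrates one way such a pairwise check could work: rank the same candidate pool twice, once under the original TREC narrative and once under an altered narrative that excludes some previously relevant documents, then reward models whose rankings shift in the expected direction. The function names, data layout, and scoring rule here are assumptions for illustration, not the paper's actual metric.

```python
# Hypothetical sketch of a pairwise instruction-following check.
# Not the paper's actual metric; names and scoring rule are assumed.
from typing import List


def rank_of(doc_id: str, ranking: List[str]) -> int:
    """1-based rank of doc_id in a ranked list of document ids."""
    return ranking.index(doc_id) + 1


def pairwise_follow_score(
    ranking_og: List[str],
    ranking_new: List[str],
    newly_nonrelevant: List[str],
) -> float:
    """Fraction of documents excluded by the altered instruction whose
    rank worsens under that instruction: 1.0 means the model fully
    responded to the instruction change, 0.0 means it ignored it."""
    if not newly_nonrelevant:
        return 0.0
    moved_down = sum(
        rank_of(d, ranking_new) > rank_of(d, ranking_og)
        for d in newly_nonrelevant
    )
    return moved_down / len(newly_nonrelevant)


if __name__ == "__main__":
    # Toy rankings a retriever might produce before and after the
    # instruction is narrowed to exclude document d2.
    ranking_og = ["d1", "d2", "d3", "d4"]
    ranking_new = ["d1", "d3", "d4", "d2"]  # d2 drops: instruction followed
    print(pairwise_follow_score(ranking_og, ranking_new, ["d2"]))  # 1.0
```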