LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding (2404.05825v1)
Published 8 Apr 2024 in cs.IR and cs.AI
Abstract: Recently, embedding-based (dense) retrieval has shown state-of-the-art results compared with traditional sparse, bag-of-words approaches. This paper introduces a model-agnostic, doc-level embedding framework based on LLM augmentation, and it also improves key components of the retrieval-model training process, such as negative sampling and the loss function. Applying this LLM-augmented retrieval framework significantly improves the effectiveness of widely used retriever models, including bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), achieving state-of-the-art results on the LoTTE and BEIR datasets.
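The doc-level embedding idea from the abstract can be sketched as follows: alongside the embedding of a document chunk itself, embeddings of LLM-generated auxiliary fields (for example, synthetic relevant queries and a synthetic title) are combined into a single document vector. This is a minimal illustrative sketch; the field names, mean-pooling, and weights below are assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch of a doc-level embedding: combine the chunk
# embedding with embeddings of LLM-generated fields (synthetic queries,
# synthetic title). Weights and pooling are illustrative assumptions.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def doc_level_embedding(chunk_emb, query_embs, title_emb,
                        w_chunk=0.6, w_queries=0.3, w_title=0.1):
    """Weighted combination of per-field embeddings into one doc vector."""
    q_emb = mean_vector(query_embs)  # pool the synthetic-query embeddings
    return [w_chunk * c + w_queries * q + w_title * t
            for c, q, t in zip(chunk_emb, q_emb, title_emb)]

# Toy 3-dimensional example with made-up embeddings.
chunk = [1.0, 0.0, 0.0]
queries = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
title = [1.0, 1.0, 0.0]
print(doc_level_embedding(chunk, queries, title))  # → [0.7, 0.25, 0.15]
```

In a real retriever, the per-field vectors would come from the underlying encoder (e.g. Contriever or DRAGON), and the field weights could be tuned on a validation set; because the combination happens at indexing time, the framework stays model-agnostic.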
- Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.
- A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations.
- InPars: Data augmentation for information retrieval using large language models. arXiv preprint arXiv:2202.05144.
- Sebastian Bruch. 2021. An alternative cross entropy loss for learning-to-rank. In Proceedings of the Web Conference 2021, pages 118–126.
- Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, page 89–96, New York, NY, USA. Association for Computing Machinery.
- Contextualized offline relevance weighting for efficient and effective neural retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1617–1621.
- Click models for web search. Springer Nature.
- Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086.
- From distillation to hard negative sampling: Making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2353–2359.
- SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2288–2292.
- Click chain model in web search. In Proceedings of the 18th International Conference on World Wide Web, pages 11–20.
- Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 124–131.
- Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613.
- Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
- William P. Jones and George W. Furnas. 1987. Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society for Information Science, 38(6):420–442.
- Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
- Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48.
- Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.
- Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
- How to train your dragon: Diverse augmentation towards generalizable dense retrieval. arXiv preprint arXiv:2302.07452.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Generating training data with language models: Towards zero-shot language understanding. Advances in Neural Information Processing Systems, 35:462–477.
- Adaptive margin ranking loss for knowledge graph embeddings via a correntropy objective function. arXiv preprint arXiv:1907.05336.
- MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS.
- Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
- Yannis Papanikolaou and Andrea Pierleoni. 2020. DARE: Data augmented relation extraction with GPT-2. arXiv preprint arXiv:2004.13845.
- Bridging the gap between relevance matching and semantic matching for short text similarity modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5370–5381.
- Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 42–49.
- The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
- ColBERTv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488.
- Timo Schick and Hinrich Schütze. 2021. Generating datasets with pretrained language models. arXiv preprint arXiv:2104.07540.
- Is ChatGPT good at search? Investigating large language models as re-ranking agents. arXiv preprint arXiv:2304.09542.
- Improving document representations by generating pseudo query embeddings for dense retrieval. arXiv preprint arXiv:2105.03599.
- BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368.
- The LambdaLoss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1313–1322.
- Offline pseudo relevance feedback for efficient and effective single-pass dense retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2209–2214.
- Optimizing web search using web click-through data. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 118–126.
- Generative data augmentation for commonsense reasoning. arXiv preprint arXiv:2004.11546.
- Wenhao Yu. 2022. Retrieval-augmented generation across heterogeneous knowledge. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 52–58.