Augmented Embeddings for Custom Retrievals (2310.05380v1)

Published 9 Oct 2023 in cs.IR and cs.LG

Abstract: Information retrieval involves selecting artifacts from a corpus that are most relevant to a given search query. The flavor of retrieval typically used in classical applications can be termed homogeneous and relaxed: queries and corpus elements are both natural language (NL) utterances (homogeneous), and the goal is to pick the most relevant elements from the corpus in the Top-K, where K is large, such as 10, 25, 50 or even 100 (relaxed). Recently, retrieval has been used extensively in preparing prompts for LLMs to enable them to perform targeted tasks. These new applications of retrieval are often heterogeneous and strict: the queries and the corpus contain different kinds of entities, such as NL and code, and retrieval needs to improve at Top-K for small values of K, such as K=1, 3 or 5. Current dense retrieval techniques based on pretrained embeddings provide a general-purpose and powerful approach for retrieval, but they are oblivious to task-specific notions of similarity of heterogeneous artifacts. We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous and strict retrieval. Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding. We empirically validate our approach by showing improvements over the state-of-the-art general-purpose embeddings-based baseline.
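
To make the phrase "low-rank residual adaptation" concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not give the architecture, the training loss, or whether the query, the corpus, or both are adapted. The class name `LowRankResidualAdapter`, the zero initialization of `B`, the cosine-similarity scoring, and the random stand-in embeddings are all assumptions made for illustration. The idea shown is that a frozen embedding `e` is mapped to `e + B(Ae)`, where `A` and `B` are small matrices of rank `r` much smaller than the embedding dimension.

```python
import numpy as np


class LowRankResidualAdapter:
    """Hypothetical adapter: e -> e + B @ (A @ e), with rank r << dim."""

    def __init__(self, dim, rank, seed=0):
        rng = np.random.default_rng(seed)
        # A projects down to the low-rank space (small random init).
        self.A = rng.normal(scale=0.01, size=(rank, dim))
        # B projects back up; zero init means the adapter starts as the identity.
        self.B = np.zeros((dim, rank))

    def __call__(self, emb):
        # emb: (n, dim) array of frozen, pretrained embeddings.
        return emb + emb @ self.A.T @ self.B.T


def top_k(query_emb, corpus_emb, k):
    """Indices of the k corpus rows with the highest cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]


# Random vectors stand in for embeddings returned by a black-box model.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(20, 16))
query = rng.normal(size=(1, 16))

# Assumption: only the query is adapted; the corpus embeddings stay as-is.
adapter = LowRankResidualAdapter(dim=16, rank=4)
adapted_query = adapter(query)[0]
hits = top_k(adapted_query, corpus, k=3)
```

In a real setting `A` and `B` would be trained on task-specific (query, artifact) pairs while the underlying embedding model stays untouched; the training step is omitted here because the abstract does not describe it.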
