
DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation (2404.02489v1)

Published 3 Apr 2024 in cs.IR and cs.CL

Abstract: State-of-the-art neural rankers pre-trained on large task-specific training data such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, also called zero-shot. However, zero-shot neural ranking may be sub-optimal, as it does not take advantage of the target domain information. Unfortunately, acquiring sufficiently large and high quality target training data to improve a modern neural ranker can be costly and time-consuming. To address this problem, we propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature, namely how to automatically generate both effective and diverse synthetic training data to fine tune a modern neural ranker for a new domain. Specifically, DUQGen produces a more effective representation of the target domain by identifying clusters of similar documents; and generates a more diverse training dataset by probabilistic sampling over the resulting document clusters. Our extensive experiments, over the standard BEIR collection, demonstrate that DUQGen consistently outperforms all zero-shot baselines and substantially outperforms the SOTA baselines on 16 out of 18 datasets, for an average of 4% relative improvement across all datasets. We complement our results with a thorough analysis for more in-depth understanding of the proposed method's performance and to identify promising areas for further improvements.


Summary

  • The paper introduces a novel approach that automatically generates diverse synthetic query-document pairs to adapt neural rankers to specialized domains.
  • It employs domain document clustering, few-shot LLM prompting, and negative pair mining to create high-quality training data without manual curation.
  • Experimental evaluations on 18 BEIR datasets show an average 4% relative improvement over state-of-the-art methods, with gains on 16 of the 18 datasets.

DUQGen: Unsupervised Domain Adaptation for Neural Rankers through Diversified Query Generation

Introduction to DUQGen

Neural rankers are typically fine-tuned on large, general-purpose datasets such as MS-MARCO, which lets them learn domain-general relevance features that transfer to many tasks without further adaptation (the zero-shot setting). However, performance often degrades in specialized domains where no domain-specific training data is available. DUQGen addresses the cost of acquiring high-quality, domain-specific training data for neural-ranker fine-tuning by automatically generating synthetic query-document pairs that are both representative of the target domain and diverse within it. This approach overcomes limitations of prior unsupervised domain adaptation methods, outperforming state-of-the-art (SOTA) baselines on 16 of 18 datasets in the BEIR benchmark.

Methodology behind DUQGen

DUQGen's methodology comprises four main components (illustrative sketches of the first three appear after the list):

  1. Domain Document Selection: This involves clustering documents within the target domain to capture representative samples. Probabilistic sampling over these clusters then ensures the selection of a diverse set of documents.
  2. Synthetic Query Generation: Utilizing a pre-trained LLM, DUQGen generates synthetic queries from the sampled documents. This process is steered by a few-shot prompting approach, which leverages in-domain examples to produce more relevant queries.
  3. Negative Pairs Mining: To build a comprehensive training dataset, DUQGen generates negative query-document pairs by leveraging first-stage retrievers like BM25, ensuring the training set contains a balanced mix of positive and negative examples.
  4. Fine-tuning with Synthetic Data: The generated synthetic dataset is used to fine-tune any pre-trained neural ranker, adapting it to the specific target domain without necessitating manual data curation.
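
To make the document-selection step concrete, the following is a minimal sketch assuming k-means clustering over sentence-transformer embeddings, with documents sampled in proportion to cluster size. The encoder, cluster count, and sampling distribution are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch of DUQGen-style document selection (not the authors' code).
# Assumptions: k-means over sentence-transformer embeddings; documents are drawn
# with probability proportional to cluster size, uniformly within each cluster.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer


def select_documents(docs, n_clusters=100, n_samples=1000, seed=42):
    rng = np.random.default_rng(seed)
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
    embeddings = encoder.encode(docs, normalize_embeddings=True)

    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)

    # Probabilistic sampling over clusters keeps the sample both representative
    # (large clusters contribute more) and diverse (every cluster can be drawn).
    cluster_sizes = np.bincount(labels, minlength=n_clusters)
    cluster_probs = cluster_sizes / cluster_sizes.sum()

    selected = []
    for _ in range(n_samples):
        cluster = rng.choice(n_clusters, p=cluster_probs)
        members = np.flatnonzero(labels == cluster)
        selected.append(int(rng.choice(members)))
    return [docs[i] for i in selected]
```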
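
The query-generation step relies on few-shot prompting with in-domain examples. The helper below is a hypothetical illustration of such a prompt; the actual wording, number of demonstrations, and choice of LLM used in DUQGen may differ.

```python
# Hypothetical few-shot prompt builder for synthetic query generation.
# The instruction wording and demonstration format are assumptions, not the paper's prompt.
def build_query_prompt(document: str, demonstrations: list[tuple[str, str]]) -> str:
    """demonstrations: in-domain (document, query) pairs used as few-shot examples."""
    parts = ["Write a search query that the given document answers.\n"]
    for demo_doc, demo_query in demonstrations:
        parts.append(f"Document: {demo_doc}\nQuery: {demo_query}\n")
    parts.append(f"Document: {document}\nQuery:")
    return "\n".join(parts)
```

The resulting prompt can be sent to any instruction-following LLM, and the first generated line is taken as the synthetic query, paired with the sampled document as a positive training example.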
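
Negative mining can be sketched with a standard BM25 retriever such as Pyserini; the index path, retrieval depth, and negative-selection rule below are assumptions made for illustration.

```python
# Hedged sketch of BM25-based negative mining (index path and depths are assumptions).
from pyserini.search.lucene import LuceneSearcher


def mine_negatives(query: str, positive_docid: str, searcher: LuceneSearcher,
                   k: int = 100, n_negatives: int = 5) -> list[str]:
    """Retrieve top-k BM25 results and keep non-positive docids as negatives."""
    hits = searcher.search(query, k=k)
    return [hit.docid for hit in hits if hit.docid != positive_docid][:n_negatives]


# Usage (assumed index directory built over the target corpus):
# searcher = LuceneSearcher("indexes/target-domain-bm25")
# negatives = mine_negatives("synthetic query text", "doc123", searcher)
```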

Comprehensive Evaluation

DUQGen was evaluated against a range of baselines on the BEIR benchmark, using 18 diverse datasets that cover different domains and tasks. The experiments show that DUQGen consistently outperforms zero-shot baselines and SOTA unsupervised domain adaptation methods, with an average relative improvement of 4% across all datasets. Notably, DUQGen achieves this with a comparatively small synthetic training set, outperforming methods that rely on much larger synthetic datasets for fine-tuning.

Theoretical and Practical Implications

  • Improving Domain Adaptation: DUQGen offers a novel and effective approach to unsupervised domain adaptation for neural ranking, addressing the challenges associated with acquiring domain-specific training data. This has significant implications for the development of more adaptable and efficient neural ranking models.
  • Reducing Dependency on Large Synthetic Datasets: By focusing on the quality rather than the quantity of synthetic training data, DUQGen reduces the computational resources required for fine-tuning neural rankers, making the domain adaptation process more accessible.

Future Directions

The success of DUQGen in leveraging LLMs for generating representative and diverse synthetic training data opens up new avenues for research, particularly in exploring other domains where acquiring labeled data is challenging. Future work could also entail refining the clustering and sampling techniques to further enhance the representativeness and diversity of the synthetic training data, potentially leading to even greater improvements in domain-adapted neural ranking performance.

Conclusion

DUQGen sets a new benchmark for unsupervised domain adaptation in neural ranking, offering a scalable and effective solution to the challenges of fine-tuning neural rankers for specialized domains. By generating high-quality, domain-specific training data, DUQGen enables significant improvements in ranking performance with minimal manual effort, signifying a substantial advancement in the field of information retrieval.