Scaling Laws For Dense Retrieval (2403.18684v2)
Abstract: Scaling up neural models has yielded significant advancements in a wide array of tasks, particularly in language generation. Previous studies have found that the performance of neural models frequently adheres to predictable scaling laws, correlated with factors such as training set size and model size. This insight is invaluable, especially as large-scale experiments grow increasingly resource-intensive. Yet such scaling laws have not been fully explored in dense retrieval, owing to the discrete nature of retrieval metrics and the complex relationship between training data and model size in retrieval tasks. In this study, we investigate whether the performance of dense retrieval models follows the same scaling laws as other neural models. We propose using contrastive log-likelihood as the evaluation metric and conduct extensive experiments with dense retrieval models of varying parameter counts, trained with different amounts of annotated data. Results indicate that, under our settings, the performance of dense retrieval models follows a precise power law with respect to model size and the number of annotations. Additionally, we examine scaling with prevalent data augmentation methods to assess the impact of annotation quality, and we apply the scaling law to find the best resource allocation strategy under a budget constraint. We believe these insights will significantly advance the understanding of scaling effects in dense retrieval models and offer meaningful guidance for future research.
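For concreteness, below is a minimal Python sketch (not the authors' code) of the three ingredients described in the abstract: an InfoNCE-style contrastive log-likelihood metric, a power-law fit of that metric against model size N and annotation count D, and a toy budget-constrained allocation derived from the fitted law. The joint functional form L(N, D) = E + A*N^(-alpha) + B*D^(-beta), the linear cost model, and all numbers below are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch, assuming an InfoNCE-style metric and a Chinchilla-style joint power law.
# All functional forms, costs, and numbers are hypothetical illustrations.

import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import curve_fit


def contrastive_log_likelihood(q_emb, pos_emb, neg_emb, temperature=1.0):
    """Mean log-likelihood of the annotated positive passage under a softmax
    over the positive and sampled negatives (InfoNCE-style).

    q_emb:   [batch, dim]        query embeddings
    pos_emb: [batch, dim]        embeddings of annotated positive passages
    neg_emb: [batch, n_neg, dim] embeddings of negative passages
    """
    pos_score = (q_emb * pos_emb).sum(-1, keepdim=True) / temperature     # [batch, 1]
    neg_score = torch.einsum("bd,bnd->bn", q_emb, neg_emb) / temperature  # [batch, n_neg]
    scores = torch.cat([pos_score, neg_score], dim=-1)                    # [batch, 1+n_neg]
    return F.log_softmax(scores, dim=-1)[:, 0].mean()                     # log P(positive)


def scaling_law(x, E, A, alpha, B, beta):
    """Assumed joint power law for the contrastive loss (negative log-likelihood)."""
    N, D = x
    return E + A * N ** (-alpha) + B * D ** (-beta)


# Hypothetical observations generated from the assumed law plus noise, purely to
# demonstrate the fitting procedure (not experimental results from the paper).
rng = np.random.default_rng(0)
N_obs = np.repeat([1e7, 3e7, 1e8, 3e8], 4)                 # model sizes (parameters)
D_obs = np.tile([1e4, 3e4, 1e5, 3e5], 4)                   # numbers of annotations
L_obs = scaling_law((N_obs, D_obs), 0.05, 50.0, 0.30, 20.0, 0.35)
L_obs = L_obs + rng.normal(0.0, 0.005, N_obs.size)

params, _ = curve_fit(scaling_law, (N_obs, D_obs), L_obs,
                      p0=[0.1, 10.0, 0.3, 10.0, 0.3],
                      bounds=(0.0, np.inf), maxfev=20000)
E, A, alpha, B, beta = params
print(f"fit: L(N, D) = {E:.3f} + {A:.1f}*N^-{alpha:.3f} + {B:.1f}*D^-{beta:.3f}")

# Toy budget-constrained allocation: with a hypothetical linear cost c_N*N + c_D*D <= budget,
# pick the (N, D) point on the budget frontier with the lowest predicted loss.
c_N, c_D, budget = 1e-5, 1e-2, 3000.0
N_grid = np.geomspace(1e6, 2.5e8, 200)
D_grid = (budget - c_N * N_grid) / c_D
pred = scaling_law((N_grid, D_grid), E, A, alpha, B, beta)
best = int(np.argmin(pred))
print(f"best allocation: N ~ {N_grid[best]:.2e} params, D ~ {D_grid[best]:.2e} annotations")
```

In this sketch the fitted law trades model parameters against annotation budget along the cost frontier; the actual functional form, cost model, and optimal allocation reported in the paper may differ.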
Authors: Yan Fang, Jingtao Zhan, Qingyao Ai, Jiaxin Mao, Weihang Su, Jia Chen, Yiqun Liu