- The paper introduces a three-phase retrieval pipeline that incorporates explicit user feedback to guide query expansion and neural re-ranking.
- It employs BM25, a kNN re-ranker, and a bias-only fine-tuned Cross-Encoder, achieving an average 5.2% nDCG@20 improvement through rank fusion.
- The approach is both efficient and scalable, making per-query fine-tuning practical for interactive information retrieval systems.
The paper addresses the integration of explicit relevance feedback into neural re-ranking architectures for information-seeking retrieval. Information-seeking scenarios often involve complex, exploratory queries for which users typically have an easier time providing document-level feedback than constructing optimal queries. The proposed framework combines advancements in few-shot learning, parameter-efficient fine-tuning, and retrieval architectures, with an explicit focus on adapting neural models to leverage small amounts of user feedback per query.
Methodological Contributions
The authors define a three-phase, few-shot retrieval and re-ranking pipeline:
- Initial Retrieval and Feedback Collection: BM25 retrieves candidate documents for the user query. Users provide explicit relevance judgments on a small set (k relevant, k non-relevant) of these.
- Query Expansion and Few-Shot Fine-Tuning: Expansion terms, based on the feedback documents, are appended to the original query. BM25 is used again to retrieve a refined set. Simultaneously, neural re-rankers are adapted using the feedback set.
- Final Re-Ranking and Rank Fusion: The refined candidate set is re-ranked using either a similarity-based kNN method or a fine-tuned Cross-Encoder re-ranker. Reciprocal rank fusion merges BM25 and neural scores for improved results.
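The rank-fusion step in phase three can be sketched with the standard reciprocal rank fusion formula (score(d) = Σ 1/(k + rank_i(d)) over the input rankings). A minimal sketch follows; the smoothing constant k = 60 is the common default, not a value stated in the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into a single ranking.

    rankings: list of lists of doc IDs, each ordered best-first.
    k: smoothing constant (60 is the conventional default; the
       paper's exact value is an assumption here).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort doc IDs by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)

# Fusing a lexical (BM25) and a neural ranking of the same candidates:
bm25_ranking = ["d1", "d2", "d3", "d4"]
neural_ranking = ["d3", "d1", "d4", "d2"]
fused = reciprocal_rank_fusion([bm25_ranking, neural_ranking])
```

Because RRF operates on ranks rather than raw scores, it needs no score calibration between the lexical and neural rankers, which is what makes it a convenient way to combine their orthogonal strengths.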
The investigation includes:
- A kNN Re-ranker, where semantic similarity (using MiniLM embeddings) between candidate documents, the query, and feedback documents yields ranking scores. No further model adaptation is required, yielding high efficiency.
- A Cross-Encoder (CE) Re-ranker, using MiniLM fine-tuned per query with either standard few-shot gradient steps or MAML-style meta-learning. Fine-tuning is restricted to bias terms, keeping the memory footprint small.
- Reciprocal Rank Fusion (RRF) between neural and lexical rankers to combine orthogonal strengths.
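The kNN re-ranker above can be approximated with cosine similarities over sentence embeddings. The sketch below assumes embeddings (e.g., MiniLM vectors) are already computed as NumPy arrays; the particular score combination (query similarity plus mean similarity to relevant feedback documents, minus mean similarity to non-relevant ones) is an illustrative assumption, not the paper's exact formula.

```python
import numpy as np

def knn_rerank(cand, query, pos, neg):
    """Order candidate documents by embedding similarity.

    cand: (n, d) candidate embeddings; query: (d,) query embedding;
    pos / neg: (k, d) embeddings of relevant / non-relevant feedback docs.
    Returns candidate indices, best-first.
    """
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    scores = (cos(cand, query[None, :]).ravel()      # similarity to query
              + cos(cand, pos).mean(axis=1)          # pull toward relevant docs
              - cos(cand, neg).mean(axis=1))         # push from non-relevant docs
    return np.argsort(-scores)

# Toy 2-D example: candidate 1 aligns with the query and positive feedback.
cand = np.array([[0.0, 1.0], [1.0, 0.0]])
order = knn_rerank(cand, np.array([1.0, 0.0]),
                   pos=np.array([[1.0, 0.0]]),
                   neg=np.array([[0.0, 1.0]]))
```

Since this requires only embedding lookups and dot products, no gradient updates, it matches the paper's observation that the kNN variant adds minimal latency.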
The system is evaluated on four datasets with dense relevance judgments (Robust04, TREC-COVID, TREC-News, Webis-Touché), transformed into few-shot, feedback-based search tasks.
Empirical Findings
Quantitative Results
- BM25 with query expansion (BM25-QE) performs strongly, providing a robust lexical baseline.
- The vanilla kNN and Cross-Encoder approaches, without fine-tuning, do not outperform BM25-QE except on certain datasets.
- With query-specific fine-tuning (bias-only) of the CE and MAML-based meta-learning, modest but consistent gains are observed over the neural zero-shot baseline and, in some cases, over BM25-QE.
- The fusion of neural and lexical ranks using RRF provides the highest gains, with an average nDCG@20 improvement of 5.2% over the best individual model.
- Additional ablation demonstrates that fine-tuning only the bias terms results in a negligible drop in performance (<1%), supporting the approach's memory efficiency.
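The bias-only adaptation behind these results can be illustrated on a toy scorer: freeze the weights and take gradient steps only on the bias, using the small labeled feedback set. This NumPy sketch uses a logistic scorer as a stand-in for the Cross-Encoder; it is an assumption-laden illustration of the technique, not the paper's training code (which updates the bias parameters of MiniLM).

```python
import numpy as np

def bias_only_finetune(W, b, X, y, lr=0.5, steps=50):
    """Take gradient steps on the bias b only; W stays frozen.

    W: (d,) frozen weight vector, b: scalar bias,
    X: (n, d) feedback-document features, y: (n,) 0/1 relevance labels.
    """
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid relevance scores
        b -= lr * np.mean(p - y)                # logistic-loss gradient w.r.t. b
    return b

# Adapting to a per-query feedback set of 3 relevant / 3 non-relevant docs:
rng = np.random.default_rng(0)
W = rng.normal(size=4)                # frozen "pretrained" weights
X = rng.normal(size=(6, 4))           # feedback-document features
y = np.array([1, 1, 1, 0, 0, 0])      # explicit relevance judgments
b_adapted = bias_only_finetune(W, 0.0, X, y)
```

Because only the bias terms are stored per query, keeping many per-query adapted models in memory is cheap, which is the scalability argument the paper makes.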
Efficiency and Scalability
- The kNN re-ranker adds minimal latency beyond BM25-QE.
- Cross-Encoder inference dominates total latency, but per-query fine-tuning is a minor cost (≈ 22% of re-ranking time), suggesting its feasibility in interactive systems if the number of re-ranked documents is constrained.
- The parameter-efficient approach makes it practical to store many per-query models in memory, addressing scalability for multi-session or multi-user deployments.
Robustness and Generalizability
- Performance improves with more explicit feedback, but is still robust to minimal user input.
- Rank fusion consistently mitigates dataset-specific weaknesses of BM25 and neural methods: the fused top-20 list overlaps substantially with each input ranking while still surfacing results unique to one ranker, improving both recall and precision.
Theoretical and Practical Implications
The paper demonstrates that direct, per-query adaptation of neural re-rankers using explicit relevance feedback is competitive with state-of-the-art lexical query expansion, while offering strong potential for further improvement via rank fusion. The success of bias-only parameter updates is particularly noteworthy, indicating that much of the neural model's retrieval capacity can be quickly and efficiently tuned for specific information needs without full model retraining.
Practically, this architecture makes explicit feedback collection genuinely valuable in deployed IR systems for scientific, legal, and news search, where user willingness for feedback is higher and retrieval precision is crucial.
Limitations and Future Extensions
- The framework requires explicit feedback, which, while feasible in professional domains, may limit applicability in high-throughput web search.
- Training/fine-tuning is conducted with datasets transformed to simulate feedback; in-the-wild user studies remain an open area.
- The approach has not been evaluated in iterative, multi-round feedback settings, where further user interaction might bring diminishing but still substantive gains.
- Generalization to new domains without in-domain fine-tuning may be limited; future avenues include integrating unsupervised domain adaptation.
- Although per-query fine-tuning is efficient in terms of parameters, latency remains higher than pure BM25 for large candidate sets.
Outlook
This work underscores the value of integrating small, explicit user feedback signals directly into neural ranking architectures using parameter- and data-efficient methods. The empirical advantage and practical feasibility of bias-only fine-tuning, alongside efficient similarity-based approaches, broaden the applicability of neural IR in practice. Future research will likely explore richer feedback interaction (including implicit and multi-round feedback), deeper integration with architectures that support fast adaptation (e.g., adapter layers), and more extensive deployment studies in live systems.