
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval

(2109.10086)
Published Sep 21, 2021 in cs.IR, cs.AI, and cs.CL

Abstract

In neural Information Retrieval (IR), ongoing research is directed towards improving the first retriever in ranking pipelines. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors methods has proven to work well. Meanwhile, there has been a growing interest in learning sparse representations for documents and queries, that could inherit from the desirable properties of bag-of-words models such as the exact matching of terms and the efficiency of inverted indexes. Introduced recently, the SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches. In this paper, we build on SPLADE and propose several significant improvements in terms of effectiveness and/or efficiency. More specifically, we modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation. We also report results on the BEIR benchmark. Overall, SPLADE is considerably improved with more than 9% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.

Overview

  • The paper presents advancements in the SPLADE model for information retrieval, introducing significant updates like the transition to max pooling (SPLADE-max) and a document-only version (SPLADE-doc) to improve efficiency and effectiveness.

  • The introduction of a max pooling mechanism aligns SPLADE with related models such as SPARTA and EPIC and yields a notable performance improvement over the sum pooling used in the original SPLADE.

  • Enhancements in SPLADE training through distillation techniques leverage harder negatives, significantly improving retrieval accuracy and closing the performance gap between sparse and dense models.

  • Experimental results demonstrate substantial gains for the modified SPLADE models, with SPLADE-max showing marked improvements and SPLADE-doc achieving competitive performance despite its streamlined approach.

Enhancing SPLADE for Improved Information Retrieval through Sparse Representations

Introduction to SPLADE Enhancements

Recent advancements in neural Information Retrieval (IR) have been directed towards enhancing the efficacy of first-stage retrievers in ranking pipelines. Among the notable innovations, the SPLADE model stands out for generating highly sparse representations while achieving competitive results against both dense and sparse approaches. In this context, the paper introduces several key updates to the SPLADE framework that markedly improve its effectiveness and efficiency.

Modifications to SPLADE

Max Pooling Mechanism

One significant update is the modification of the pooling mechanism within SPLADE. Transitioning from sum to max pooling (referred to as SPLADE-max) not only aligns the model more closely with related works like SPARTA and EPIC but also provides substantial performance improvements. These updates underscore the evolving understanding of optimal token pooling strategies in generating sparse representations for IR.
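To make the change concrete, below is a minimal PyTorch sketch of the two pooling options, assuming mlm_logits are the vocabulary-sized logits produced by the model's MLM head for each input token; the function name and signature are illustrative, not the authors' code.

```python
import torch

def splade_pool(mlm_logits: torch.Tensor,
                attention_mask: torch.Tensor,
                pooling: str = "max") -> torch.Tensor:
    """Turn MLM logits (batch, seq_len, vocab) into sparse term weights (batch, vocab).

    Each token's logits pass through log(1 + ReLU(.)), then are aggregated over
    the sequence dimension with either sum (original SPLADE) or max (SPLADE-max)
    pooling. Padding positions are masked out.
    """
    scores = torch.log1p(torch.relu(mlm_logits))   # (B, L, V), non-negative
    mask = attention_mask.unsqueeze(-1)            # (B, L, 1)

    if pooling == "sum":
        # Original SPLADE: sum each token's contribution per vocabulary term.
        rep = (scores * mask).sum(dim=1)
    elif pooling == "max":
        # SPLADE-max: keep only the strongest activation per vocabulary term.
        rep = (scores * mask).max(dim=1).values
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    return rep                                      # (B, V) sparse term weights
```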

SPLADE Document Encoder

Further expanding the model's capabilities, the introduction of a document-only version of SPLADE, named SPLADE-doc, eliminates the need for query term weighting or expansion. This adjustment leads to a more efficient retrieval process since the document term weights can be pre-computed and indexed, reducing the online inference cost.
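The efficiency argument can be seen in a toy sketch: because only documents are expanded and weighted, query-time scoring reduces to a plain inverted-index traversal over the query's own terms. The index structure and function names below are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical pre-computed SPLADE-doc index: term -> {doc_id: weight}.
# Document term weights come from the document encoder and are indexed offline.
inverted_index: dict[str, dict[str, float]] = defaultdict(dict)

def index_document(doc_id: str, term_weights: dict[str, float]) -> None:
    """Store the already expanded and weighted document terms offline."""
    for term, weight in term_weights.items():
        inverted_index[term][doc_id] = weight

def score_query(query_terms: list[str]) -> dict[str, float]:
    """SPLADE-doc scoring: the query is used as-is (no weighting or expansion),
    so a document's score is the sum of its indexed weights for the query's
    terms -- a standard inverted-index lookup at query time."""
    scores: dict[str, float] = defaultdict(float)
    for term in set(query_terms):
        for doc_id, weight in inverted_index.get(term, {}).items():
            scores[doc_id] += weight
    return dict(scores)
```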

Enhanced Training with Distillation

Incorporating distillation into the SPLADE training process marks a pivotal advancement in refining the model's performance. The model is trained with harder negatives mined by SPLADE itself while distilling scores from a cross-encoder teacher, significantly improving retrieval accuracy and further narrowing the gap between sparse and dense models in neural IR tasks.
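The distillation objective is margin-based in the style of Margin-MSE: the sparse student is trained to reproduce the teacher's score margin between a positive and a negative document for the same query. Below is a minimal sketch under that assumption; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(student_pos: torch.Tensor,
                    student_neg: torch.Tensor,
                    teacher_pos: torch.Tensor,
                    teacher_neg: torch.Tensor) -> torch.Tensor:
    """Margin-MSE distillation: the student (here, SPLADE) is pushed to match
    the positive-minus-negative score margin produced by a cross-encoder
    teacher on the same (query, positive, negative) triplets."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return F.mse_loss(student_margin, teacher_margin)

# Illustrative usage with hypothetical per-triplet score tensors of shape (batch,):
# loss = margin_mse_loss(splade_scores_pos, splade_scores_neg,
#                        cross_encoder_scores_pos, cross_encoder_scores_neg)
```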

Experimental Insights

The experimental results underscore the enhanced performance of the modified SPLADE models. Notably, SPLADE-max demonstrates substantial improvements over the original SPLADE across major metrics, making it competitive with leading dense retrieval methods. Furthermore, SPLADE-doc, which encodes only documents, achieves performance comparable to SPLADE despite its simplified approach, highlighting the potential efficiency gains.

In zero-shot evaluation using the BEIR benchmark subset, DistilSPLADE-max notably outperforms several recent models, underscoring its robustness and generalization capabilities in diverse IR contexts. This is particularly compelling given the inherent challenges in zero-shot evaluation and the model's ability to excel across a wide range of datasets.

Implications and Future Directions

The advancements detailed in this examination of SPLADE underscore the significant potential of sparse lexical models in the field of IR. The modifications, notably the max pooling mechanism and the introduction of distillation training, present a compelling case for the continuing evolution of SPLADE towards greater efficiency and effectiveness.

Looking forward, the success of these enhancements opens several avenues for future research. Further exploration into optimal pooling strategies, the role of document encoders in simplifying IR models, and the integration of advanced training methodologies like distillation lay the groundwork for continued innovation in sparse representation learning for IR.

In summary, the ongoing development of SPLADE and its adaptations exemplify the dynamic nature of research in neural IR, highlighting the model's potential to significantly impact both theoretical understanding and practical applications in the field.
