ColBERTer: Contextualized Late Interaction with Enhanced Reduction

Abstract

Recent progress in neural information retrieval has demonstrated large gains in effectiveness, while often sacrificing the efficiency and interpretability of the neural model compared to classical approaches. This paper proposes ColBERTer, a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction. Along the effectiveness Pareto frontier, ColBERTer's reductions dramatically lower ColBERT's storage requirements while simultaneously improving the interpretability of its token-matching scores. To this end, ColBERTer fuses single-vector retrieval, multi-vector refinement, and optional lexical matching components into one model. For its multi-vector component, ColBERTer reduces the number of stored vectors per document by learning unique whole-word representations for the terms in each document and learning to identify and remove word representations that are not essential to effective scoring. We employ an explicit multi-task, multi-stage training to facilitate using very small vector dimensions. Results on the MS MARCO and TREC-DL collection show that ColBERTer can reduce the storage footprint by up to 2.5x, while maintaining effectiveness. With just one dimension per token in its smallest setting, ColBERTer achieves index storage parity with the plaintext size, with very strong effectiveness results. Finally, we demonstrate ColBERTer's robustness on seven high-quality out-of-domain collections, yielding statistically significant gains over traditional retrieval baselines.

Overview

  • ColBERTer is a neural retrieval model that balances efficiency, effectiveness, and interpretability by combining single-vector retrieval with multi-vector refinement.

  • The model introduces Bag of Whole-Words (BOW) and Contextualized Stopwords (CS) to reduce storage by representing documents with fewer vectors, focusing on unique whole words and removing less informative ones.

  • ColBERTer leverages Margin-MSE loss and multi-task learning to train its components, resulting in robust retrieval performance and resilience to hyperparameter changes.

  • It offers various deployment options to suit different infrastructural needs, with the ability to operate in hybrid retrieval-refinement modes or use sparse or dense indexing.

  • Empirical tests confirm the model's robustness and its ability to maintain retrieval effectiveness with reduced index requirements across different benchmarks and zero-shot out-of-domain scenarios.

Introduction

The landscape of neural information retrieval (IR) has rapidly evolved with the introduction of pre-trained language models such as BERT, which bolster retrieval quality at the cost of model efficiency and interpretability. ColBERTer is a neural retrieval model that addresses these challenges by fusing a single-vector retrieval with a multi-vector refinement model, reinforced by explicit multi-task training. The primary objective of ColBERTer is to significantly decrease storage requirements without compromising effectiveness.

Efficient Representations

ColBERTer reduces storage through two novel components: Bag of Whole-Words (BOW) and Contextualized Stopwords (CS). BOW represents documents in terms of unique whole words rather than all subword tokens, which alone allows storing about 2.5× fewer vectors than token-level ColBERT models. CS additionally learns to remove uninformative words at encoding time, further pruning the stored vectors.
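As a rough illustration of the two components, the sketch below pools BERT subword vectors into one vector per unique whole word (mean pooling is assumed here; the paper's exact aggregation may differ) and applies a learned non-negative scalar gate, so that words gated to zero can be dropped from the index. All names (`whole_word_vectors`, `ContextualizedStopwordGate`, `word_ids`) are illustrative, not taken from the ColBERTer codebase.

```python
import torch

def whole_word_vectors(subword_vecs, word_ids, num_words):
    """Mean-pool subword vectors that belong to the same whole word.
    subword_vecs: (seq_len, dim); word_ids: (seq_len,) long tensor mapping
    each subword to its whole-word index in [0, num_words)."""
    dim = subword_vecs.size(1)
    sums = torch.zeros(num_words, dim).index_add_(0, word_ids, subword_vecs)
    counts = torch.zeros(num_words).index_add_(0, word_ids, torch.ones(len(word_ids)))
    return sums / counts.clamp(min=1).unsqueeze(1)


class ContextualizedStopwordGate(torch.nn.Module):
    """Learns a non-negative scalar weight per whole word; words gated to
    zero carry no score mass and can be removed from the stored index."""

    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(dim, 1)

    def forward(self, word_vecs):
        weights = torch.relu(self.gate(word_vecs))   # (num_words, 1); zero == contextual stopword
        keep_mask = weights.squeeze(-1) > 0          # which word vectors are worth storing
        return word_vecs * weights, keep_mask
```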

Enhanced Interpretability

ColBERTer enhances interpretability by tying its scoring mechanism directly to whole-word representations instead of subword tokens. End users can thus follow the model's word-matching behavior intuitively, a substantial benefit when a clear rationale for ranking decisions is required, for example in regulated settings or wherever fairness and transparency must be demonstrated.
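To make this concrete, the minimal sketch below shows how per-word explanations could be read off a ColBERT-style max-similarity score computed over whole-word vectors. The helper `match_scores` and its inputs are hypothetical, and ColBERTer's exact score aggregation may differ.

```python
import torch

def match_scores(query_words, doc_words, query_vecs, doc_vecs):
    """Refinement score plus a per-query-word explanation: each query word's
    best-matching document word and the corresponding similarity."""
    sim = query_vecs @ doc_vecs.T             # (num_query_words, num_doc_words)
    best_sim, best_idx = sim.max(dim=1)       # max-sim over document words
    explanation = [(q, doc_words[i], s.item())
                   for q, i, s in zip(query_words, best_idx, best_sim)]
    return best_sim.sum().item(), explanation
```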

Training and Effectiveness

The model is trained with a Margin-MSE loss, a knowledge-distillation objective that matches the student's margin between a positive and a negative passage to a cross-encoder teacher's margin. Training is multi-task, with two weighted loss terms supervising the retrieval and refinement components. Empirical testing shows that appropriately tuning these weights yields consistent retrieval performance and, perhaps surprisingly, robustness to small hyperparameter changes. With a tuned score aggregation, the CLS vector alone achieves competitive retrieval results, and combining it with token scoring yields state-of-the-art performance.
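A minimal sketch of such a setup is shown below, assuming per-pair teacher scores from a cross-encoder are available. The function names and the `alpha` weight are illustrative stand-ins, not the paper's tuned configuration.

```python
import torch
import torch.nn.functional as F

def margin_mse(pos_scores, neg_scores, teacher_pos, teacher_neg):
    """Margin-MSE: match the student's positive-negative margin to the teacher's."""
    return F.mse_loss(pos_scores - neg_scores, teacher_pos - teacher_neg)

def colberter_loss(cls_pos, cls_neg, tok_pos, tok_neg,
                   teacher_pos, teacher_neg, alpha=0.5):
    """Weighted multi-task loss over the retrieval (CLS) head and the
    refinement (token) head; alpha is an illustrative weight."""
    retrieval_loss = margin_mse(cls_pos, cls_neg, teacher_pos, teacher_neg)
    refinement_loss = margin_mse(tok_pos, tok_neg, teacher_pos, teacher_neg)
    return alpha * retrieval_loss + (1 - alpha) * refinement_loss
```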

Deployment Versatility

ColBERTer offers multiple deployment scenarios ranging from hybrid retrieval-refinement modes to simplified retrieval methods that leverage either a sparse or dense index exclusively. This flexibility allows practitioners to adapt the model to existing infrastructures, reducing setup complexity.
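The hybrid retrieve-then-refine mode could look roughly like the sketch below: candidates come from the single-vector (CLS) index and are re-scored with the stored whole-word vectors. Here `encoder`, `cls_index`, and `word_index` are hypothetical interfaces, and the plain additive score combination stands in for the paper's tuned aggregation.

```python
def hybrid_search(query, encoder, cls_index, word_index, k=1000, k_final=10):
    """First-stage dense retrieval with the CLS vector, then whole-word
    refinement of the candidate list. All interfaces are illustrative."""
    q_cls, q_word_vecs = encoder.encode_query(query)
    candidates = cls_index.search(q_cls, k)                  # [(doc_id, cls_score), ...]
    rescored = []
    for doc_id, cls_score in candidates:
        d_word_vecs = word_index.load(doc_id)                # stored whole-word vectors
        token_score = (q_word_vecs @ d_word_vecs.T).max(dim=1).values.sum().item()
        rescored.append((doc_id, cls_score + token_score))   # simple additive fusion
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:k_final]
```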

Evaluation and Robustness

The model was rigorously evaluated on standard benchmarks, namely the MS MARCO and TREC-DL collections. Notably, ColBERTer maintains its retrieval effectiveness while substantially reducing index requirements. Its robustness was further confirmed in zero-shot out-of-domain tests, where no collection showed significantly worse results compared to baselines such as TAS-B.

Conclusion

ColBERTer demonstrates that it is feasible to reduce storage overhead while maintaining, and in some cases improving, the quality of neural retrieval systems. The model is a strong candidate for applications that require a balance between efficiency, effectiveness, and interpretability in IR tasks.
