Abstract

Recent studies have demonstrated the effectiveness of large language models (LLMs) in passage ranking. Listwise approaches such as RankGPT have become the new state of the art in this task. However, the efficiency of RankGPT models is limited by the maximum context length and the relatively high latency of LLM inference. To address these issues, this paper proposes PE-Rank, which leverages a single passage embedding as a compressed context for efficient listwise passage reranking. By treating each passage as a special token, passage embeddings can be fed directly into the LLM, reducing input length. Additionally, the paper introduces an inference method that dynamically constrains the decoding space to these special tokens, accelerating the decoding process. To adapt the model to reranking, a listwise learning-to-rank loss is used for training. Evaluation results on multiple benchmarks demonstrate that PE-Rank significantly improves efficiency in both prefilling and decoding while maintaining competitive ranking effectiveness. The code is available at https://github.com/liuqi6777/pe_rank.

Figure: PE-Rank's two-stage ranking pipeline: (a) retrieval; (b) LLM forward pass; (c) listwise decoding.

Overview

  • PE-Rank introduces a new methodology that leverages passage embeddings to improve the efficiency and speed of listwise reranking in information retrieval (IR) tasks by addressing context length and latency issues inherent to traditional LLMs.

  • The method involves two main stages: an alignment stage that synchronizes retrieval model embeddings with the LLM’s input space, and a learning-to-rank stage that distills knowledge from textual inputs into compressed embedding representations, ensuring the reranker effectively utilizes these embeddings.

  • Empirical evaluations on the TREC DL and BEIR benchmarks show that PE-Rank delivers substantial efficiency gains, achieving up to 4.5 times faster inference while incurring only a marginal decrease in ranking performance, striking a balance between efficiency and effectiveness.

Leveraging Passage Embeddings for Efficient Listwise Reranking with LLMs

"Leveraging Passage Embeddings for Efficient Listwise Reranking with LLMs" presents PE-Rank, a novel methodology that addresses the inefficiencies of current listwise reranking approaches such as RankGPT. The paper describes how passage embeddings, used as compressed context representations, improve the efficiency of listwise passage reranking in information retrieval (IR) tasks.

Introduction to Passage Reranking

Passage ranking optimizes the relevance of documents in response to user queries and is critical in applications like web search. State-of-the-art methodologies typically follow a two-step process: first, a dense retrieval stage identifies candidate passages using a bi-encoder architecture, then a reranker model refines this list for better performance. Traditional rerankers leveraging LLMs encounter limitations due to context length constraints and high inference latency.
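The two-step retrieve-then-rerank pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy scoring functions (a dot product standing in for a bi-encoder retriever, cosine similarity standing in for the LLM-based reranker) and all function names are assumptions.

```python
import numpy as np

def dense_retrieve(query_emb, passage_embs, k=3):
    """Stage 1: score all passages against the query embedding
    (stand-in for a bi-encoder retriever) and keep the top-k."""
    scores = passage_embs @ query_emb
    top = np.argsort(-scores)[:k]
    return top.tolist()

def rerank(query_emb, passage_embs, candidates):
    """Stage 2: a reranker refines the candidate list. Here a toy
    cosine-similarity scorer stands in for an LLM-based listwise reranker."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates,
                  key=lambda i: cos(query_emb, passage_embs[i]),
                  reverse=True)
```

In practice the retrieval stage runs over the whole corpus (often with an approximate nearest-neighbor index), and only the short candidate list is passed to the more expensive reranker.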

PE-Rank: Methodology

PE-Rank proposes a significant shift by utilizing passage embeddings instead of full text passages. These embeddings serve as succinct representations, mitigating latency and context length issues:

  1. Context Compression: Passage embeddings are treated as special tokens, enabling their direct input into LLMs. This reduces the input length significantly.
  2. Dynamic-Constrained Decoding (DC Decoding): This novel decoding strategy constrains the decoding space to special tokens representing embeddings, enhancing inference speed by focusing only on relevant tokens.
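The constrained-decoding idea can be sketched as follows, assuming each candidate passage is represented by a known special token id in the vocabulary. The function names and the greedy argmax loop are illustrative assumptions, not the paper's exact procedure; the key point is that each step masks the logits so only the not-yet-ranked passage tokens can be emitted.

```python
import numpy as np

def dc_decode_step(logits, passage_token_ids, already_ranked):
    """One step of dynamically constrained decoding (illustrative):
    restrict the vocabulary to the special tokens of passages that
    have not yet been ranked, then pick the argmax."""
    allowed = [t for t in passage_token_ids if t not in already_ranked]
    mask = np.full_like(logits, -np.inf)
    mask[allowed] = 0.0          # only allowed token ids survive
    return int(np.argmax(logits + mask))

def dc_decode(step_logits, passage_token_ids):
    """Greedily emit a full ranking over the candidate passage tokens."""
    ranked = []
    for logits in step_logits:   # one logits vector per decoding step
        ranked.append(dc_decode_step(logits, passage_token_ids, ranked))
    return ranked
```

Because the decoder only ever chooses among the handful of candidate passage tokens rather than the full vocabulary, each ranking position costs a single constrained step, which is where the decoding speedup comes from.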

Training Process

PE-Rank's training comprises two stages:

  1. Alignment Stage: A two-layer MLP projector maps embeddings from a dense retrieval model into the LLM’s input embedding space, aligning the two representations.
  2. Learning-to-Rank Stage: This employs a dual-strategy training scheme involving token interactions and KL divergence to distill knowledge from detailed textual inputs into their compressed embedding representations, ensuring that the reranker can interpret and utilize passage embeddings effectively.
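The distillation component of the second stage can be sketched as a KL divergence between two listwise score distributions: a teacher scoring candidates from full text and a student scoring them from compressed embeddings. This is a minimal sketch of the general technique; the exact loss formulation, temperatures, and score sources in the paper may differ.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def listwise_kl_loss(student_scores, teacher_scores):
    """KL(teacher || student) over listwise ranking distributions:
    the student (embedding-based reranker) is pushed to match the
    teacher's (text-based) distribution over candidate passages."""
    p = softmax(np.asarray(teacher_scores, dtype=float))
    q = softmax(np.asarray(student_scores, dtype=float))
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when the two score lists induce the same distribution over passages and grows as the student's ranking preferences diverge from the teacher's.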

Experimental Setup and Results

Evaluations were conducted on TREC DL and BEIR benchmarks. PE-Rank demonstrated efficient performance improvement while maintaining effectiveness comparable to state-of-the-art models. Key metrics and findings include:

  • Efficiency: PE-Rank significantly reduced the number of processed and generated tokens. For instance, on the TREC DL19 dataset, PE-Rank reduced latency by a factor of 4.5 compared to uncompressed models.
  • Effectiveness: Experimental results indicated a marginal decrease in ranking performance (less than 2%) when compared to uncompressed methods, highlighting the efficiency gains without substantial compromises on effectiveness.

Implications and Future Research

PE-Rank's approach offers practical improvements for IR systems constrained by computational resources. By simplifying the reranking process while maintaining high accuracy, this method represents a crucial step towards scalable and efficient IR tasks. Future research could explore:

  • Adaptive Embedding Models: Further adaptations of the MLP and LLM to different embedding models, advancing robustness and versatility.
  • Extended Compression Techniques: Innovations in compression strategies to balance between context understanding and efficiency even further.
  • Broader Benchmarking: More extensive evaluations with larger LLMs, embedding models, and diverse datasets to validate the scalability and generalization of PE-Rank.

Overall, PE-Rank serves as a significant stride towards resolving inherent limitations in LLM-based rerankers, promoting a balanced approach between efficiency and effectiveness. This research underscores the potential to innovatively solve latency and context constraints, enhancing the application of LLMs in real-world IR systems.
