Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting

(2306.17563)
Published Jun 30, 2023 in cs.IR, cs.CL, and cs.LG

Abstract

Ranking documents using LLMs by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL 2019&2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably with the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, while outperforming other LLM-based solutions, such as InstructGPT which has 175B parameters, by over 10% for all ranking metrics. By using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines and outperforms the blackbox commercial ChatGPT solution by 4.2% and pointwise LLM-based solutions by more than 10% on average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity.

Overview

  • Introduces Pairwise Ranking Prompting (PRP), a novel technique to improve LLMs' document ranking performance.

  • Highlights the limitations of traditional pointwise and listwise ranking methods with standard LLMs.

  • Describes how PRP simplifies the ranking task by evaluating document pairs in relation to a query, enhancing robustness and reducing complexity.

  • Reports that PRP enables moderately sized, open-source LLMs to match or outperform much larger models on standard benchmarks, including the commercial GPT-4.

  • Demonstrates PRP's potential as an efficient and cost-effective text ranking solution that doesn't require model fine-tuning.

Introduction to Pairwise Ranking Prompting

Finding effective methods for document ranking with LLMs has been a prominent challenge in the field of natural language processing. Document ranking is a specialized task requiring models to order documents by their relevance to a given query. Historically, systems built with LLMs have struggled to match the performance of the more traditional fine-tuned rankers when evaluated on benchmark datasets. This paper introduces a novel technique, Pairwise Ranking Prompting (PRP), which significantly improves LLM performance on document ranking tasks and offers a fresh perspective on efficient ranking with LLMs.

Unpacking the Challenges

Document ranking with LLMs has traditionally followed either pointwise or listwise prompting. Pointwise methods require the LLM to produce calibrated relevance probabilities for each document independently, something off-the-shelf LLMs such as GPT-4 or InstructGPT are not trained to do well. Listwise methods pack all candidates into a single prompt and ask for an ordering, which frequently yields conflicting or redundant outputs such as missing or repeated documents. Both formulations presuppose an understanding of ranking that the pre-training and instruction tuning of popular LLMs generally do not provide; consequently, these LLMs exhibit clear limitations in text ranking. Paraphrased sketches of the two prompt styles appear below.
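The following snippet gives hypothetical paraphrases of the two prompt styles discussed above; the exact wording is illustrative and not the paper's templates.

```python
# Illustrative (not verbatim) paraphrases of pointwise and listwise ranking prompts.

# Pointwise "relevance generation": each document is scored on its own, so the
# LLM's yes/no likelihoods must be calibrated across documents to sort by them.
POINTWISE_PROMPT = """Passage: {passage}
Query: {query}
Does the passage answer the query? Answer Yes or No:"""

# Listwise: all candidates are packed into one prompt and the LLM must emit an
# ordering, which can come back with missing, repeated, or conflicting identifiers.
LISTWISE_PROMPT = """The following are {num} passages, each indicated by a numerical
identifier [1], [2], ...

{numbered_passages}

Rank the passages above based on their relevance to the query: "{query}".
Output the ranking as a list of identifiers, most relevant first:"""
```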

Introduction of PRP

To tackle these issues, PRP simplifies the task for the LLM: given a query, it considers only two documents at a time and asks which of the two is more relevant. This reduced, more intuitive formulation brings several advantages. It works with both generation and scoring APIs; it is far less sensitive to input ordering than listwise prompting, since each pair can be queried in both orders to debias the comparison; and its efficiency-oriented variants deliver competitive results with a number of LLM calls that grows only linearly in the number of candidates. A minimal sketch of the pairwise prompt and its aggregation variants follows.
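Below is a minimal Python sketch of how PRP can be wired up, assuming a generic `llm_generate(prompt) -> str` callable; the prompt wording and helper names are illustrative paraphrases rather than the paper's exact templates. It shows the pairwise comparison queried in both input orders, an all-pairs aggregation (quadratic in the number of candidates), and a single sliding-window pass of the kind the linear-complexity variant repeats K times to refine the top-K results.

```python
# A minimal sketch of Pairwise Ranking Prompting (PRP). `llm_generate` stands in
# for any text-generation API and is an assumption, not the paper's interface.

PAIRWISE_PROMPT = """Given a query "{query}", which of the following two passages
is more relevant to the query?

Passage A: {passage_a}
Passage B: {passage_b}

Output Passage A or Passage B:"""


def pairwise_winner(llm_generate, query, doc_a, doc_b):
    """Query the pair in both (A, B) and (B, A) order to reduce positional bias."""
    out_ab = llm_generate(PAIRWISE_PROMPT.format(query=query, passage_a=doc_a, passage_b=doc_b))
    out_ba = llm_generate(PAIRWISE_PROMPT.format(query=query, passage_a=doc_b, passage_b=doc_a))
    a_wins = ("Passage A" in out_ab) + ("Passage B" in out_ba)
    b_wins = ("Passage B" in out_ab) + ("Passage A" in out_ba)
    if a_wins > b_wins:
        return "A"
    if b_wins > a_wins:
        return "B"
    return "tie"  # inconsistent answers across the two orders; treat as a tie


def prp_allpair(llm_generate, query, docs):
    """All-pairs variant: each document scores one point per pairwise win
    (half a point per tie). Requires O(N^2) LLM calls for N documents."""
    scores = [0.0] * len(docs)
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            winner = pairwise_winner(llm_generate, query, docs[i], docs[j])
            if winner == "A":
                scores[i] += 1.0
            elif winner == "B":
                scores[j] += 1.0
            else:
                scores[i] += 0.5
                scores[j] += 0.5
    order = sorted(range(len(docs)), key=lambda k: scores[k], reverse=True)
    return [docs[k] for k in order]


def prp_sliding_pass(llm_generate, query, docs):
    """One bottom-up sliding-window (bubble-sort-like) pass: compare adjacent
    documents and swap when the lower one wins, so relevant documents move
    toward the top. Repeating the pass K times refines the top-K ranking with
    a number of LLM calls linear in the number of documents per pass."""
    docs = list(docs)
    for i in range(len(docs) - 1, 0, -1):
        if pairwise_winner(llm_generate, query, docs[i - 1], docs[i]) == "B":
            docs[i - 1], docs[i] = docs[i], docs[i - 1]
    return docs
```

The all-pairs aggregation is the most robust to occasional inconsistent comparisons, while the sliding-window pass trades some of that robustness for linear cost per pass.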

Achievements and Comparative Analysis

The results of deploying PRP are striking. The approach propels moderate-sized, open-source LLMs such as the 20B-parameter Flan-UL2 to compete with, and in places outperform, much larger models. For example, PRP performs favorably against the blackbox commercial GPT-4, whose model size is estimated to be roughly 50 times larger, and improves on it by over 5% at NDCG@1 on TREC-DL 2020. It also surpasses InstructGPT, which has 175B parameters, by more than 10% on all standard ranking metrics. These findings substantiate the efficacy of PRP in obtaining strong ranking performance while highlighting its suitability for resource-constrained research.

In summary, the paper demonstrates that PRP is not only effective for zero-shot ranking with LLMs but also a viable alternative that leverages smaller, widely available models. The technique's simplicity, efficiency, and freedom from any model fine-tuning make it attractive for both academic and practical applications in document ranking. It sets a new precedent in text ranking research, showing that simplifying the task for moderate-sized, open-source LLMs can rival the use of massive proprietary models, and it offers the research community an accessible, cost-effective avenue to explore and improve upon.
