Rank-DistiLLM: Closing the Effectiveness Gap Between Cross-Encoders and LLMs for Passage Re-Ranking
Abstract: Cross-encoders distilled from LLMs are often more effective re-rankers than cross-encoders fine-tuned on manually labeled data. However, distilled models do not match the effectiveness of their teacher LLMs. We hypothesize that this effectiveness gap is due to the fact that previous work has not applied the best-suited methods for fine-tuning cross-encoders on manually labeled data (e.g., hard-negative sampling, deep sampling, and listwise loss functions). To close this gap, we create a new dataset, Rank-DistiLLM. Cross-encoders trained on Rank-DistiLLM achieve the effectiveness of LLMs while being up to 173 times faster and 24 times more memory efficient. Our code and data are available at https://github.com/webis-de/ECIR-25.
Explain it Like I'm 14
What is this paper about?
This paper looks at how to make search results better and faster. Imagine you type a question into a search engine and it finds 100 possible answers (passages). The job is to sort these 100 passages so the best ones appear at the top. Big AI models (called LLMs, or large language models) are great at this sorting, but they're slow and expensive. Smaller models (called cross-encoders) are much faster, but they need good training data. The paper shows how to train these smaller models using rankings produced by LLMs so the small models become almost as good as the big ones, while being much faster.
Key questions the paper tries to answer
- Can we train fast, small models (cross-encoders) using rankings from big models (LLMs) so they perform as well as the big models?
- Which training choices matter most: how we pick “tricky negatives,” how deep we rank (how many passages per query we consider), and which loss function we use (a way to measure mistakes during training)?
- How much training data do we actually need to get strong results?
How did the researchers approach this?
Key ideas in everyday terms
- Passage re-ranking: Think of it like sorting a pile of possible answers so the best ones go to the top.
- First-stage retrieval: An initial “filter” that grabs the top 100 likely passages from a huge library.
- Hard negatives: Tricky wrong answers that look similar to the right answer. Training on these makes the model smarter.
- Distillation: The big “teacher” (LLM) shows the small “student” (cross-encoder) how to rank passages. The student learns to mimic the teacher’s rankings.
- Pairwise vs. listwise loss: Two ways to teach ranking. Pairwise compares two items at a time (which should be higher?). Listwise looks at the whole list at once (is the ordering right?).
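To make the pairwise-versus-listwise distinction concrete, here is a minimal Python sketch, not the paper's code: a RankNet-style pairwise loss that compares passages two at a time against the teacher's ordering, and a simple ListNet-style listwise loss used only to illustrate the listwise idea (the paper's own listwise loss, ADR-MSE, is not reproduced here).

```python
import math

def ranknet_pairwise_loss(scores, teacher_order):
    """Pairwise (RankNet-style) loss: for every pair where the teacher ranks
    passage i above passage j, penalize the student if score[i] is not
    clearly larger than score[j]."""
    loss, pairs = 0.0, 0
    for a in range(len(teacher_order)):
        for b in range(a + 1, len(teacher_order)):
            i, j = teacher_order[a], teacher_order[b]   # teacher says i beats j
            margin = scores[i] - scores[j]
            loss += math.log1p(math.exp(-margin))       # -log(sigmoid(margin))
            pairs += 1
    return loss / pairs

def listwise_softmax_loss(scores, teacher_order):
    """Listwise illustration (ListNet-style, not the paper's ADR-MSE): compare
    the student's whole score distribution to a target distribution that puts
    more weight on the teacher's top-ranked passages."""
    target = [0.0] * len(scores)
    for rank, idx in enumerate(teacher_order, start=1):
        target[idx] = 1.0 / rank                        # emphasize the top ranks
    total = sum(target)
    target = [t / total for t in target]
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]               # student softmax
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# Toy example: 4 passages, student scores, teacher ordering 2 > 0 > 3 > 1.
scores = [1.2, -0.4, 2.1, 0.3]
teacher_order = [2, 0, 3, 1]
print(ranknet_pairwise_loss(scores, teacher_order))
print(listwise_softmax_loss(scores, teacher_order))
```

The pairwise loss only ever looks at the relative order of two passages at a time; the listwise loss judges the whole ordering at once, which is exactly the contrast the paper studies.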
What they built
- A new training dataset called Rank-DistiLLM. Here's how they made it (a short pipeline sketch follows after this list):
- Step 1: Use strong first-stage tools (BM25 and ColBERTv2) to fetch the top 100 passages for 10,000 search queries.
- Step 2: Ask a ranking LLM (RankZephyr) to sort these 100 passages.
- Step 3: Create versions of the dataset that only keep the top 10, 25, 50, or 100 passages, to see how “deep” rankings affect training.
- They trained cross-encoder models (monoELECTRA) on this data to mimic the LLM’s rankings.
- They tested different training strategies:
- Which first-stage retriever to use (BM25 vs. ColBERTv2).
- How many passages per query to include.
- Two training styles:
- Single-stage: train only on LLM-made rankings.
- Two-stage: first train on human-labeled data (MS MARCO), then fine-tune on LLM-made rankings.
- Two loss functions:
- Pairwise (RankNet): compare passage pairs.
- A new listwise loss they propose (ADR-MSE): check if the whole ordering matches the teacher, focusing more on getting the top ranks right.
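Stepping back to how the dataset is built (Steps 1-3 above), the whole process can be summarized as a short pipeline sketch. The helpers retrieve_top_k and llm_rerank are hypothetical placeholders for a first-stage retriever (BM25 or ColBERTv2) and a listwise LLM ranker such as RankZephyr; this illustrates the procedure, it is not the released code.

```python
# Hypothetical helpers: retrieve_top_k() stands in for a first-stage retriever
# (BM25 or ColBERTv2), llm_rerank() for a listwise LLM ranker like RankZephyr.
def build_distillation_data(queries, retrieve_top_k, llm_rerank,
                            depth=100, keep_depths=(10, 25, 50, 100)):
    """Sketch of the Rank-DistiLLM-style construction: fetch candidates,
    let the teacher LLM order them, keep truncated versions at several depths."""
    dataset = {k: [] for k in keep_depths}
    for query in queries:
        candidates = retrieve_top_k(query, k=depth)       # Step 1: top-100 candidates
        teacher_ranking = llm_rerank(query, candidates)   # Step 2: LLM sorts them
        for k in keep_depths:                             # Step 3: top-10/25/50/100 versions
            dataset[k].append({"query": query, "ranking": teacher_ranking[:k]})
    return dataset
```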
What did they find, and why does it matter?
- Better starting point leads to better results:
- Using a stronger first-stage retriever (ColBERTv2) to pick passages is crucial. It gives the LLM better candidates to rank, which gives the cross-encoder better training data.
- Depth helps—but there’s a sweet spot:
- Training with rankings of about the top 50 passages per query gave the best results. Going deeper to 100 didn’t help more.
- Simple training works best:
- Their new listwise loss (ADR-MSE) did not beat the simpler pairwise loss (RankNet). This means you don’t need complicated loss functions to get great performance when distilling from LLM rankings.
- Two-stage training wins:
- First train on human-labeled data (MS MARCO), then fine-tune on the LLM rankings (Rank-DistiLLM). This approach matched or slightly beat the LLM's performance in some tests (a small training sketch follows after this list).
- Big wins in speed and efficiency:
- The trained cross-encoder runs in about 300 milliseconds per query versus roughly 25 seconds for the LLM, making it around 80 times faster and practical for real search engines.
- Strong out-of-domain performance:
- On new tasks and datasets (outside the training domain), their cross-encoders held up well—often matching LLMs and beating previous smaller models.
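The winning two-stage recipe can be sketched as below, assuming a cross-encoder that exposes a differentiable score(query, passage) method and a standard PyTorch optimizer; the data formats, loss details, and names are illustrative assumptions, not the paper's actual training code.

```python
import torch

def two_stage_finetune(model, optimizer, msmarco_triples, distillm_rankings):
    """Illustrative two-stage recipe: (1) fine-tune on human-labeled MS MARCO
    triples, (2) distill the LLM teacher's rankings with a pairwise RankNet loss."""
    # Stage 1: human labels (query, relevant passage, sampled negative passage).
    for query, positive, negative in msmarco_triples:
        margin = model.score(query, positive) - model.score(query, negative)
        loss = torch.nn.functional.softplus(-margin)                   # -log(sigmoid(margin))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: LLM rankings (passages already sorted by the teacher, best first).
    for query, passages_in_teacher_order in distillm_rankings:
        scores = torch.stack([model.score(query, p) for p in passages_in_teacher_order])
        diffs = scores.unsqueeze(1) - scores.unsqueeze(0)              # s_i - s_j
        mask = torch.triu(torch.ones_like(diffs), diagonal=1).bool()   # pairs where i is above j
        loss = torch.nn.functional.softplus(-diffs[mask]).mean()       # RankNet over all pairs
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

According to the findings above, the best configuration combined ColBERTv2 as the first-stage retriever, teacher rankings of about 50 passages per query, the pairwise RankNet loss, and this two-stage schedule.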
Why is this important?
- It shows a clear recipe to get near-LLM ranking quality using a fast, affordable model—great for real-world search systems that need speed.
- It proves that certain training choices matter a lot:
- Use strong “hard negatives” from a powerful first-stage retriever.
- Use rankings of around 50 passages per query.
- Pairwise loss is enough; you don’t need fancy listwise losses.
- Two-stage training (human labels then LLM rankings) gives the best results.
- It provides a new public dataset (Rank-DistiLLM) so others can build on this work.
Bottom line and impact
This paper shows how to turn a powerful but slow “teacher” (LLM) into a fast “student” (cross-encoder) that ranks search results almost as well—at a fraction of the cost and time. With the right data and training steps, search engines can deliver high-quality results quickly, making advanced re-ranking practical for real applications. The released dataset and code help the community push this even further.