Can We Use Large Language Models to Fill Relevance Judgment Holes?

(arXiv:2405.05600)
Published May 9, 2024 in cs.IR and cs.CL

Abstract

Incomplete relevance judgments limit the re-usability of test collections. When new systems are compared against the previous systems used to build the pool of judged documents, they often do so at a disadvantage due to the "holes" in the test collection (i.e., pockets of unassessed documents returned by the new system). In this paper, we take initial steps towards extending existing test collections by employing large language models (LLMs) to fill the holes, leveraging existing human judgments to ground the method. We explore this problem in the context of conversational search using TREC iKAT, where information needs are highly dynamic and the responses (and the results retrieved) are much more varied, leaving bigger holes. While previous work has shown that automatic judgments from LLMs result in highly correlated rankings, we find substantially lower correlations when human and automatic judgments are combined (regardless of the LLM, one/two/few-shot prompting, or fine-tuning). We further find that, depending on the LLM employed, new runs will be highly favored (or penalized), and this effect is magnified in proportion to the size of the holes. Instead, the LLM annotations should be generated over the whole document pool to achieve rankings more consistent with those from human-generated labels. Future work on prompt engineering and fine-tuning is required so that LLMs better reflect and represent the human annotations, grounding and aligning the models to make them more fit for purpose.

Figure: Correlation of human and AI-generated judgment rankings using the top runs from TREC iKAT 2023.

Overview

  • The research paper investigates the use of LLMs like ChatGPT and LLaMA to fill relevance judgment gaps in conversational search test collections, gaps that otherwise hinder accurate system evaluation.

  • Findings indicate that while LLMs provide a moderate correlation with human judgments, integrating both human and LLM-generated judgments introduces inconsistencies affecting system evaluations.

  • The study discusses practical and theoretical implications, suggesting cost-efficiency and scalability benefits while pointing out issues of bias and model reliability, and proposes future research directions in prompt engineering, model fine-tuning, and evaluation strategies.

Exploring the Feasibility of Using LLMs for Filling Relevance Judgment Gaps in Conversational Search

Introduction to the Research Study

The research paper explores a critical issue in information retrieval evaluation: the presence of "holes", i.e., unassessed documents that hinder accurate system evaluation. These holes arise when new systems retrieve documents that were not judged when the original pool was built from earlier systems' results. The authors investigate whether LLMs, such as ChatGPT and LLaMA, can effectively supplement these missing relevance judgments.
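
To make the notion of holes concrete, the following is a minimal sketch of how they can be identified for a new run: the documents it retrieves that carry no human judgment in the existing qrels. The file names and the TREC-style run/qrels formats are assumptions for illustration, not artifacts of the paper.

```python
from collections import defaultdict

def load_qrels(path):
    """Map topic_id -> set of judged doc_ids (TREC qrels: topic, iteration, doc, grade)."""
    judged = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _, doc_id, _ = line.split()
            judged[topic].add(doc_id)
    return judged

def load_run(path, depth=100):
    """Map topic_id -> top-`depth` doc_ids (TREC run: topic, Q0, doc, rank, score, tag)."""
    retrieved = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _, doc_id, rank, _, _ = line.split()
            if int(rank) <= depth:
                retrieved[topic].append(doc_id)
    return retrieved

# Hypothetical file names, purely for illustration.
qrels = load_qrels("ikat_2023.qrels")
run = load_run("new_system.run")

# The "holes" for the new run: documents retrieved but never assessed by a human.
holes = {topic: [d for d in docs if d not in qrels.get(topic, set())]
         for topic, docs in run.items()}
print(sum(len(v) for v in holes.values()), "unjudged documents would need labels")
```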

Key Findings from the Study

  • Correlation of LLMs with Human Judgments: The study found that relevance judgments generated by LLMs were moderately correlated with human judgments. However, the correlation dipped when both human and LLM-generated judgments were used together, highlighting a disparity in decision-making criteria.
  • Impact of Holes on Model Evaluation: Researchers noted that the presence and size of holes significantly affected the comparative evaluation of new runs. Larger gaps led to a biased assessment favoring or penalizing new models disproportionately.
  • Consistency Across Assessments: For consistent system ranking, the paper suggests that LLM annotations should ideally cover the entire document pool, not just the unjudged documents. This reduces the inconsistencies introduced by merging human and LLM judgments (see the sketch after this list).
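
The merge-versus-whole-pool distinction can be illustrated with a short sketch: fill each hole with an LLM label (keeping human labels where they exist), evaluate the runs under each judgment condition, and compare the induced system rankings with Kendall's tau. The qrels and per-system scores below are illustrative placeholders, not results from the paper; in practice the scores would come from a toolkit such as pytrec_eval.

```python
from scipy.stats import kendalltau

def fill_holes(human_qrels, llm_qrels):
    """Per topic, keep the human label where it exists and fall back to the LLM label."""
    merged = {}
    for topic in set(human_qrels) | set(llm_qrels):
        merged[topic] = {**llm_qrels.get(topic, {}), **human_qrels.get(topic, {})}
    return merged

# Toy qrels: topic -> {doc_id: grade}. Human labels are sparse; LLM labels cover the pool.
human_qrels = {"1": {"d1": 2, "d2": 0}}
llm_qrels   = {"1": {"d1": 1, "d2": 0, "d3": 3, "d4": 1}}
merged_qrels = fill_holes(human_qrels, llm_qrels)  # {"1": {"d1": 2, "d2": 0, "d3": 3, "d4": 1}}

# Illustrative per-system scores under three judgment conditions (placeholder values).
systems = ["run_a", "run_b", "run_c", "run_d"]
scores_human     = {"run_a": 0.41, "run_b": 0.38, "run_c": 0.35, "run_d": 0.30}
scores_human_llm = {"run_a": 0.43, "run_b": 0.33, "run_c": 0.39, "run_d": 0.31}
scores_llm_only  = {"run_a": 0.45, "run_b": 0.40, "run_c": 0.37, "run_d": 0.32}

tau_merged, _ = kendalltau([scores_human[s] for s in systems],
                           [scores_human_llm[s] for s in systems])
tau_llm, _ = kendalltau([scores_human[s] for s in systems],
                        [scores_llm_only[s] for s in systems])
print(f"tau(human, human + LLM-filled holes) = {tau_merged:.2f}")
print(f"tau(human, LLM over whole pool)      = {tau_llm:.2f}")
```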

Implications for Information Retrieval

Practical Implications

  • Cost Efficiency: Leveraging LLMs for relevance judgments can significantly lower the costs compared to human assessments, given their ability to process large volumes at a relatively swift pace.
  • Scalability: With LLMs, extending existing test collections to include new documents becomes more feasible, potentially increasing the longevity and relevance of these collections.

Theoretical Implications

  • Model Reliability: The variance in LLM performance underscores the need for more robust models that can align closely with human judgment criteria.
  • Bias and Fairness: The study spotlights potential biases in LLM judgments, prompting further exploration into making these models fair and representative of diverse information needs.

Future Directions in AI and Large Language Model Utilization

The paper suggests several avenues for future research:

  1. Prompt Engineering: Improving how LLMs are prompted so that the generated relevance judgments better mimic human assessments (a minimal sketch follows this list).
  2. Model Fine-Tuning: Customizing LLMs to specific IR tasks might yield judgments that are more in sync with human standards.
  3. Comprehensive Evaluation Strategies: Developing methodologies to systematically evaluate and integrate LLM judgments in IR test collections, ensuring reliability across different system evaluations.
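
As a concrete illustration of the first direction, here is a minimal sketch of a prompt-based relevance judgment for a single conversational turn. The prompt wording, the 0-3 grading scale, and the model choice are assumptions for illustration only, not the paper's actual prompt or setup; the OpenAI Python client is used as one possible backend.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are a relevance assessor for conversational search.
Conversation so far:
{context}

Current user utterance:
{utterance}

Candidate passage:
{passage}

On a scale of 0 (not relevant) to 3 (highly relevant), how relevant is the
passage to the current utterance given the conversation so far?
Answer with a single digit."""

def judge(context: str, utterance: str, passage: str) -> int:
    """Ask the LLM for a graded relevance label; fall back to 0 if the reply is unparsable."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            context=context, utterance=utterance, passage=passage)}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    return int(text[0]) if text[:1].isdigit() else 0
```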

Concluding Thoughts

This study provides a valuable exploration of using advanced AI tools to address the persistent challenge of incomplete relevance judgments in IR test collections. While the results affirm the potential utility of LLMs in this context, they also highlight critical concerns about consistency and bias. Ensuring that LLM-generated judgments are reliable and fair remains an imperative goal for future research endeavors in this area.
