A Comparison of Methods for Evaluating Generative IR (2404.04044v2)
Abstract: Information retrieval systems increasingly incorporate generative components. For example, in a retrieval augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, an LLM might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human assessors, but LLMs are increasingly replacing human assessment, producing labels of similar or superior quality to crowdsourced labels. Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels. Along with methods based on binary and graded relevance, we explore methods based on explicit subtopics, pairwise preferences, and embeddings. We first validate these methods against human assessments on several TREC Deep Learning Track tasks; we then apply these methods to evaluate the output of several purely generative systems. For each method we consider both its ability to act autonomously, without the need for human labels or other input, and its ability to support human auditing. To trust these methods, we must be assured that their results align with human assessments. To support that trust, evaluation criteria must be transparent, so that outcomes can be audited by human assessors.
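As a rough illustration of the LLM-based evaluation methods the abstract enumerates (graded relevance labels, pairwise preferences, and embedding-based comparison), the sketch below shows one way such judgments might be obtained for generated responses. The `call_llm` helper, the prompt wording, the 0–3 grading scale, and the `embedding_score` aggregation are assumptions for illustration only; they are not the paper's actual prompts or metrics.

```python
# Minimal sketch of LLM- and embedding-based evaluation of generative IR output.
# call_llm(prompt) -> str is a placeholder for whichever LLM API is available;
# prompts and the 0-3 scale are illustrative assumptions, not the paper's method.
import re
import numpy as np

def call_llm(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM and return its text response."""
    raise NotImplementedError

def graded_relevance(query: str, response: str) -> int:
    """Ask the LLM for a graded relevance label (0-3) for a generated response."""
    prompt = (
        "Rate how well the response answers the query on a 0-3 scale, where "
        "0 = irrelevant and 3 = perfectly relevant. Reply with a single digit.\n"
        f"Query: {query}\nResponse: {response}\nGrade:"
    )
    match = re.search(r"[0-3]", call_llm(prompt))
    return int(match.group()) if match else 0

def pairwise_preference(query: str, response_a: str, response_b: str) -> str:
    """Ask the LLM which of two generated responses better answers the query."""
    prompt = (
        "Which response better answers the query? Reply with exactly 'A' or 'B'.\n"
        f"Query: {query}\nResponse A: {response_a}\nResponse B: {response_b}\nAnswer:"
    )
    answer = call_llm(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"

def embedding_score(response_vec: np.ndarray, relevant_vecs: np.ndarray) -> float:
    """Mean cosine similarity between a response embedding and embeddings of
    known relevant passages, as one possible embedding-based comparison."""
    response_vec = response_vec / np.linalg.norm(response_vec)
    relevant_vecs = relevant_vecs / np.linalg.norm(relevant_vecs, axis=1, keepdims=True)
    return float(np.mean(relevant_vecs @ response_vec))
```

Because the prompts, returned grades, and preference decisions are recorded as plain text, a sketch like this also leaves the kind of human-auditable trail the abstract calls for: an assessor can inspect any individual judgment and the criteria that produced it.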