Seven Failure Points When Engineering a Retrieval Augmented Generation System (2401.05856v1)
Abstract: Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system finds documents that semantically match a query and then passes those documents to a large language model (LLM) such as ChatGPT to extract the right answer. RAG systems aim to: a) reduce the problem of hallucinated responses from LLMs, b) link sources/references to generated responses, and c) remove the need for annotating documents with metadata. However, RAG systems suffer from limitations inherent to information retrieval systems and from their reliance on LLMs. In this paper, we present an experience report on the failure points of RAG systems, drawn from three case studies in separate domains: research, education, and biomedical. We share the lessons learned and present 7 failure points to consider when designing a RAG system. The two key takeaways arising from our work are: 1) validation of a RAG system is only feasible during operation, and 2) the robustness of a RAG system evolves rather than being designed in at the start. We conclude with a list of potential research directions on RAG systems for the software engineering community.