ChatQA: Surpassing GPT-4 on Conversational QA and RAG (2401.10225v5)
Abstract: In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts RAG performance. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to state-of-the-art query-rewriting models while substantially reducing deployment costs. We also present the ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations of RAG, table-related QA, arithmetic calculation, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, slightly outperforms GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on the ChatRAG Bench, without relying on any synthetic data from OpenAI GPT models. Notably, the Llama3-ChatQA-1.5-70B model surpasses GPT-4-Turbo-2024-04-09 in accuracy, achieving a 4.4% improvement. To advance research in this field, we open-source the model weights, instruction tuning data, ChatRAG Bench, and retriever for the community: https://chatqa-project.github.io/.
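The abstract contrasts two ways of handling conversational retrieval: rewriting the latest user turn into a standalone question, versus a dense retriever that encodes the multi-turn history directly (the approach ChatQA takes, avoiding the extra rewriting model at deployment time). The sketch below illustrates only the second idea at a toy scale. It is not the paper's retriever: the `embed` function is a hashed bag-of-words stand-in for a trained dense encoder, and `retrieve` is a hypothetical helper, shown purely to make the "encode the whole history as the query" pattern concrete.

```python
import re
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashed bag-of-words embedding -- a stand-in for a trained
    # dense encoder; real systems would use a learned model here.
    v = np.zeros(dim)
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(history: list[str], passages: list[str], top_k: int = 1):
    # Encode the full multi-turn history as a single query vector,
    # instead of first rewriting the last turn into a standalone
    # question with a separate query-rewriting model.
    query_vec = embed(" ".join(history))
    scores = [float(query_vec @ embed(p)) for p in passages]
    ranked = sorted(range(len(passages)), key=lambda i: -scores[i])
    return [(passages[i], scores[i]) for i in ranked[:top_k]]

history = [
    "What is the capital of France?",
    "How many people live there?",  # "there" only resolves via context
]
passages = [
    "Paris is the capital of France and has about two million residents.",
    "The Nile is the longest river in Africa.",
]
top = retrieve(history, passages)
```

Because the follow-up turn ("there") is ambiguous on its own, the concatenated history supplies the missing context and the France passage scores highest; this is the deployment-cost argument in the abstract, since no second rewriting model is invoked per query.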