KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions (2403.03866v1)
Abstract: LLMs adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. Given a research question, an initial model-generated answer, and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their own outputs successfully followed user instructions, with accuracy at least 10 points below human agreement. Our findings indicate that KIWI will be a valuable resource for measuring progress and improving LLMs' instruction-following capabilities on knowledge-intensive writing tasks.
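As a rough illustration of the interaction data described above, the Python sketch below models one session (a research question, relevant papers, and an initial answer) containing a sequence of revision turns (instruction, model response, human judgment). The field names (`paper_ids`, `followed_instruction`, etc.) are assumptions chosen for readability and do not necessarily match the released KIWI schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    """One interaction turn: a user instruction, the model's revised answer, and a human evaluation."""
    instruction: str                # expert-issued writing instruction for this turn
    model_response: str             # revised long-form answer produced by the LLM
    followed_instruction: bool      # human judgment of whether the response satisfied the instruction
    annotator_comment: Optional[str] = None  # optional free-text rationale from the annotator

@dataclass
class Session:
    """One interaction session grounded in a research question and a set of relevant papers."""
    question: str                   # the research question being answered
    paper_ids: List[str]            # identifiers of the relevant papers provided as context
    initial_answer: str             # initial model-generated answer, before any revisions
    model: str                      # which of the three LLMs was used in this session
    turns: List[Turn] = field(default_factory=list)  # 1,260 such turns across 234 sessions
```

The hierarchical session-to-turns layout simply mirrors the paper's description of iterative revision: each new instruction operates on the answer produced in the previous turn.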