Abstract

LLMs adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. Given a research question, an initial model-generated answer and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer, and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their outputs successfully followed user instructions, with accuracy at least 10 points short of human agreement. Our findings indicate that KIWI will be a valuable resource to measure progress and improve LLMs' instruction-following capabilities for knowledge-intensive writing tasks.

Figure: Distribution of annotator and user ratings for stylistic and information-seeking instructions, and overall session engagement.

Overview

  • The study introduces KIWI, a dataset for evaluating LLMs on revising long-form answers according to user instructions in academic and research settings.

  • Constructed through an interactive system pairing expert annotators with LLMs, KIWI captures the process of refining draft answers to research questions over 234 sessions, yielding 1,260 interaction turns.

  • Analysis reveals that LLMs often struggle to add new information to an existing answer and to execute precise edits as instructed, indicating room for improvement in multi-document summarization and controlled text generation.

  • The insights from KIWI highlight the need for targeted improvements in LLMs for complex editing tasks and offer directions for future research in enhancing AI-powered writing assistants.

Analyzing Instruction-Following in AI Writing Assistants with KIWI

Introduction

Recent advancements in LLMs have significantly impacted various application areas, particularly in providing writing assistance and executing text revisions based on user instructions. Despite their widespread use, our understanding of LLMs' ability to assist with writing in knowledge-intensive domains remains limited. To address this gap, this study introduces KIWI, a dataset designed to evaluate LLMs on a task often encountered in academic and research settings: revising long-form answers in response to user instructions, which may involve adding, editing, or reorganizing information based on a set of scientific documents.

Dataset Construction

KIWI is constructed through an interactive system in which expert annotators with backgrounds in NLP collaborate with LLMs to revise text. Given a research question and a set of relevant scientific papers, an LLM proposes a draft answer. The annotator then iteratively issues instructions to refine the draft, either by integrating additional information or through stylistic edits, until a satisfactory version is produced or a maximum number of iterations is reached. This process was carried out over 234 sessions with three state-of-the-art LLMs, resulting in 1,260 interaction turns.
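To make the collection protocol concrete, below is a minimal Python sketch of one interaction session as described above. All names here (Turn, Session, revise_with_llm, MAX_TURNS) are illustrative assumptions for exposition, not the authors' released code or any specific LLM API.

```python
# Minimal sketch of a KIWI-style interactive revision session.
# Names and the iteration cap are assumptions, not the authors' implementation.
from dataclasses import dataclass, field

MAX_TURNS = 10  # assumed cap on instructions per session


@dataclass
class Turn:
    instruction: str   # user instruction issued this turn
    response: str      # model's revised answer
    rating: str = ""   # human evaluation of the response


@dataclass
class Session:
    question: str                  # research question
    papers: list[str]              # relevant paper passages given as context
    draft: str                     # initial model-generated answer
    turns: list[Turn] = field(default_factory=list)


def revise_with_llm(question: str, papers: list[str],
                    answer: str, instruction: str) -> str:
    """Placeholder for a call to an instruction-following LLM."""
    return answer  # a real system would return the revised answer here


def run_session(session: Session, get_instruction, rate_response) -> Session:
    """Iteratively refine the draft until the annotator accepts it or the cap is hit."""
    answer = session.draft
    for _ in range(MAX_TURNS):
        instruction = get_instruction(answer)   # annotator writes an instruction
        if instruction is None:                 # annotator accepts the current answer
            break
        answer = revise_with_llm(session.question, session.papers,
                                 answer, instruction)
        session.turns.append(Turn(instruction, answer, rate_response(answer)))
    return session
```

In the dataset itself, each recorded turn corresponds to one (instruction, response, human rating) triple, and a session ends either when the annotator is satisfied with the answer or when the iteration cap is reached.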

Key Findings

Analysis of the dataset reveals that even the most capable LLMs often fail to fully address user instructions for text revision. In particular, the study highlights challenges in incorporating new information and in executing edits under explicit instruction constraints (e.g., on specific locations or lengths). Specifically:

  • Incorporating New Information: Models had notable difficulty weaving new information into an existing answer. Synthesizing content from multiple documents into a coherent answer remains hard, pointing to multi-document summarization as an area for improvement.
  • Precise and Constrained Editing: Tasks requiring precise edits or adherence to explicit instruction constraints frequently led to suboptimal performance, exposing limits in models' ability to perform controlled and constrained text generation.
  • Error Analysis: A fine-grained analysis identified common error patterns; the most prevalent was making unrequested changes to the text, underscoring the need for models to better respect the scope set by a user's specific request.

Implications and Future Directions

The insights gleaned from KIWI underscore the need for targeted improvements in LLMs for knowledge-intensive writing tasks. Enhancing models' ability to follow instructions precisely is crucial, particularly in complex editing scenarios that require integrating diverse information sources or adhering to explicit constraints. The consistent error types identified also offer a clear direction for refining models' text revision capabilities.

For future work, leveraging KIWI to train and evaluate LLMs is a promising path. Improvements could include better parsing and interpretation of user instructions, stronger integration of information from multiple documents, and finer control over text manipulation so that outputs adhere closely to user-specified constraints. With such targeted enhancements, highly capable AI-powered writing assistants for academic and research contexts become increasingly attainable.

Conclusion

KIWI provides a valuable resource for understanding and improving the instruction-following capabilities of LLMs in academic writing and revision tasks. By spotlighting existing shortcomings and areas for model advancement, the dataset paves the way for future research aimed at harnessing the full potential of AI to augment human writing.
