Evaluating Very Long-Term Conversational Memory of LLM Agents

(2402.17753)
Published Feb 27, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context LLMs and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.

The framework evaluates models on question answering, event summarization, and multi-modal dialogue generation in very long-term dialogues.

Overview

  • The study introduces the LoCoMo dataset, a benchmark for evaluating the long-term memory of conversational AI with dialogues far exceeding current datasets in length and complexity.

  • It leverages LLMs and a machine-human pipeline to generate and analyze very long-term conversations, including a multi-modal dimension.

  • Three tasks (question answering, event summarization, and multi-modal dialogue generation) are introduced to assess different aspects of long-term memory and understanding in conversational models.

  • Findings indicate that while long-context LLMs show promise, they still fall short of human-level understanding, especially in complex reasoning and consistency over lengthy dialogues.

Evaluating Long-Term Memory Capabilities in LLMs through Extensive Conversational Analysis

Introduction

LLMs have demonstrated remarkable capabilities in generating human-like text across a range of applications. However, their effectiveness in handling very long-term dialogues remains relatively unexplored. To bridge this gap, we present a study that leverages LLM-based agents to generate and analyze very long-term conversations. Through the introduction of the LoCoMo dataset, which consists of dialogues far exceeding the length and complexity of those previously studied, we establish a comprehensive benchmark for evaluating the long-term memory of conversational AI.

The LoCoMo Dataset

The LoCoMo dataset is unique in its depth and breadth, comprising 50 dialogues that average roughly 300 turns and 9,000 tokens each, spread across up to 35 sessions. Unlike existing conversational datasets, LoCoMo incorporates a multi-modal dimension with image sharing and reaction mechanisms, providing a richer context for dialogue. The dataset is generated through a novel machine-human pipeline that ensures high quality, long-range consistency, and grounding to predefined personas and temporal event graphs. These conversations closely emulate real-world interactions, making them a potent resource for researching very long-term memory in conversational agents.
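To make that structure concrete, below is a minimal sketch of how a LoCoMo-style conversation could be represented: personas, dated sessions of turns (optionally carrying a caption standing in for a shared image), and a per-speaker temporal event graph. The class and field names are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    speaker: str
    text: str
    # Caption standing in for a shared image, if the turn is multi-modal.
    image_caption: Optional[str] = None

@dataclass
class Session:
    date: str                                  # e.g. "2023-05-14"; sessions span months
    turns: list[Turn] = field(default_factory=list)

@dataclass
class Event:
    description: str                           # life event the dialogue is grounded on
    date: str
    caused_by: list[int] = field(default_factory=list)  # indices of earlier events

@dataclass
class Conversation:
    personas: dict[str, str]                   # speaker name -> persona statement
    event_graph: dict[str, list[Event]]        # per-speaker temporal event graph
    sessions: list[Session] = field(default_factory=list)

    def num_turns(self) -> int:
        return sum(len(s.turns) for s in self.sessions)
```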

Evaluation Framework

Our evaluation framework introduces three distinct tasks designed to test different facets of long-term memory and understanding within conversational models:

  1. Question Answering Task: This task assesses the model's ability to recall and integrate information across dialogues. It spans five reasoning categories: single-hop, multi-hop, temporal, open-domain knowledge, and adversarial questions (a minimal retrieval-based sketch of this setup follows the list).
  2. Event Summarization Task: This evaluates the model's capacity to comprehend and summarize the causal and temporal dynamics depicted within the conversational event graphs.
  3. Multi-modal Dialogue Generation Task: This measures the model's proficiency in leveraging past dialogue and related context to generate consistent, relevant responses, including turns that share or react to images.
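The sketch below illustrates the retrieval-augmented QA setup the benchmark probes: embed the turns of a long conversation, retrieve those most relevant to a question, and hand only them to a reader LLM. The `embed` stub, the similarity scoring, and the prompt template are illustrative assumptions standing in for whichever retriever and reader a given system actually uses.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder encoder; swap in any sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieve(question: str, turns: list[str], k: int = 5) -> list[str]:
    """Return the k dialogue turns most similar to the question."""
    q = embed(question)
    return sorted(turns, key=lambda t: float(np.dot(embed(t), q)), reverse=True)[:k]

def build_prompt(question: str, turns: list[str], k: int = 5) -> str:
    """Condense a very long conversation into a prompt a reader LLM can answer from."""
    context = "\n".join(retrieve(question, turns, k))
    return (
        "Answer the question using only the retrieved dialogue turns.\n"
        f"Dialogue:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

A long-context model would instead place the entire conversation in the prompt; the benchmark compares both strategies against human performance.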

Experimental Findings

Our experimental analysis reveals several insights into the current state of LLMs in comprehending and remembering information over long dialogues. While long-context LLMs and RAG strategies show promise, particularly in improving QA performance, they still substantially fall short of human-level understanding, especially in tasks requiring sophisticated temporal reasoning and the integration of complex dialogue history. Key findings include:

  • Long-context LLMs and RAG offer improvements in QA tasks but lag significantly in areas such as adversarial questioning and event graph summarization.
  • Base LLMs struggle with maintaining consistency over lengthy dialogues, often failing to correctly utilize their context.
  • Incorporating elements from the multi-modal dialogues enhances conversational agents' ability to produce more relevant and consistent outputs.

Future Directions

The research underscores the need for further advancements in LLMs to effectively model and understand the intricacies of very long-term conversational memory. Future developments may focus on enhancing contextual understanding and the integration of multi-modal data. Additionally, exploring methods to improve the robustness of conversational agents against adversarial inputs and to better capture temporal and causal relationships in dialogues could be fruitful avenues.

Conclusion

Our study pushes the boundary of current conversational AI research by focusing on very long-term dialogues and introducing the LoCoMo dataset as a benchmark for evaluating the long-term memory capabilities of LLMs. The findings highlight significant challenges in modeling extensive conversational contexts and point towards the necessity for novel methods that can effectively manage and utilize long-term conversational memories.
