Evaluating Very Long-Term Conversational Memory of LLM Agents

(2402.17753)
Published Feb 27, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context LLMs and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.

The framework evaluates models on question answering, event summarization, and multi-modal dialogue generation in very long-term dialogues.

Overview

  • The study introduces the LoCoMo dataset, a benchmark for evaluating the long-term memory of conversational AI with dialogues far exceeding current datasets in length and complexity.

  • It leverages LLMs and a machine-human pipeline to generate and analyze very long-term conversations, including a multi-modal dimension.

  • Three tasks (question answering, event summarization, and multi-modal dialogue generation) are introduced to assess different aspects of long-term memory and understanding in conversational models.

  • Findings indicate that while long-context LLMs show promise, they still fall short of human-level understanding, especially in complex reasoning and consistency over lengthy dialogues.

Evaluating Long-Term Memory Capabilities in LLMs through Extensive Conversational Analysis

Introduction

LLMs have demonstrated remarkable capabilities in generating human-like text across a range of applications. However, their effectiveness in handling very long-term dialogues remains relatively unexplored. To bridge this gap, we present a study that leverages LLM-based agents to generate and analyze very long-term conversations. Through the introduction of the LoCoMo dataset, which consists of dialogues far exceeding the length and complexity of those previously studied, we establish a comprehensive benchmark for evaluating the long-term memory of conversational AI.

The LoCoMo Dataset

The LoCoMo dataset is unique in its depth and breadth, comprising 50 dialogues that average roughly 300 turns and 9,000 tokens each, spread across up to 35 sessions. Unlike existing conversational datasets, LoCoMo incorporates a multi-modal dimension with image sharing and reaction mechanisms, providing a richer context for dialogue. The dataset is generated through a novel machine-human pipeline that ensures high quality, long-range consistency, and grounding to predefined personas and temporal event graphs. These conversations closely emulate real-world interactions, making them a potent resource for researching very long-term memory in conversational agents.
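To make that structure concrete, below is a minimal sketch of how a LoCoMo-style conversation could be represented: personas, dated sessions of turns (optionally carrying a caption standing in for a shared image), and a per-speaker temporal event graph. The class and field names are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    speaker: str
    text: str
    # Caption standing in for a shared image, if the turn is multi-modal.
    image_caption: Optional[str] = None

@dataclass
class Session:
    date: str                                  # e.g. "2023-05-14"; sessions span months
    turns: list[Turn] = field(default_factory=list)

@dataclass
class Event:
    description: str                           # life event the dialogue is grounded on
    date: str
    caused_by: list[int] = field(default_factory=list)  # indices of earlier events

@dataclass
class Conversation:
    personas: dict[str, str]                   # speaker name -> persona statement
    event_graph: dict[str, list[Event]]        # per-speaker temporal event graph
    sessions: list[Session] = field(default_factory=list)

    def num_turns(self) -> int:
        return sum(len(s.turns) for s in self.sessions)
```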

Evaluation Framework

Our evaluation framework introduces three distinct tasks designed to test different facets of long-term memory and understanding within conversational models:

  1. Question Answering Task: This task assesses the model's ability to recall and integrate information across dialogues. It spans five reasoning categories: single-hop, multi-hop, temporal, open-domain knowledge, and adversarial questions (a minimal retrieval-based sketch of this setup follows the list).
  2. Event Summarization Task: This evaluates the model's capacity to comprehend and summarize the causal and temporal dynamics depicted within the conversational event graphs.
  3. Multi-modal Dialogue Generation Task: This measures the model's proficiency in leveraging past dialogue and related context to generate consistent, relevant responses, including turns that share or react to images.
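The sketch below illustrates the retrieval-augmented QA setup the benchmark probes: embed the turns of a long conversation, retrieve those most relevant to a question, and hand only them to a reader LLM. The `embed` stub, the similarity scoring, and the prompt template are illustrative assumptions standing in for whichever retriever and reader a given system actually uses.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder encoder; swap in any sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieve(question: str, turns: list[str], k: int = 5) -> list[str]:
    """Return the k dialogue turns most similar to the question."""
    q = embed(question)
    return sorted(turns, key=lambda t: float(np.dot(embed(t), q)), reverse=True)[:k]

def build_prompt(question: str, turns: list[str], k: int = 5) -> str:
    """Condense a very long conversation into a prompt a reader LLM can answer from."""
    context = "\n".join(retrieve(question, turns, k))
    return (
        "Answer the question using only the retrieved dialogue turns.\n"
        f"Dialogue:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

A long-context model would instead place the entire conversation in the prompt; the benchmark compares both strategies against human performance.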

Experimental Findings

Our experimental analysis reveals several insights into the current state of LLMs in comprehending and remembering information over long dialogues. While long-context LLMs and RAG strategies show promise, particularly in improving QA performance, they still substantially fall short of human-level understanding, especially in tasks requiring sophisticated temporal reasoning and the integration of complex dialogue history. Key findings include:

  • Long-context LLMs and RAG offer improvements in QA tasks but lag significantly in areas such as adversarial questioning and event graph summarization.
  • Base LLMs struggle with maintaining consistency over lengthy dialogues, often failing to correctly utilize their context.
  • Incorporating elements from the multi-modal dialogues enhances conversational agents' ability to produce more relevant and consistent outputs.

Future Directions

The research underscores the need for further advancements in LLMs to effectively model and understand the intricacies of very long-term conversational memory. Future developments may focus on enhancing contextual understanding and the integration of multi-modal data. Additionally, exploring methods to improve the robustness of conversational agents against adversarial inputs and to better capture temporal and causal relationships in dialogues could be fruitful avenues.

Conclusion

Our study pushes the boundary of current conversational AI research by focusing on very long-term dialogues and introducing the LoCoMo dataset as a benchmark for evaluating the long-term memory capabilities of LLMs. The findings highlight significant challenges in modeling extensive conversational contexts and point towards the necessity for novel methods that can effectively manage and utilize long-term conversational memories.
