Emergent Mind

MIRAI: Evaluating LLM Agents for Event Forecasting

(2407.01231)
Published Jul 1, 2024 in cs.CL and cs.AI

Abstract

Recent advancements in LLMs have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, increasing interest has been placed in employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, there is no rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs that enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code with domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge of diverse formats and time spans to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.

Figure: Global event hierarchy, intensity heatmap, and frequency distribution highlighting conflict and mediation areas.

Overview

  • The paper introduces MIRAI, a benchmarking framework designed to evaluate Large Language Model (LLM) agents on their ability to forecast international events using structured historical data and news articles.

  • MIRAI focuses on evaluating the integration of information from extensive global databases, the use of APIs for code execution, and joint reasoning for accurate event prediction.

  • Key findings reveal the difficulty of long-term forecasting, the benefit of combining diverse data sources, and that stronger LLMs perform better when issuing multi-line "Code Block" actions.

MIRAI: Evaluating LLM Agents for Event Forecasting

The paper introduces MIRAI, a sophisticated benchmark for evaluating Large Language Model (LLM) agents in the context of forecasting international events. This benchmark addresses the need for a systematic evaluation framework given the recent advancements in LLM agents' ability to autonomously collect and reason over world information. International event forecasting, being pivotal for decision-making and policy development, calls for rigorous assessments of LLM agents' capabilities.

Benchmark Design and Capabilities

MIRAI provides an agentic environment encompassing tools for accessing a rich database of historical, structured events and textual news articles, predominantly sourced from the Global Database of Events, Language, and Tone (GDELT). The benchmark involves relational prediction tasks that vary in the forecasting horizon, allowing the evaluation of LLM agents on both short-term and long-term forecasting capabilities. Three key dimensions are evaluated:

  1. The ability to source and integrate information from large global databases.
  2. Proficiency in using domain-specific APIs and libraries to write functional code.
  3. Competence in joint reasoning over diverse historical knowledge formats to accurately predict future events.
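The second dimension, writing code against domain-specific APIs, can be illustrated with a small sketch. The `get_events` function and its signature below are hypothetical stand-ins for the benchmark's tool interface, not MIRAI's actual API:

```python
import collections

# Hypothetical stand-in for a MIRAI-style database query; the name,
# signature, and returned (date, CAMEO-code) pairs are illustrative.
def get_events(subject, obj, start, end):
    return [("2023-10-05", "042"),
            ("2023-10-19", "046"),
            ("2023-11-01", "042")]

# Aggregate relation counts over a window, as an agent might do
# before committing to a forecast.
counts = collections.Counter(
    rel for _, rel in get_events("USA", "CHN", "2023-10-01", "2023-11-01")
)
print(counts.most_common(1))  # → [('042', 2)]
```

An agent that can compose queries like this in code, rather than issuing one opaque tool call at a time, can aggregate evidence across many events in a single step.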

Dataset and Task Formulation

MIRAI represents international events as quadruples (date, subject country, relation, object country), with relations drawn from the two hierarchical levels of the CAMEO ontology. This structure supports a detailed examination of geopolitical dynamics. The benchmark uses processed GDELT data from January 2023 to November 2023, with a test set curated from November 2023 containing 705 query instances, plus a balanced 100-query subset. Forecasting tasks center on predicting the relation between a pair of countries, leveraging both statistical and real-world contextual analyses.
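The quadruple representation above can be sketched as a simple data structure. The field names are assumptions for illustration, not MIRAI's schema; the example CAMEO codes ("04" Consult, "042" Make a visit) follow the CAMEO codebook:

```python
from dataclasses import dataclass
from datetime import date

# Minimal sketch of the (date, subject, relation, object) quadruple;
# field names are illustrative, not MIRAI's actual schema.
@dataclass(frozen=True)
class Event:
    day: date        # event date
    subject: str     # subject (acting) country
    relation: str    # CAMEO code, e.g. "04" (Consult) at the first
                     # level or "042" (Make a visit) at the second
    obj: str         # object (receiving) country

e = Event(date(2023, 11, 3), "USA", "042", "CHN")

# CAMEO's hierarchy means a second-level code refines a first-level
# one: the leading two digits give the parent relation.
assert e.relation[:2] == "04"
```

This nesting of second-level codes under first-level ones is what allows the benchmark to score predictions at both granularities.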

Agent Interaction and Evaluation

MIRAI supports LLM-agent interactions through defined Python APIs, enabling detailed manipulation and querying of the underlying databases. The evaluation framework follows the ReAct strategy, with iterative Think, Act, and Observe steps to foster deliberate reasoning. Two action types are investigated: straightforward "Single Function" calls and flexible multi-line "Code Block" actions. Agents are evaluated with the following metrics:

  • F1 Scores are calculated for both first and second-level relation predictions.
  • Empirical Kullback-Leibler (KL) divergence metrics assess discrepancies between predicted and actual relation distributions.
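The KL-divergence metric compares the empirical distribution of predicted relations against that of the actual ones. The sketch below is one plausible reading of this metric; the smoothing constant and the direction KL(actual || predicted) are assumptions of this illustration:

```python
import math
from collections import Counter

def empirical_kl(actual, predicted, eps=1e-9):
    """KL divergence between empirical relation distributions.

    The direction KL(actual || predicted) and the eps smoothing
    (to avoid log(0) on unseen codes) are assumptions of this sketch.
    """
    p, q = Counter(actual), Counter(predicted)
    n_p, n_q = sum(p.values()), sum(q.values())
    kl = 0.0
    for rel in p:                     # terms with p(rel) == 0 contribute 0
        p_r = p[rel] / n_p
        q_r = max(q[rel] / n_q, eps)  # smooth codes the agent never predicted
        kl += p_r * math.log(p_r / q_r)
    return kl

# A perfect match yields zero divergence.
print(empirical_kl(["042", "042", "046"], ["042", "042", "046"]))  # → 0.0
```

Unlike F1, which scores each query independently, a distributional metric like this penalizes agents that collapse onto a few frequent relation codes.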

Experimental Findings and Insights

The experiments underscore the complexity of MIRAI's forecasting tasks. Notably, the highest-performing model achieved an F1 score of 29.6 for second-level relation prediction, indicating a high level of difficulty. Key findings include:

  1. Code Blocks benefit robust LLMs: Stronger models like GPT-4 and its variants perform better with "Code Block" actions, suggesting these models’ ability to generate coherent and effective multi-line code.
  2. Diverse information gathering is critical: Models leveraging diverse APIs (both Event and News datasets) performed significantly better than those restricted to a single type of data source.
  3. Temporal distance challenges: Longer forecasting horizons (30 or 90 days) introduce significant challenges, demonstrating that near-term events are easier to predict accurately.
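The Think/Act/Observe loop and the "Code Block" action type referenced in finding 1 can be sketched schematically. The agent and environment below are toy stand-ins, not MIRAI's implementation:

```python
# Toy sketch of a ReAct-style loop (Think, Act, Observe); in MIRAI,
# the environment executes agent-written code against its databases.
def run_agent(llm, env, max_steps=5):
    transcript = []
    for step in range(max_steps):
        thought = llm(step, "think")   # free-form reasoning
        action = llm(step, "act")      # a "Single Function" call or a
                                       # multi-line "Code Block"
        observation = env.execute(action)
        transcript += [("Think", thought),
                       ("Act", action),
                       ("Observe", observation)]
        if action.startswith("FINAL"): # the agent committed to a forecast
            break
    return transcript

class StubEnv:
    def execute(self, action):
        return "events table" if "get_events" in action else "ok"

def stub_llm(step, mode):
    if mode == "think":
        return "Retrieve recent events, then forecast."
    # Step 0 issues a tool call; step 1 commits to a relation forecast.
    return "get_events('USA', 'CHN')" if step == 0 else "FINAL 04"

trace = run_agent(stub_llm, StubEnv())
print(len(trace))  # → 6 (two Think/Act/Observe iterations)
```

In a "Code Block" setting, the `action` string would contain several lines of Python rather than one call, which is where stronger models gain their advantage per finding 1.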

Implications and Future Directions

MIRAI sets a precedent in the rigorous benchmarking of LLM agents in temporal event forecasting, highlighting substantial gaps in current capabilities, especially for long-term and fine-grained predictions. The critical insights derived from this benchmark are instrumental in guiding future research efforts in enhancing LLMs' temporal reasoning and tool-use efficiency.

From a practical perspective, MIRAI can catalyze the development of more reliable AI models for geopolitical analysis, enabling stakeholders to innovate in international relations strategies. Future research may involve expanding the API functionalities to encompass additional data types such as time-series or multimodal information, thus offering a more comprehensive assessment of LLM agents.

Overall, MIRAI positions itself as an essential toolkit for advancing LLM-based forecasting frameworks, enriching the academic and practical exploration of AI's role in understanding and predicting complex geopolitical interactions.
