Emergent Mind

MIRAI: Evaluating LLM Agents for Event Forecasting

(2407.01231)
Published Jul 1, 2024 in cs.CL and cs.AI

Abstract

Recent advancements in LLMs have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, increasing interest has been placed in employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, there is no rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs that enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code with domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge of diverse formats and time spans to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.

Figure: Global event hierarchy, intensity heatmap, and frequency distribution highlighting conflict and mediation areas.

Overview

  • The paper introduces MIRAI, a benchmarking framework designed to evaluate Large Language Model (LLM) agents on their ability to forecast international events using structured historical data and news articles.

  • MIRAI focuses on evaluating the integration of information from extensive global databases, the use of APIs for code execution, and joint reasoning for accurate event prediction.

  • Key findings reveal the difficulty of long-term forecasting, the benefit of combining diverse data sources, and that stronger LLMs perform better when issuing multi-line "Code Block" actions.

MIRAI: Evaluating LLM Agents for Event Forecasting

The paper introduces MIRAI, a sophisticated benchmark for evaluating Large Language Model (LLM) agents in the context of forecasting international events. This benchmark addresses the need for a systematic evaluation framework given the recent advancements in LLM agents' ability to autonomously collect and reason over world information. International event forecasting, being pivotal for decision-making and policy development, calls for rigorous assessments of LLM agents' capabilities.

Benchmark Design and Capabilities

MIRAI provides an agentic environment encompassing tools for accessing a rich database of historical, structured events and textual news articles, predominantly sourced from the Global Database of Events, Language, and Tone (GDELT). The benchmark involves relational prediction tasks that vary in the forecasting horizon, allowing the evaluation of LLM agents on both short-term and long-term forecasting capabilities. Three key dimensions are evaluated:

  1. The ability to source and integrate information from large global databases.
  2. Proficiency in using domain-specific APIs and libraries to write functional code.
  3. Competence in joint reasoning over diverse historical knowledge formats to accurately predict future events.
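The second dimension, writing code against domain-specific APIs, can be illustrated with a small sketch. The `get_events` function and its signature below are hypothetical stand-ins for the benchmark's tool interface, not MIRAI's actual API:

```python
import collections

# Hypothetical stand-in for a MIRAI-style database query; the name,
# signature, and returned (date, CAMEO-code) pairs are illustrative.
def get_events(subject, obj, start, end):
    return [("2023-10-05", "042"),
            ("2023-10-19", "046"),
            ("2023-11-01", "042")]

# Aggregate relation counts over a window, as an agent might do
# before committing to a forecast.
counts = collections.Counter(
    rel for _, rel in get_events("USA", "CHN", "2023-10-01", "2023-11-01")
)
print(counts.most_common(1))  # → [('042', 2)]
```

An agent that can compose queries like this in code, rather than issuing one opaque tool call at a time, can aggregate evidence across many events in a single step.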

Dataset and Task Formulation

MIRAI represents international events as quadruples (date, subject country, relation, object country), with relations drawn from the two hierarchical levels of the CAMEO ontology. This structure supports a detailed examination of geopolitical dynamics. The benchmark uses processed GDELT data from January 2023 to November 2023, with a test set curated from November 2023 containing 705 query instances, plus a balanced 100-query subset. Forecasting tasks center on predicting the relation between a pair of countries, leveraging both statistical and real-world contextual analyses.
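The quadruple representation above can be sketched as a simple data structure. The field names are assumptions for illustration, not MIRAI's schema; the example CAMEO codes ("04" Consult, "042" Make a visit) follow the CAMEO codebook:

```python
from dataclasses import dataclass
from datetime import date

# Minimal sketch of the (date, subject, relation, object) quadruple;
# field names are illustrative, not MIRAI's actual schema.
@dataclass(frozen=True)
class Event:
    day: date        # event date
    subject: str     # subject (acting) country
    relation: str    # CAMEO code, e.g. "04" (Consult) at the first
                     # level or "042" (Make a visit) at the second
    obj: str         # object (receiving) country

e = Event(date(2023, 11, 3), "USA", "042", "CHN")

# CAMEO's hierarchy means a second-level code refines a first-level
# one: the leading two digits give the parent relation.
assert e.relation[:2] == "04"
```

This nesting of second-level codes under first-level ones is what allows the benchmark to score predictions at both granularities.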

Agent Interaction and Evaluation

MIRAI supports LLM-agent interactions through defined Python APIs, enabling detailed manipulation and querying of the underlying databases. The evaluation framework follows the ReAct strategy, with iterative Think, Act, and Observe steps to foster deliberate reasoning. Two action types are investigated: straightforward "Single Function" calls and flexible multi-line "Code Block" actions. Agents are evaluated with the following metrics:

  • F1 Scores are calculated for both first and second-level relation predictions.
  • Empirical Kullback-Leibler (KL) divergence metrics assess discrepancies between predicted and actual relation distributions.
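The KL-divergence metric compares the empirical distribution of predicted relations against that of the actual ones. The sketch below is one plausible reading of this metric; the smoothing constant and the direction KL(actual || predicted) are assumptions of this illustration:

```python
import math
from collections import Counter

def empirical_kl(actual, predicted, eps=1e-9):
    """KL divergence between empirical relation distributions.

    The direction KL(actual || predicted) and the eps smoothing
    (to avoid log(0) on unseen codes) are assumptions of this sketch.
    """
    p, q = Counter(actual), Counter(predicted)
    n_p, n_q = sum(p.values()), sum(q.values())
    kl = 0.0
    for rel in p:                     # terms with p(rel) == 0 contribute 0
        p_r = p[rel] / n_p
        q_r = max(q[rel] / n_q, eps)  # smooth codes the agent never predicted
        kl += p_r * math.log(p_r / q_r)
    return kl

# A perfect match yields zero divergence.
print(empirical_kl(["042", "042", "046"], ["042", "042", "046"]))  # → 0.0
```

Unlike F1, which scores each query independently, a distributional metric like this penalizes agents that collapse onto a few frequent relation codes.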

Experimental Findings and Insights

The experiments underscore the complexity of MIRAI's forecasting tasks. Notably, the highest-performing model achieved an F1 score of 29.6 for second-level relation prediction, indicating a high level of difficulty. Key findings include:

  1. Code Blocks benefit robust LLMs: Stronger models like GPT-4 and its variants perform better with "Code Block" actions, suggesting these models’ ability to generate coherent and effective multi-line code.
  2. Diverse information gathering is critical: Models leveraging diverse APIs (both Event and News datasets) performed significantly better than those restricted to a single type of data source.
  3. Temporal distance challenges: Longer forecasting horizons (30 or 90 days) introduce significant challenges, demonstrating that near-term events are easier to predict accurately.
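The Think/Act/Observe loop and the "Code Block" action type referenced in finding 1 can be sketched schematically. The agent and environment below are toy stand-ins, not MIRAI's implementation:

```python
# Toy sketch of a ReAct-style loop (Think, Act, Observe); in MIRAI,
# the environment executes agent-written code against its databases.
def run_agent(llm, env, max_steps=5):
    transcript = []
    for step in range(max_steps):
        thought = llm(step, "think")   # free-form reasoning
        action = llm(step, "act")      # a "Single Function" call or a
                                       # multi-line "Code Block"
        observation = env.execute(action)
        transcript += [("Think", thought),
                       ("Act", action),
                       ("Observe", observation)]
        if action.startswith("FINAL"): # the agent committed to a forecast
            break
    return transcript

class StubEnv:
    def execute(self, action):
        return "events table" if "get_events" in action else "ok"

def stub_llm(step, mode):
    if mode == "think":
        return "Retrieve recent events, then forecast."
    # Step 0 issues a tool call; step 1 commits to a relation forecast.
    return "get_events('USA', 'CHN')" if step == 0 else "FINAL 04"

trace = run_agent(stub_llm, StubEnv())
print(len(trace))  # → 6 (two Think/Act/Observe iterations)
```

In a "Code Block" setting, the `action` string would contain several lines of Python rather than one call, which is where stronger models gain their advantage per finding 1.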

Implications and Future Directions

MIRAI sets a precedent in the rigorous benchmarking of LLM agents in temporal event forecasting, highlighting substantial gaps in current capabilities, especially for long-term and fine-grained predictions. The critical insights derived from this benchmark are instrumental in guiding future research efforts in enhancing LLMs' temporal reasoning and tool-use efficiency.

From a practical perspective, MIRAI can catalyze the development of more reliable AI models for geopolitical analysis, enabling stakeholders to innovate in international relations strategies. Future research may involve expanding the API functionalities to encompass additional data types such as time-series or multimodal information, thus offering a more comprehensive assessment of LLM agents.

Overall, MIRAI positions itself as an essential toolkit for advancing LLM-based forecasting frameworks, enriching the academic and practical exploration of AI's role in understanding and predicting complex geopolitical interactions.
