Abstract

Multi-relational temporal graphs are powerful tools for modeling real-world data, capturing the evolving and interconnected nature of entities over time. Recently, many novel models have been proposed for ML on such graphs, intensifying the need for robust evaluation and standardized benchmark datasets. However, the availability of such resources remains scarce, and evaluation faces added complexity due to reproducibility issues in experimental protocols. To address these challenges, we introduce Temporal Graph Benchmark 2.0 (TGB 2.0), a novel benchmarking framework tailored for evaluating methods for predicting future links on Temporal Knowledge Graphs and Temporal Heterogeneous Graphs, with a focus on large-scale datasets, extending the Temporal Graph Benchmark. TGB 2.0 facilitates comprehensive evaluations by presenting eight novel datasets spanning five domains with up to 53 million edges. TGB 2.0 datasets are significantly larger than existing datasets in terms of number of nodes, edges, or timestamps. In addition, TGB 2.0 provides a reproducible and realistic evaluation pipeline for multi-relational temporal graphs. Through extensive experimentation, we observe that 1) leveraging edge-type information is crucial to obtain high performance, 2) simple heuristic baselines are often competitive with more complex methods, and 3) most methods fail to run on our largest datasets, highlighting the need for research on more scalable methods.

Overview

  • TGB 2.0 introduces eight new large and diverse datasets for temporal knowledge graphs (TKGs) and temporal heterogeneous graphs (THGs), spanning five different domains and significantly surpassing the size of existing datasets.

  • The benchmark provides a realistic and reproducible evaluation pipeline, addressing inconsistencies in current methodologies and incorporating rigorous ranking metrics and sampling strategies to facilitate robust comparisons across methods.

  • Experimental insights reveal the importance of leveraging edge-type information, the competitive performance of heuristic baselines, and the scalability issues of current methods, particularly with larger datasets.

Review of TGB 2.0: A Comprehensive Benchmark for Temporal Knowledge Graphs and Heterogeneous Graphs

Temporal knowledge graphs (TKGs) and temporal heterogeneous graphs (THGs) are critical tools for modeling dynamic, multi-relational data. Addressing the pressing need for robust evaluation and standardized benchmark datasets, the paper "TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs" introduces the Temporal Graph Benchmark 2.0 (TGB 2.0). This new benchmark is designed to facilitate the comparative analysis of machine learning methods for predicting future links in large-scale temporal graphs.

Key Contributions

  1. Large and Diverse Dataset Collection: TGB 2.0 introduces eight new datasets, categorized into four TKGs and four THGs, spanning five distinct domains. These datasets significantly surpass the size of existing ones, encompassing up to 53 million edges and covering domains from socio-political networks to software interactions. The datasets vary widely in node count, edge count, and timestamp granularity, ranging from yearly increments to second-level interactions.
  2. Realistic and Reproducible Evaluation Pipeline: TGB 2.0 addresses evaluation inconsistencies prevalent in current methodologies. Its automated pipeline ensures reproducible benchmarking and includes a rigorous ranking metric, Mean Reciprocal Rank (MRR), together with negative sampling strategies that incorporate edge-type information. This setup mitigates overly optimistic performance assessments and facilitates robust, fair comparisons across methods (a minimal MRR sketch follows this list).
  3. Experimental Insights: Through extensive experiments on the TGB 2.0 datasets, three main insights were uncovered:
  • Leveraging edge-type information is crucial for achieving high performance.
  • Simple heuristic baselines often compete well against more complex models.
  • Current methods struggle to scale: most fail to run on the largest datasets due to out-of-memory (OOM) or out-of-time (OOT) errors.
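
To make the ranking protocol concrete, here is a minimal sketch of how MRR is computed from per-query ranks; the function and variable names are illustrative, not TGB 2.0's actual code.

```python
import numpy as np

def mean_reciprocal_rank(ranks: np.ndarray) -> float:
    """Compute MRR from the 1-indexed rank of each true (positive) edge.

    ranks[i] is the position of the i-th query's true edge after sorting
    it against its negative candidates by predicted score.
    """
    return float(np.mean(1.0 / ranks))

# Example: three queries whose true edges ranked 1st, 4th, and 2nd.
print(mean_reciprocal_rank(np.array([1, 4, 2])))  # (1 + 0.25 + 0.5) / 3 ≈ 0.583
```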

Dataset Details and Statistics

The datasets provided by TGB 2.0 are meticulously detailed in terms of their characteristics and intended use:

  • TKGs: Includes tkgl-smallpedia, tkgl-polecat, tkgl-icews, and tkgl-wikidata, representing domains from knowledge bases to socio-political events.
  • THGs: Comprising thgl-software, thgl-forum, thgl-myket, and thgl-github, these datasets cover software and social interactions. Both graph types are represented as timestamped, typed edges, as sketched below.
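
For concreteness, a TKG edge is a timestamped quadruple (head, relation, tail, time), and THG edges carry node and edge types in a similar fashion. Below is a minimal, library-free sketch of this representation; the class and field names are illustrative, not the benchmark's actual loader API.

```python
from typing import NamedTuple

class Quadruple(NamedTuple):
    head: int  # subject entity id
    rel: int   # relation (edge-type) id
    tail: int  # object entity id
    ts: int    # discrete timestamp (e.g., yearly granularity in some datasets)

# A toy multi-relational temporal graph as a list of quadruples.
edges = [
    Quadruple(0, 2, 5, 2019),
    Quadruple(0, 2, 5, 2020),  # the same fact recurring at a later time
    Quadruple(3, 1, 0, 2020),
]
```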

Each dataset is split chronologically into training, validation, and test sets, following a standard 70/15/15% split to preserve temporal order in evaluation. Detailed statistics capture the diversity in recurring relations, edge distributions, and inductive (previously unseen) nodes.
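
A chronological split of this kind can be reproduced in a few lines. The following is a generic sketch, not TGB 2.0's exact splitting code:

```python
import numpy as np

def chronological_split(timestamps: np.ndarray, val_frac: float = 0.15,
                        test_frac: float = 0.15):
    """Split edge indices into train/val/test sets ordered by time."""
    order = np.argsort(timestamps, kind="stable")  # oldest edges first
    n = len(order)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = order[: n - n_val - n_test]
    val = order[n - n_val - n_test : n - n_test]
    test = order[n - n_test :]
    return train, val, test
```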

Experimental Protocol

The evaluation focuses on the dynamic link prediction task, treated as a ranking problem. The benchmark uses two negative sampling strategies, 1-vs-all and 1-vs-q, and removes temporal conflicts (negative candidates that are in fact true edges at the query timestamp) for correctness. Methods that exceed 40 GB of GPU memory or run for more than a week are marked OOM or OOT, respectively. These constraints allow methods to be benchmarked at scale under realistic and fair evaluation conditions.
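
The 1-vs-q strategy can be sketched as follows: for each positive quadruple, sample q negative tails, filtering out any candidate that forms a true edge at the same timestamp. This is a minimal illustration under assumed data structures, not the benchmark's actual sampler.

```python
import numpy as np

def sample_negatives(head, rel, ts, num_nodes, q, true_tails, rng):
    """Draw up to q negative tails for the query (head, rel, ?, ts),
    excluding temporal conflicts: tails that truly co-occur with
    (head, rel) at timestamp ts anywhere in the graph."""
    conflicts = true_tails.get((head, rel, ts), set())
    candidates = rng.permutation(num_nodes)  # random order over all nodes
    return np.array([t for t in candidates if t not in conflicts][:q])

# true_tails maps (head, rel, ts) -> set of ground-truth tails.
true_tails = {(0, 2, 2020): {5, 7}}
rng = np.random.default_rng(0)
print(sample_negatives(0, 2, 2020, num_nodes=10, q=4,
                       true_tails=true_tails, rng=rng))
```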

Methods and Comparative Analysis

Through a rigorous analysis of various state-of-the-art methods, including RE-GCN, TLogic, CEN, TGN, and STHN, TGB 2.0 demonstrates that no single method excels universally across all datasets. Notable observations include:

  • Heuristic Baselines: The Recurrency Baseline and EdgeBank, while simple, often perform competitively, and their scalability makes them particularly valuable on larger datasets (a minimal EdgeBank sketch follows this list).
  • Method Performance: While sophisticated models like RE-GCN and CEN perform well on smaller datasets, they encounter significant scalability issues with larger datasets like tkgl-wikidata.
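
EdgeBank illustrates why such heuristics scale: in its simplest form it memorizes previously seen edges and scores a query by set membership. Below is a minimal sketch; the original formulation also has a sliding-time-window variant, and including the relation in the memory key here reflects the paper's finding that edge-type information matters.

```python
class EdgeBank:
    """Score 1.0 for any (head, rel, tail) triple seen in the past, else 0.0."""

    def __init__(self):
        self.memory = set()

    def update(self, head: int, rel: int, tail: int) -> None:
        self.memory.add((head, rel, tail))

    def predict(self, head: int, rel: int, tail: int) -> float:
        return 1.0 if (head, rel, tail) in self.memory else 0.0
```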

Implications and Future Directions

The results from TGB 2.0 underscore the need for further research into scalable methods for temporal graph learning. The strong performance of simple heuristic methods suggests that there is substantial room for improvement in current approaches. The benchmark sets a new standard for evaluating machine learning models on temporal graphs, promoting more rigorous and reproducible research.

In conclusion, TGB 2.0 represents a significant step forward in the evaluation of methods for temporal knowledge graphs and heterogeneous graphs. By providing large, diverse datasets and a comprehensive evaluation pipeline, it offers invaluable tools for advancing research in this field. Future developments should focus on enhancing the scalability of models and incorporating additional datasets, thereby continuing to push the boundaries of what is achievable in temporal graph learning.
