IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research (2302.13522v2)

Published 27 Feb 2023 in cs.LG, cs.AI, cs.DC, and cs.IR

Abstract: Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications, but one of the major obstacles in GNN research is the lack of large-scale flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine if the GNN model's low accuracy for unseen data is inherently due to insufficient training data or if the model failed to generalize. Additionally, datasets used to train GNNs need to offer flexibility to enable a thorough study of the impact of various factors while training GNN models. In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that the developers can use to train, scrutinize and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous academic graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest graph datasets publicly available, the IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is a collection of academic graphs designed to be flexible, enabling the study of various GNN architectures, embedding generation techniques, and analyzing system performance issues for node classification tasks. IGB is open-sourced, supports DGL and PyG frameworks, and comes with releases of the raw text that we believe foster emerging LLMs and GNN research projects. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.


Summary

  • The paper introduces IGB, a dataset featuring 269 million nodes with 40% labeled, dramatically expanding scale and labeling for graph neural network studies.
  • It employs complementary homogeneous and heterogeneous graph designs with varied classification tasks (from 19 to 2983 classes) to rigorously examine GNN performance.
  • IGB is open-sourced and compatible with popular frameworks like DGL and PyG, fostering accessible, collaborative research in deep learning.

An Overview of the Illinois Graph Benchmark (IGB)

The paper entitled "IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research" introduces the Illinois Graph Benchmark (IGB), a comprehensive research dataset tool for advancing graph neural network (GNN) research. The work responds to a persistent obstacle in the field: existing public datasets are limited in scale, labeling coverage, and feature diversity.

Key Contributions

IGB sets itself apart by addressing the limitations in dataset size, labeling, and flexibility evident in the graph datasets currently used for GNN training. Its notable contributions include:

  1. Scale and Labeling: The IGB datasets encompass both homogeneous and heterogeneous academic graphs of unprecedented scale, with 269 million nodes, more than 40% of which are labeled. IGB offers over 162 times more labeled data than the largest existing public graph datasets, supporting more robust model training and improving predictive accuracy and generalizability.
  2. Comprehensive Dataset Design: IGB enables extensive exploration of different GNN architectures and node classification tasks of varying complexity (from 19 to 2983 classes). This flexibility lets researchers systematically study GNN performance as a function of dataset characteristics such as embedding generation technique and classification complexity.
  3. Open Resource and Compatibility: The dataset is open-sourced under a flexible license, promoting wide accessibility and collaboration. It also supports the prevalent Deep Graph Library (DGL) and PyTorch Geometric (PyG) frameworks, easing integration into existing research pipelines (a loading sketch follows this list).
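
As a concrete illustration of the framework compatibility in item 3, the sketch below loads an IGB graph through its DGL-facing dataset class. The module path, class name, and argument names follow the public IGB repository at the time of writing and should be treated as assumptions that may change between releases.

```python
# Minimal sketch: loading an IGB graph via its DGL-compatible loader.
# The module path, class name, and argument names below follow the public
# IGB repository at the time of writing -- treat them as assumptions.
import argparse

from igb.dataloader import IGB260MDGLDataset  # assumed module layout

args = argparse.Namespace(
    path="/data/igb",       # root directory of the downloaded dataset
    dataset_size="tiny",    # assumed variants: tiny, small, medium, large, full
    num_classes=19,         # the 19-class task; 2983 selects the fine-grained one
    in_memory=1,            # keep node features resident in RAM
    synthetic=0,            # 0 = use the real Sentence-BERT node embeddings
)

dataset = IGB260MDGLDataset(args)
graph = dataset[0]          # a DGLGraph with features, labels, and split masks
print(graph)
```

Because the loader returns a standard DGL graph object, sampling and training code written for other DGL datasets can typically be reused unchanged.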

Methodological Advancements

IGB is built from real-world, large-scale academic graph data, primarily extracted from the Microsoft Academic Graph (MAG) and the Semantic Scholar Open Research Corpus. The design methodology ensures consistency and relevance across the homogeneous graphs (IGB-HOM) and the multi-typed heterogeneous graphs (IGB-HET), supporting a diverse range of studies of both structural and semantic graph learning tasks.
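
To make the heterogeneous design concrete, the toy sketch below shows how an IGB-HET-style academic schema (papers, authors, institutes, and fields of study) maps onto a DGL heterograph. The type names mirror the schema described for IGB-HET; the edge lists are invented for illustration, and this is not the authors' construction code.

```python
# Illustrative sketch of an IGB-HET-style academic graph as a DGL
# heterograph. Node/edge type names mirror the described schema; the
# tensors of node ids here are toy data, not the real dataset.
import torch
import dgl

graph_data = {
    ("paper", "cites", "paper"):              (torch.tensor([0, 1]), torch.tensor([1, 2])),
    ("paper", "written_by", "author"):        (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ("author", "affiliated_to", "institute"): (torch.tensor([0]),    torch.tensor([0])),
    ("paper", "topic", "fos"):                (torch.tensor([0, 2]), torch.tensor([0, 1])),
}

g = dgl.heterograph(graph_data)
print(g.ntypes)            # ['author', 'fos', 'institute', 'paper']
print(g.canonical_etypes)  # the four (src, relation, dst) triples above
```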

The embedding generation process in IGB is particularly noteworthy for its use of Sentence-BERT, which provides a robust starting point for a variety of graph learning tasks. The paper evaluates the impact of different node embeddings and embedding dimensions, offering insight into the accuracy and memory trade-offs such variations introduce.
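
A minimal sketch of this style of embedding generation is shown below, using the sentence-transformers library. The specific checkpoint ("all-MiniLM-L6-v2", 384-dimensional) is an illustrative choice, not necessarily the variant or dimension the authors used.

```python
# Minimal sketch: turning paper text into node embeddings with Sentence-BERT.
# The checkpoint name is illustrative; the IGB authors may have used a
# different Sentence-BERT variant and embedding dimension.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

abstracts = [
    "Graph neural networks have shown high potential for real-world tasks.",
    "We study large-scale academic graphs for node classification.",
]

embeddings = model.encode(abstracts)  # ndarray of shape (num_texts, 384)
print(embeddings.shape)
```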

Implications and Future Directions

This dataset addresses foundational gaps in the field of GNNs by delivering a platform where scale does not inhibit research into model efficiency, scalability, and accuracy across a multitude of architectures. The ability to study the effect of increased labeled data on model accuracy stands out, offering pathways to a more nuanced understanding of GNN generalization in low-label scenarios.

From a theoretical perspective, IGB facilitates exploration into embedding space learning, understanding the intricacies of heterogeneous graph-based representations, and improving system-level efficiencies derived from distributed graph processing. The open-source nature of IGB further stimulates AI community engagement, enabling modifications and expansion that could lead to new methodologies in graph-based learning.

Conclusion

In summary, the introduction of the Illinois Graph Benchmark represents a robust advancement in the tools available for exploring graph neural networks. By addressing gaps in scale, labeling, and dataset flexibility, IGB provides a necessary scaffold for both practical applications and theoretical advancements in graph-structured learning systems, establishing itself as a pivotal resource for the ongoing evolution of deep learning research in graph contexts.