IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research (2302.13522v2)

Published 27 Feb 2023 in cs.LG, cs.AI, cs.DC, and cs.IR

Abstract: Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications, but one of the major obstacles in GNN research is the lack of large-scale flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine if the GNN model's low accuracy for unseen data is inherently due to insufficient training data or if the model failed to generalize. Additionally, datasets used to train GNNs need to offer flexibility to enable a thorough study of the impact of various factors while training GNN models. In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that the developers can use to train, scrutinize and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous academic graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest graph datasets publicly available, the IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is a collection of academic graphs designed to be flexible, enabling the study of various GNN architectures, embedding generation techniques, and analyzing system performance issues for node classification tasks. IGB is open-sourced, supports DGL and PyG frameworks, and comes with releases of the raw text that we believe foster emerging LLMs and GNN research projects. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.


Summary

  • The paper introduces IGB, a dataset featuring 269 million nodes with 40% labeled, dramatically expanding scale and labeling for graph neural network studies.
  • It employs complementary homogeneous and heterogeneous graph designs with varied classification tasks (from 19 to 2983 classes) to rigorously examine GNN performance.
  • IGB is open-sourced and compatible with popular frameworks like DGL and PyG, fostering accessible, collaborative research in deep learning.

An Overview of the Illinois Graph Benchmark (IGB)

The paper entitled "IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research" introduces the Illinois Graph Benchmark (IGB), a comprehensive research dataset tool for advancing graph neural network (GNN) research. The work responds to a persistent obstacle in the field: existing public datasets are limited in scale, labeling coverage, and feature diversity.

Key Contributions

IGB sets itself apart by addressing the limitations in dataset size, labeling, and flexibility evident in the graph datasets currently used for GNN training. Its notable contributions include:

  1. Scale and Labeling: The IGB datasets encompass both homogeneous and heterogeneous academic graphs of unprecedented scale, with 269 million nodes, more than 40% of which are labeled. IGB offers over 162 times more labeled data than the largest existing public graph datasets, supporting more robust model training and improving predictive accuracy and generalizability.
  2. Comprehensive Dataset Design: IGB enables extensive exploration of different GNN architectures and node classification tasks of varying complexity (from 19 to 2983 classes). This flexibility lets researchers systematically study GNN performance as a function of dataset characteristics such as embedding generation technique and classification complexity.
  3. Open Resource and Compatibility: The dataset is open-sourced under a flexible license, promoting wide accessibility and collaboration. It also supports the prevalent Deep Graph Library (DGL) and PyTorch Geometric (PyG) frameworks, easing integration into existing research pipelines (a loading sketch follows this list).
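
As a concrete illustration of the framework compatibility in item 3, the sketch below loads an IGB graph through its DGL-facing dataset class. The module path, class name, and argument names follow the public IGB repository at the time of writing and should be treated as assumptions that may change between releases.

```python
# Minimal sketch: loading an IGB graph via its DGL-compatible loader.
# The module path, class name, and argument names below follow the public
# IGB repository at the time of writing -- treat them as assumptions.
import argparse

from igb.dataloader import IGB260MDGLDataset  # assumed module layout

args = argparse.Namespace(
    path="/data/igb",       # root directory of the downloaded dataset
    dataset_size="tiny",    # assumed variants: tiny, small, medium, large, full
    num_classes=19,         # the 19-class task; 2983 selects the fine-grained one
    in_memory=1,            # keep node features resident in RAM
    synthetic=0,            # 0 = use the real Sentence-BERT node embeddings
)

dataset = IGB260MDGLDataset(args)
graph = dataset[0]          # a DGLGraph with features, labels, and split masks
print(graph)
```

Because the loader returns a standard DGL graph object, sampling and training code written for other DGL datasets can typically be reused unchanged.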

Methodological Advancements

IGB is built from real-world, large-scale academic graph data, primarily extracted from the Microsoft Academic Graph (MAG) and the Semantic Scholar Open Research Corpus. The design methodology ensures consistency and relevance across the homogeneous graphs (IGB-HOM) and the multi-typed heterogeneous graphs (IGB-HET), supporting a diverse range of studies of both structural and semantic graph learning tasks.
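
To make the heterogeneous design concrete, the toy sketch below shows how an IGB-HET-style academic schema (papers, authors, institutes, and fields of study) maps onto a DGL heterograph. The type names mirror the schema described for IGB-HET; the edge lists are invented for illustration, and this is not the authors' construction code.

```python
# Illustrative sketch of an IGB-HET-style academic graph as a DGL
# heterograph. Node/edge type names mirror the described schema; the
# tensors of node ids here are toy data, not the real dataset.
import torch
import dgl

graph_data = {
    ("paper", "cites", "paper"):              (torch.tensor([0, 1]), torch.tensor([1, 2])),
    ("paper", "written_by", "author"):        (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ("author", "affiliated_to", "institute"): (torch.tensor([0]),    torch.tensor([0])),
    ("paper", "topic", "fos"):                (torch.tensor([0, 2]), torch.tensor([0, 1])),
}

g = dgl.heterograph(graph_data)
print(g.ntypes)            # ['author', 'fos', 'institute', 'paper']
print(g.canonical_etypes)  # the four (src, relation, dst) triples above
```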

The embedding generation process in IGB is particularly noteworthy for its use of Sentence-BERT, which provides a robust starting point for a variety of graph learning tasks. The paper evaluates the impact of different node embeddings and embedding dimensions, offering insight into the accuracy and memory trade-offs such variations introduce.
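
A minimal sketch of this style of embedding generation is shown below, using the sentence-transformers library. The specific checkpoint ("all-MiniLM-L6-v2", 384-dimensional) is an illustrative choice, not necessarily the variant or dimension the authors used.

```python
# Minimal sketch: turning paper text into node embeddings with Sentence-BERT.
# The checkpoint name is illustrative; the IGB authors may have used a
# different Sentence-BERT variant and embedding dimension.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

abstracts = [
    "Graph neural networks have shown high potential for real-world tasks.",
    "We study large-scale academic graphs for node classification.",
]

embeddings = model.encode(abstracts)  # ndarray of shape (num_texts, 384)
print(embeddings.shape)
```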

Implications and Future Directions

This dataset addresses foundational gaps in the field of GNNs by delivering a platform where scale does not inhibit research into model efficiency, scalability, and accuracy across a multitude of architectures. The ability to study the effect of increased labeled data on model accuracy stands out, offering pathways to a more nuanced understanding of GNN generalization in low-label scenarios.

From a theoretical perspective, IGB facilitates exploration into embedding space learning, understanding the intricacies of heterogeneous graph-based representations, and improving system-level efficiencies derived from distributed graph processing. The open-source nature of IGB further stimulates AI community engagement, enabling modifications and expansion that could lead to new methodologies in graph-based learning.

Conclusion

In summary, the introduction of the Illinois Graph Benchmark represents a robust advancement in the tools available for exploring graph neural networks. By addressing gaps in scale, labeling, and dataset flexibility, IGB provides a necessary scaffold for both practical applications and theoretical advancements in graph-structured learning systems, establishing itself as a pivotal resource for the ongoing evolution of deep learning research in graph contexts.