IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research (2302.13522v2)
Abstract: Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications, but one of the major obstacles in GNN research is the lack of large-scale flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine whether a GNN model's low accuracy on unseen data is due to insufficient training data or a failure to generalize. Additionally, datasets used to train GNNs need to offer flexibility so that the impact of various factors on GNN training can be studied thoroughly. In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that developers can use to train, scrutinize, and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous academic graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest publicly available graph datasets, IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is a collection of academic graphs designed to be flexible, enabling the study of various GNN architectures and embedding generation techniques, and the analysis of system performance issues in node classification tasks. IGB is open-sourced, supports the DGL and PyG frameworks, and comes with releases of the raw text, which we believe will foster emerging LLM and GNN research projects. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.
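The abstract describes IGB as a node-classification benchmark with DGL and PyG support and more than 40% of nodes labeled. As a rough illustration only, the sketch below shows how such a labeled academic graph might be prepared for semi-supervised node classification in DGL. The actual IGB loader is not used here; the stand-in graph, the 40% labeling rate applied to random nodes, and the field names (`feat`, `label`, the mask names) are assumptions made for this example, so consult the IGB repository for the real dataset API.

```python
# Minimal sketch (not the IGB API): build a small stand-in graph in DGL and
# attach features, labels, and train/val/test masks the way a labeled
# node-classification dataset typically exposes them.
import dgl
import torch

num_nodes, feat_dim, num_classes = 1_000, 1_024, 19  # illustrative sizes only

# Random stand-in graph; a real IGB graph would come from the released loader.
src = torch.randint(0, num_nodes, (5_000,))
dst = torch.randint(0, num_nodes, (5_000,))
graph = dgl.graph((src, dst), num_nodes=num_nodes)
graph.ndata["feat"] = torch.randn(num_nodes, feat_dim)
graph.ndata["label"] = torch.randint(0, num_classes, (num_nodes,))

# IGB labels more than 40% of nodes; here we mark a ~40% labeled subset at
# random and split it into train/val/test masks (80/10/10, an assumption).
labeled = torch.rand(num_nodes) < 0.4
perm = labeled.nonzero(as_tuple=True)[0][torch.randperm(int(labeled.sum()))]
n_train, n_val = int(0.8 * len(perm)), int(0.1 * len(perm))
for name in ("train_mask", "val_mask", "test_mask"):
    graph.ndata[name] = torch.zeros(num_nodes, dtype=torch.bool)
graph.ndata["train_mask"][perm[:n_train]] = True
graph.ndata["val_mask"][perm[n_train:n_train + n_val]] = True
graph.ndata["test_mask"][perm[n_train + n_val:]] = True

print(graph)
```

A GNN model (e.g., GraphSAGE or GAT as implemented in DGL or PyG) would then be trained on the nodes selected by `train_mask` and evaluated on the held-out labeled nodes; the point of IGB's larger labeled fraction is that such splits can be made much larger without exhausting the labeled pool.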