Pitfalls of Graph Neural Network Evaluation (1811.05868v2)

Published 14 Nov 2018 in cs.LG, cs.SI, and stat.ML

Abstract: Semi-supervised node classification in graphs is a fundamental problem in graph mining, and the recently proposed graph neural networks (GNNs) have achieved unparalleled results on this task. Due to their massive success, GNNs have attracted a lot of attention, and many novel architectures have been put forward. In this paper we show that existing evaluation strategies for GNN models have serious shortcomings. We show that using the same train/validation/test splits of the same datasets, as well as making significant changes to the training procedure (e.g. early stopping criteria) precludes a fair comparison of different architectures. We perform a thorough empirical evaluation of four prominent GNN models and show that considering different splits of the data leads to dramatically different rankings of models. Even more importantly, our findings suggest that simpler GNN architectures are able to outperform the more sophisticated ones if the hyperparameters and the training procedure are tuned fairly for all models.

Summary

The paper demonstrates that reliance on fixed train/validation/test splits biases GNN evaluations and obscures how well models actually generalize.
The paper finds that standardizing training procedures and hyperparameter tuning exposes unexpected performance differences among popular GNN models.
The paper shows that simpler models like GCN can outperform complex architectures when assessed under rigorous, randomized experimental setups.

Pitfalls of Graph Neural Network Evaluation

The paper "Pitfalls of Graph Neural Network Evaluation" by Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann examines how Graph Neural Network (GNN) models are evaluated empirically. It focuses on semi-supervised node classification in graphs, a fundamental task in graph mining.

Key Concerns in GNN Evaluation

The authors' primary critique concerns the reliability and validity of existing evaluation strategies for GNNs. Specifically, most papers reuse the same train/validation/test splits of the same established datasets (e.g., CORA, CiteSeer, PubMed), which makes reported numbers look comparable while hiding a flaw: the practice favors models that overfit one particular split, so it cannot reliably identify the models that generalize best.

Inconsistent training procedures and hyperparameter tuning further complicate comparisons. New GNN architectures are often benchmarked with training settings (for example, early stopping criteria and learning rates) that differ from those used for existing models. With the experimental setup varying in this way, it is hard to tell whether a reported gain comes from the architecture itself or simply from better tuning.
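One way to remove the training procedure as a confounding factor is to run every model through the same stopping rule. The following is a minimal sketch of patience-based early stopping on validation loss. The function name, the `train_step` and `val_loss` callbacks, and the default values of `max_epochs` and `patience` are illustrative assumptions, not the settings used in the paper.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    train_step: Callable[[], None],
    val_loss: Callable[[], float],
    max_epochs: int = 1000,   # hypothetical default, not taken from the paper
    patience: int = 50,       # hypothetical default, not taken from the paper
) -> Tuple[int, float]:
    """Train until validation loss has not improved for `patience` epochs.

    `train_step` runs one epoch of training; `val_loss` returns the current
    validation loss. Both are supplied by the caller, so the same stopping
    rule can be shared by every model under comparison.
    Returns (best_epoch, best_validation_loss).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_step()
        loss = val_loss()
        if loss < best_loss:
            # New best validation loss: remember where it occurred.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop.
            break
    return best_epoch, best_loss
```

Because the stopping rule only sees two callbacks, it is independent of the architecture being trained, which is the property a fair comparison needs.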

Experimental Protocol

The authors conducted a comprehensive empirical evaluation of four prominent GNN models: GCN, MoNet, GraphSAGE, and GAT. They standardized the training and hyperparameter selection procedures for these models to ensure a fair comparison. Additionally, the evaluation was extended to include four baseline methods: Logistic Regression (LogReg), Multilayer Perceptron (MLP), Label Propagation, and Normalized Laplacian Label Propagation.

To enrich the evaluation landscape, the authors introduced four new datasets for node classification tasks: Coauthor CS, Coauthor Physics, Amazon Computers, and Amazon Photo. These were added alongside the well-known CORA, CiteSeer, PubMed, and CORA-Full datasets. For each dataset, they generated 100 random train/validation/test splits and conducted 20 random initializations per split, ensuring a robust assessment of the models' generalization ability.
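The protocol of repeated random splits and repeated initializations can be sketched as follows. This is an illustration under stated assumptions: sampling a fixed number of nodes per class is one common way to build semi-supervised splits, and the function names, the per-class counts, and the use of integer seeds are hypothetical choices made here. Only the 100 x 20 grid size comes from the text above.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def random_split(
    labels: Sequence[int],
    n_train_per_class: int,
    n_val_per_class: int,
    seed: int,
) -> Dict[str, List[int]]:
    """Draw one random train/validation/test split of node indices.

    Assumption: a fixed number of nodes per class goes to train and to
    validation; all remaining nodes form the test set.
    """
    rng = random.Random(seed)  # local RNG so each split is reproducible
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)

    train, val, test = [], [], []
    for label in sorted(by_class):
        nodes = by_class[label]
        rng.shuffle(nodes)
        cut = n_train_per_class + n_val_per_class
        train += nodes[:n_train_per_class]
        val += nodes[n_train_per_class:cut]
        test += nodes[cut:]
    return {"train": sorted(train), "val": sorted(val), "test": sorted(test)}


def experiment_grid(n_splits: int = 100, n_inits: int = 20) -> List[Tuple[int, int]]:
    """Enumerate (split_seed, init_seed) pairs: every split is paired with
    every weight-initialization seed."""
    return [(s, i) for s in range(n_splits) for i in range(n_inits)]
```

With the defaults, `experiment_grid()` yields 2,000 runs per model and dataset, which is why averaging over the grid gives a far more stable estimate than a single fixed split.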

Results and Observations

The experiments yield three main findings:

Influence of Data Splits: The performance ranking of GNN models varies drastically across different data splits. For example, while GAT scored highest on the standard CORA and CiteSeer splits, GCN came out ahead when results were averaged over random splits. This variability shows how fragile, and potentially misleading, single-split evaluations are.
Relative Model Performance: GCN emerged as the most consistent performer across multiple datasets, often outperforming more complex architectures like GAT and MoNet. This observation challenges the notion that increased model complexity equates to better performance, especially when hyperparameter tuning is uniformly rigorous.
Baseline Comparisons: GNN-based methods significantly outperformed attribute-only (LogReg, MLP) and structure-only (Label Propagation) baselines. This result corroborates the efficacy of GNN approaches in leveraging both node attributes and graph structure.

Notably, GAT exhibited high variance in its performance on the Amazon datasets, occasionally producing suboptimal results due to certain outlier weight initializations. This finding accentuates the importance of initialization strategies in GNN performance.
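Both effects described above, rankings that change with the split and variance across runs, can be read off a table of per-run accuracies. The sketch below assumes results are stored as `acc[model][split][init]`. That layout, the function names, and the model names and numbers in the usage example are made up for illustration and are not results from the paper.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Dict, List, Tuple

# Assumed layout: acc[model][split_index][init_index] -> test accuracy
Results = Dict[str, List[List[float]]]


def summarize(acc: Results) -> Dict[str, Tuple[float, float]]:
    """Mean and sample standard deviation of accuracy over all runs."""
    summary = {}
    for model, splits in acc.items():
        runs = [a for split in splits for a in split]
        summary[model] = (mean(runs), stdev(runs))
    return summary


def wins_per_split(acc: Results) -> Counter:
    """Count, for each model, the splits on which it has the highest
    accuracy averaged over initializations (ties go to the model listed
    first in `acc`)."""
    n_splits = len(next(iter(acc.values())))
    wins = Counter()
    for s in range(n_splits):
        best = max(acc, key=lambda m: mean(acc[m][s]))
        wins[best] += 1
    return wins


# Toy usage with invented numbers for two placeholder models.
toy = {
    "A": [[0.80, 0.82], [0.70, 0.72]],
    "B": [[0.79, 0.79], [0.75, 0.75]],
}
print(summarize(toy))
print(wins_per_split(toy))
```

In the toy data, each placeholder model wins on one split. A single fixed split would therefore crown whichever model happened to suit it, while the aggregate view shows no clear winner.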

Implications and Future Directions

The paper highlights profound implications for both theoretical exploration and practical implementation of GNNs. From a theoretical standpoint, the observed performance of simpler GNN models raises questions about the necessity and utility of increasingly complex architectures. It also suggests that improvements in model performance may often stem from enhanced hyperparameter optimization rather than architectural sophistication.

Practically, this research advocates for more rigorous and standardized evaluation protocols in GNN research. Multiple random splits and standardized training procedures should become the norm, so that comparisons are fairer and more reliable. The new datasets also make it possible to test models on a wider range of graphs than the widely used citation benchmarks.

Conclusion

The paper by Shchur et al. provides an essential critique of current GNN evaluation practices, urging the research community to adopt more robust and standardized methods. By demonstrating the pitfalls of traditional evaluation protocols and the surprising efficacy of simpler GNN models, this work sets the stage for more reliable and insightful advancements in graph neural network research. Future work should investigate the conditions under which particular GNN architectures excel and develop evaluation metrics better suited to the complexities of graph data.
