
graph2vec: Learning Distributed Representations of Graphs (1707.05005v1)

Published 17 Jul 2017 in cs.AI, cs.CL, cs.CR, cs.NE, and cs.SE

Abstract: Recent works on representation learning for graph structured data predominantly focus on learning distributed representations of graph substructures such as nodes and subgraphs. However, many graph analytics tasks such as graph classification and clustering require representing entire graphs as fixed length feature vectors. While the aforementioned approaches are naturally unequipped to learn such representations, graph kernels remain as the most effective way of obtaining them. However, these graph kernels use handcrafted features (e.g., shortest paths, graphlets, etc.) and hence are hampered by problems such as poor generalization. To address this limitation, in this work, we propose a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs. graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic. Hence, they could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Our experiments on several benchmark and large real-world datasets show that graph2vec achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and are competitive with state-of-the-art graph kernels.

Citations (686)

Summary

  • The paper presents an unsupervised method that learns embeddings of entire graphs by treating them as documents and their rooted subgraphs as words.
  • It employs a skipgram model with negative sampling to capture structural equivalence, moving beyond handcrafted graph features.
  • Experimental results on benchmark datasets and real-world tasks, such as malware detection, demonstrate its effectiveness.

Overview of "graph2vec: Learning Distributed Representations of Graphs"

The paper introduces "graph2vec," a neural embedding framework designed to learn distributed representations of entire graphs. Unlike traditional approaches focusing on graph substructures, graph2vec addresses the need to represent entire graphs as fixed-length feature vectors suitable for tasks such as classification and clustering.

Core Contributions

The authors present graph2vec with the following notable features:

  • Unsupervised Learning: Graph2vec learns embeddings without relying on class labels, ensuring versatility across various applications.
  • Task-Agnostic Approach: The embeddings learned are not specific to any single machine learning task, permitting reuse in diverse analytical contexts.
  • Data-Driven Embeddings: By learning from a corpus of graph data, graph2vec circumvents the limitations of handcrafted features that often result in sparse and high-dimensional representations.
  • Structural Equivalence: Utilizing rooted subgraphs preserves structural equivalence, leading to more accurate representations of graph structures.

Methodology

Graph2vec conceptualizes entire graphs as analogous to documents and rooted subgraphs as analogous to words. This analogy allows the application of document embedding techniques to graph data. The embeddings are data-driven, improving upon traditional graph kernels which rely on manually defined features.
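This analogy can be made concrete by extracting each graph's "words": rooted subgraphs obtained through Weisfeiler-Lehman relabeling, where each iteration folds a node's neighborhood labels into a new label, so the label after h iterations identifies that node's rooted subgraph of height h. A minimal Python sketch (the adjacency-dict representation and the function name are illustrative, not the authors' code):

```python
def wl_rooted_subgraphs(adj, labels, depth):
    """Enumerate rooted-subgraph 'words' of a labeled graph via
    Weisfeiler-Lehman relabeling (a sketch of the idea, not the
    paper's implementation).

    adj:    dict mapping node -> list of neighbor nodes
    labels: dict mapping node -> initial node label (string)
    depth:  number of WL iterations (max subgraph height)
    """
    current = dict(labels)
    vocab = [current[v] for v in adj]  # height-0 subgraphs: node labels
    for _ in range(depth):
        new = {}
        for v in adj:
            # Combine a node's label with its sorted neighbor labels;
            # the result canonically names its height-(h+1) rooted subgraph.
            neigh = sorted(current[u] for u in adj[v])
            new[v] = current[v] + "(" + ",".join(neigh) + ")"
        current = new
        vocab.extend(current[v] for v in adj)
    return vocab


# Example: a triangle with node labels A, A, B.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
words = wl_rooted_subgraphs(triangle, {0: "A", 1: "A", 2: "B"}, depth=1)
```

The multiset `words` then plays the role a document's words play in doc2vec: two graphs whose WL labels overlap heavily are structurally similar.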

The workflow involves:

  1. Extracting rooted subgraphs around every node via Weisfeiler-Lehman relabeling, which together form the graph's vocabulary.
  2. Training a skipgram model with negative sampling so that each graph's embedding predicts the rooted subgraphs it contains, thereby preserving the graph's composition through its substructures.
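The second step can be sketched as a PV-DBOW-style skipgram with negative sampling. The toy NumPy implementation below (hyperparameters, function name, and training loop are illustrative assumptions, not the authors' code) trains each graph embedding to score its own subgraph "words" above randomly drawn negative words:

```python
import numpy as np

def train_graph_embeddings(docs, dim=8, epochs=200, neg=2, lr=0.05, seed=0):
    """Skipgram with negative sampling over graph 'documents' (a sketch).

    docs: list of graphs, each a list of rooted-subgraph label strings.
    Returns (G, vocab) where G[i] is the embedding of graph i.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: i for i, w in enumerate(vocab)}
    G = rng.normal(0.0, 0.1, (len(docs), dim))   # graph embeddings
    W = rng.normal(0.0, 0.1, (len(vocab), dim))  # subgraph-word embeddings
    for _ in range(epochs):
        for g, doc in enumerate(docs):
            for w in doc:
                # One positive (the observed subgraph) plus `neg` random negatives.
                pairs = [(idx[w], 1.0)]
                pairs += [(int(rng.integers(len(vocab))), 0.0) for _ in range(neg)]
                for t, label in pairs:
                    score = 1.0 / (1.0 + np.exp(-(G[g] @ W[t])))  # sigmoid
                    step = lr * (label - score)   # gradient of log-likelihood
                    g_update = step * W[t]
                    W[t] += step * G[g]
                    G[g] += g_update
    return G, vocab
```

Graphs that share many subgraph words are pulled toward the same word vectors, so their embeddings end up close; for example, with `docs = [["A", "A(A,B)", "B"], ["A", "A(A,B)", "B"], ["C", "C(C,C)", "D"]]`, the first two graph embeddings become far more similar to each other than to the third.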

Experimental Evaluation

The authors evaluate graph2vec on both benchmark datasets and real-world applications, such as Android malware detection and familial clustering of malware samples.

  • Benchmark Datasets: Graph2vec outperformed or matched state-of-the-art methods on three of five datasets, showcasing its efficacy in standard classification tasks.
  • Real-World Applications: Graph2vec demonstrated superior accuracy in malware detection and clustering tasks, surpassing other graph embedding methods by significant margins in practical, large-scale datasets.

Implications and Future Directions

Graph2vec offers a versatile tool for a range of graph analytics tasks by providing generic, reusable embeddings. These results point to further developments in unsupervised representation learning, encouraging investigation into scaling to larger and more complex graph datasets. Future research could explore hybrid models that integrate task-specific features into the graph2vec framework while preserving its data-driven nature.

In conclusion, graph2vec advances the capabilities of graph representation learning by moving away from the constraints of substructure-focused embeddings and handcrafted kernel methods. Its applicability across multiple domains suggests significant utility in research and industry applications where graph-structured data is prevalent.
