Attributed Graph Clustering via Adaptive Graph Convolution (1906.01210v1)

Published 4 Jun 2019 in cs.LG, cs.AI, and stat.ML

Abstract: Attributed graph clustering is challenging as it requires joint modelling of graph structures and node attributes. Recent progress on graph convolutional networks has proved that graph convolution is effective in combining structural and content information, and several recent methods based on it have achieved promising clustering performance on some real attributed networks. However, there is limited understanding of how graph convolution affects clustering performance and how to properly use it to optimize performance for different graphs. Existing methods essentially use graph convolution of a fixed and low order that only takes into account neighbours within a few hops of each node, which underutilizes node relations and ignores the diversity of graphs. In this paper, we propose an adaptive graph convolution method for attributed graph clustering that exploits high-order graph convolution to capture global cluster structure and adaptively selects the appropriate order for different graphs. We establish the validity of our method by theoretical analysis and extensive experiments on benchmark datasets. Empirical results show that our method compares favourably with state-of-the-art methods.

Citations (271)

Summary

  • The paper introduces AGC, a novel method that adaptively selects the optimal k-order graph convolution to enhance feature smoothing for clustering.
  • AGC leverages iterative low-pass filtering using a symmetric normalized Laplacian to aggregate multi-hop neighborhood information and improve cluster compactness.
  • Experimental results on datasets like Cora and Citeseer show AGC outperforms fixed-order GCN methods by achieving higher accuracy, NMI, and F1 scores.

The paper "Attributed Graph Clustering via Adaptive Graph Convolution" (1906.01210) introduces the Adaptive Graph Convolution (AGC) method for attributed graph clustering. This technique aims to improve upon existing methods by leveraging higher-order graph convolutions to capture global cluster structures and by adaptively determining the optimal convolution order for a given graph, addressing the limitation of fixed, low-order convolutions commonly used in Graph Convolutional Network (GCN) based approaches.

Methodology: Adaptive Graph Convolution (AGC)

The core idea of AGC is to pre-process node features using a graph convolution operator specifically designed as a low-pass filter, thereby smoothing the features according to the graph topology. This smoothing encourages nodes within the same cluster (presumed to be densely connected) to have more similar feature representations. The process is decoupled from deep learning architectures; it's a feature transformation step followed by a standard clustering algorithm.

1. k-Order Low-Pass Graph Convolution:

AGC employs a specific graph filter derived from the symmetrically normalized Laplacian $L_s = I - D^{-1/2} A D^{-1/2}$, where $A$ is the adjacency matrix and $D$ is the degree matrix. The chosen base filter is $G = I - 0.5 L_s$. This filter is applied iteratively $k$ times to the original node feature matrix $X$:

$$\bar{X} = G^k X = (I - 0.5 L_s)^k X$$

This operation constitutes a $k$-order graph convolution. Each application of $G$ effectively averages a node's features with those of its immediate neighbors. Applying $G^k$ aggregates information from neighbors up to $k$ hops away, acting as a low-pass filter in the graph spectral domain. The frequency response of $G$ is $p(\lambda) = 1 - 0.5\lambda$, where $\lambda$ denotes an eigenvalue of $L_s$. Since $0 \le \lambda \le 2$ for $L_s$, the response $p(\lambda)$ is non-negative and non-increasing, satisfying the conditions for a low-pass filter that smooths the signal (node features) by attenuating high-frequency components associated with feature variations between adjacent nodes.
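
This propagation step is easy to reproduce outside any deep learning framework. The snippet below is a minimal sketch, assuming a SciPy sparse adjacency matrix `A` (N x N) and a dense NumPy feature matrix `X` (N x d); the function name and structure are illustrative, not taken from the paper's released code.

```python
import numpy as np
import scipy.sparse as sp

def smooth_features(A, X, k):
    """Apply the low-pass filter G = I - 0.5 * L_s to X, k times (k-order convolution)."""
    n = A.shape[0]
    deg = np.asarray(A.sum(axis=1)).ravel()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = 1.0 / np.sqrt(deg)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0           # guard against isolated nodes
    D_inv_sqrt = sp.diags(d_inv_sqrt)
    L_s = sp.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt    # symmetrically normalized Laplacian
    G = sp.eye(n) - 0.5 * L_s                        # frequency response p(lambda) = 1 - lambda/2
    X_bar = np.asarray(X, dtype=float)
    for _ in range(k):
        X_bar = G @ X_bar                            # one hop of neighborhood averaging
    return X_bar
```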

2. Spectral Clustering on Smoothed Features:

After obtaining the smoothed feature matrix $\bar{X}$, a similarity matrix $W$ is constructed. The paper uses a linear kernel: $K = \bar{X} \bar{X}^T$. To ensure symmetry and non-negativity, the final similarity matrix is computed as:

$$W = 0.5 \,(|K| + |K^T|)$$

Standard spectral clustering is then applied to this similarity matrix $W$ to partition the nodes into $C$ clusters.
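
The similarity construction and the subsequent clustering can be sketched with scikit-learn's SpectralClustering on a precomputed affinity matrix; the helper below is an illustrative assumption, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_smoothed_features(X_bar, n_clusters, seed=0):
    """Linear-kernel similarity followed by spectral clustering on the smoothed features."""
    K = X_bar @ X_bar.T                          # linear kernel K = X_bar X_bar^T
    W = 0.5 * (np.abs(K) + np.abs(K.T))          # symmetric, non-negative similarity
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            assign_labels="kmeans", random_state=seed)
    return sc.fit_predict(W)
```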

3. Adaptive Order Selection:

A key component of AGC is the adaptive selection of the convolution order $k$. Using a fixed $k$ is suboptimal, as different graphs require different degrees of smoothing, and excessive smoothing (very large $k$) can merge distinct clusters. AGC determines $k$ iteratively:

Algorithm 1: Adaptive Order Selection for AGC

Input: Feature matrix X, Adjacency matrix A, Number of clusters C
Output: Cluster partition C^(k*)

1: Compute Ls = I - D^(-1/2) A D^(-1/2)
2: Initialize smoothed features bar(X)^(0) = X
3: Initialize k = 1
4: Initialize intra_dist_prev = infinity

5: loop
6:    Compute bar(X)^(k) = (I - 0.5 Ls) * bar(X)^(k-1)  // Eq. (10)
7:    Compute similarity W^(k) from bar(X)^(k) using linear kernel and symmetrization
8:    Perform Spectral Clustering on W^(k) to get partition C^(k)
9:    Compute intra-cluster distance intra(C^(k)) using Eq. (11):
       intra(C^(k)) = (1/N) * sum_{c=1 to C} sum_{i in Cluster c} || bar(x)_i^(k) - mean(bar(x)_j^(k) for j in Cluster c) ||^2
10:   if intra(C^(k)) > intra_dist_prev or k reaches max_iterations then
11:      k* = k - 1  // Select previous order
12:      Output C^(k*)
13:      break
14:   end if
15:   intra_dist_prev = intra(C^(k))
16:   k = k + 1
17: end loop

The algorithm iteratively increases the convolution order $k$, performs clustering, and calculates the average intra-cluster variance (distance) using the smoothed features $\bar{X}^{(k)}$. It stops and selects the order $k-1$ corresponding to the first local minimum of the intra-cluster distance. This marks the point where clusters are compact but further smoothing starts to blend them, increasing the variance within the resulting larger, merged clusters.
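
A compact sketch of this selection loop, reusing the smooth_features and cluster_smoothed_features helpers sketched above; the max_k cap is an assumed safety bound, not a value prescribed by the paper.

```python
import numpy as np

def intra_cluster_distance(X_bar, labels):
    """Mean squared distance of each node to its cluster centroid (Eq. (11)-style measure)."""
    total = 0.0
    for c in np.unique(labels):
        members = X_bar[labels == c]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total / X_bar.shape[0]

def agc(A, X, n_clusters, max_k=60):
    """Increase k until the intra-cluster distance first rises, then return the previous partition."""
    prev_dist, prev_labels = np.inf, None
    X_bar = np.asarray(X, dtype=float)
    for k in range(1, max_k + 1):
        X_bar = smooth_features(A, X_bar, 1)                  # one more application of G
        labels = cluster_smoothed_features(X_bar, n_clusters)
        dist = intra_cluster_distance(X_bar, labels)
        if dist > prev_dist:                                  # first local minimum passed
            return prev_labels, k - 1
        prev_dist, prev_labels = dist, labels
    return prev_labels, max_k                                 # cap reached without an increase
```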

Theoretical Analysis

The paper provides theoretical justification for the feature smoothing effect.

Smoothness Quantification: The smoothness of a graph signal $f$ (a column of the feature matrix $X$) is measured using the graph Laplacian quadratic form, a discrete analogue of the Laplace-Beltrami operator:

$$\Omega\left(\frac{f}{\|f\|_2}\right) = \frac{f^T L f}{f^T f}$$

where $L$ is a graph Laplacian (e.g., $L_s$). Lower values indicate smoother signals, meaning connected nodes have more similar values.
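
As a quick illustration, this smoothness measure is simply a Rayleigh quotient and can be evaluated directly; the snippet assumes a precomputed Laplacian L (dense or sparse) and a one-dimensional signal f.

```python
import numpy as np

def smoothness(f, L):
    """Graph Laplacian quadratic form of the normalized signal: f^T L f / f^T f."""
    f = np.asarray(f, dtype=float)
    return float(f @ (L @ f) / (f @ f))
```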

Theorem 1: This theorem states that applying a graph filter $G$, whose frequency response $p(\lambda)$ is non-negative and non-increasing on the spectrum of the Laplacian, to a signal $f$ results in a smoother or equally smooth signal $\bar{f} = Gf$:

$$\Omega\left(\frac{\bar{f}}{\|\bar{f}\|_2}\right) \le \Omega\left(\frac{f}{\|f\|_2}\right)$$

Implication: Since the chosen filter $G = I - 0.5 L_s$ has a frequency response $p(\lambda) = 1 - 0.5\lambda$, which is non-negative and non-increasing for $\lambda \in [0, 2]$ (the range of eigenvalues of $L_s$), Theorem 1 applies. Applying $G^k$ means iteratively applying such a smoothing filter. Consequently, as $k$ increases, the features $\bar{X}$ become progressively smoother with respect to the graph structure. This aligns with the clustering objective, as nodes within dense subgraphs (putative clusters) should ideally have similar representations. The adaptive selection mechanism is motivated by the fact that excessive smoothing ($k \to \infty$) would make all node features converge to essentially the same value (related to the graph's principal eigenvector), destroying cluster structure.
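
The intuition behind Theorem 1 can be sketched in the graph Fourier basis. Writing $f = \sum_i a_i u_i$ in the eigenbasis $\{u_i\}$ of $L_s$ with eigenvalues $\lambda_i$, the filtered signal is $\bar{f} = Gf = \sum_i p(\lambda_i)\, a_i u_i$, so

$$\Omega\left(\frac{\bar{f}}{\|\bar{f}\|_2}\right) = \frac{\sum_i p(\lambda_i)^2 a_i^2 \lambda_i}{\sum_i p(\lambda_i)^2 a_i^2} \;\le\; \frac{\sum_i a_i^2 \lambda_i}{\sum_i a_i^2} = \Omega\left(\frac{f}{\|f\|_2}\right),$$

because a non-negative, non-increasing $p$ shifts relative spectral weight toward the low-frequency (small $\lambda_i$) components.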

Experimental Validation

AGC was evaluated on four standard benchmark datasets: Cora, Citeseer, Pubmed, and Wiki.

Performance Comparison: AGC was compared against various baselines:

  • Feature-only methods (k-means, spectral clustering on features).
  • Structure-only methods (spectral clustering on graph, DeepWalk, DNGR).
  • Attributed graph clustering methods (GAE, VGAE, MGAE, ARGE, ARVGE).

Results showed that AGC consistently achieved state-of-the-art or highly competitive performance across all datasets using standard metrics (Accuracy - Acc, Normalized Mutual Information - NMI, F1-score). Notably, AGC demonstrated significant improvements over GAE/VGAE and ARGE/ARVGE on Cora, Citeseer, and Pubmed. The paper attributes this to AGC's ability to leverage higher-order structural information via the adaptive k-order convolution, whereas baseline GCN-based methods typically rely on fixed 2 or 3-layer architectures (equivalent to 2 or 3-hop information aggregation).

Validation of Adaptive k: Experiments confirmed the effectiveness of the adaptive selection strategy. Plots showed that the automatically selected order $k^*$ (where the intra-cluster distance first increases) closely corresponded to the order yielding optimal or near-optimal clustering metrics (Acc, NMI, F1). The optimal order varied significantly across datasets (e.g., $k^*=12$ for Cora, $k^*=55$ for Citeseer, $k^*=60$ for Pubmed, $k^*=8$ for Wiki), underscoring the necessity of the adaptive approach rather than a fixed $k$.

Efficiency and Stability: AGC exhibited low variance across multiple runs. Computationally, it avoids the parameter-training overhead of deep learning models. The primary costs are the $k$ sparse matrix multiplications for feature smoothing (or dense matrix multiplications, depending on the implementation) and the spectral clustering step.

Implementation Considerations

Implementing AGC involves several key steps:

  1. Laplacian Computation: Calculate the symmetrically normalized Laplacian $L_s = I - D^{-1/2} A D^{-1/2}$. This requires computing the degree matrix $D$ from the adjacency matrix $A$. Care must be taken with sparse matrix representations for efficiency, especially for large graphs.
  2. k-Order Convolution: Implement the iterative update $\bar{X}^{(k)} = (I - 0.5 L_s)\, \bar{X}^{(k-1)}$, i.e., $k$ steps of feature propagation/smoothing. Using sparse matrix multiplication libraries (e.g., scipy.sparse in Python) is crucial for scalability. The complexity per iteration is roughly proportional to the number of edges (times the feature dimension $d$) if $A$ is sparse, or $O(N^2 d)$ for dense matrix multiplication. The total cost of this stage is $k$ times the per-iteration cost.
  3. Similarity Matrix: Compute $K = \bar{X}^{(k)} (\bar{X}^{(k)})^T$, which can be computationally intensive ($O(N^2 d)$), then compute $W = 0.5\,(|K| + |K^T|)$.
  4. Spectral Clustering: Apply spectral clustering to $W$. Standard implementations involve computing the top $C$ eigenvectors of the Laplacian derived from $W$, which typically takes $O(N^3)$ time for dense eigendecomposition, although faster methods exist (e.g., iterative solvers such as LOBPCG when only a few eigenvectors are needed, potentially reducing the cost toward $O(N^2)$ or less depending on sparsity and solver efficiency).
  5. Adaptive Loop: Enclose steps 2-4 within the loop described in Algorithm 1, calculating the intra-cluster distance at each step to find the optimal order $k^*$. The maximum value of $k$ needs consideration; the paper suggests stopping based on the first increase of the intra-cluster distance or on a maximum iteration count.

The overall complexity is significantly influenced by the selected order $k^*$, the graph size $N$, the feature dimension $d$, and the efficiency of the sparse matrix operations and the spectral clustering implementation. For very large $k^*$ or dense graphs, the computation can become substantial.
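
Putting the pieces together, a hypothetical end-to-end call on a toy graph might look as follows; the adjacency matrix and random features are purely illustrative, and loading a real dataset such as Cora is left out.

```python
import numpy as np
import scipy.sparse as sp

# Two loosely connected triangle-like groups; features are random placeholders.
A = sp.csr_matrix(np.array([[0, 1, 1, 0, 0, 0],
                            [1, 0, 1, 0, 0, 0],
                            [1, 1, 0, 1, 0, 0],
                            [0, 0, 1, 0, 1, 1],
                            [0, 0, 0, 1, 0, 1],
                            [0, 0, 0, 1, 1, 0]], dtype=float))
X = np.random.rand(6, 16)

labels, k_star = agc(A, X, n_clusters=2, max_k=10)
print("selected order k*:", k_star)
print("cluster labels:", labels)
```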

Conclusion

The AGC method provides an effective approach for attributed graph clustering by using a theoretically grounded low-pass graph filter to smooth node features over potentially high-order neighborhoods. Its adaptive mechanism for selecting the convolution order $k$ tailors the degree of smoothing to the characteristics of each graph, avoiding the limitations of fixed-order methods and demonstrating strong empirical performance on benchmark datasets. Decoupling feature smoothing from complex neural network training offers a potentially simpler and more efficient alternative for combining structural and attribute information for clustering.