
Aligning Transformers with Weisfeiler-Leman

(2406.03148)
Published Jun 5, 2024 in cs.LG

Abstract

Graph neural network architectures aligned with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the $k$-WL hierarchy have shown promising empirical results, employing transformers for higher orders of $k$ remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the $k$-WL hierarchy, showing stronger expressivity results for each $k$, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets. Our code is available at https://github.com/luis-mueller/wl-transformers.

Overview

  • The paper demonstrates how aligning transformer architectures with the Weisfeiler-Leman (WL) hierarchy can enhance their expressive capabilities in graph-based tasks.

  • The authors introduce a theoretical framework for positional encodings that ensures effective utilization in transformers and demonstrate practical feasibility through optimizations and empirical testing using multiple datasets.

  • The empirical results are competitive with the state of the art, matching or exceeding existing pure transformer models, and lay the groundwork for future research in transformer expressivity and real-world applications.

Aligning Transformers with Weisfeiler--Leman: A Detailed Examination

The paper "Aligning Transformers with Weisfeiler--Leman" by Luis Müller and Christopher Morris presents a comprehensive study on boosting the expressivity of graph transformers by leveraging the Weisfeiler--Leman (WL) hierarchy. Graph neural networks (GNNs) have benefited extensively from alignments with the WL hierarchy, known for theoretically superior expressive power. However, real-world application performance often disappoints due to constraints in current GNN architectures and their scalability. This paper ambitiously attempts to close the theory-practice gap by exploring new territory in graph transformer expressivity.

Motivation and Contributions

While GNNs aligned with the WL hierarchy exhibit enhanced theoretical capabilities, they falter in practice due to runtime and memory complexity, especially when scaling to higher WL orders. In contrast, graph transformers (GTs) such as Graphormer have recently outperformed GNNs empirically but lack a clear theoretical grounding in the WL hierarchy. Bridging this gap, the authors construct a hierarchy of "pure" transformers with theoretically guaranteed expressivity improvements.

The work presents key contributions:

  1. Enhanced Transformer Alignment: Demonstrates a stronger alignment of transformer architectures with the WL hierarchy across orders $k$, from the baseline $k$-WL up to the more expressive $\delta$-$k$-WL variants.
  2. Theoretical Framework for Positional Encodings: Develops a framework for established positional encodings, ensuring that these transformers can effectively utilize Laplacian positional encodings (PEs) and SPE (see the sketch after this list).
  3. Practical Feasibility: Introduces optimizations to ensure the practical application of these theoretical frameworks in real-world tasks, tested across multiple datasets.
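
For concreteness, the following is a minimal sketch of Laplacian eigenvector PEs in Python/NumPy; the normalization, eigenvector selection, zero-padding, and the function name `laplacian_pe` are common conventions assumed here for illustration, not necessarily the choices made in the paper.

```python
import numpy as np

def laplacian_pe(adj: np.ndarray, dim: int) -> np.ndarray:
    """Illustrative Laplacian eigenvector positional encodings.

    adj: dense (n, n) adjacency matrix of an undirected graph.
    dim: number of eigenvectors to keep (zero-padded for small graphs).
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    # Symmetrically normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    # Eigenvectors sorted by ascending eigenvalue; skip the trivial first one.
    eigvals, eigvecs = np.linalg.eigh(lap)
    pe = eigvecs[:, 1:dim + 1]
    # Pad with zeros if the graph has fewer than dim + 1 nodes.
    if pe.shape[1] < dim:
        pe = np.pad(pe, ((0, 0), (0, dim - pe.shape[1])))
    return pe

# Example: a 4-cycle.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
print(laplacian_pe(adj, dim=2).shape)  # (4, 2)
```

Such eigenvector encodings are only defined up to sign (and basis) choices, which is one reason a theoretical account of how transformers consume encodings like Laplacian PEs and SPE is needed in the first place.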

Theoretical Foundations

The Weisfeiler--Leman algorithm, particularly its multidimensional forms ($k$-WL), is pivotal in graph isomorphism testing. GNN architectures such as $\delta$-$k$-GNNs and IGNs leverage these alignments for enhanced expressivity. The $k$-WL algorithm's ability to differentiate between non-isomorphic graph structures underpins its utility for GNN expressivity.
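
To make the underlying algorithm concrete, here is a minimal sketch of 1-WL color refinement (the one-dimensional case) in plain Python; the uniform initial coloring, the signature compression, the stopping criterion, and the function name `wl_1` are standard but simplified choices of ours.

```python
from typing import Dict, List

def wl_1(adjacency: Dict[int, List[int]], num_iters: int = 10) -> Dict[int, int]:
    """Simplified 1-WL color refinement on an undirected graph.

    adjacency: node -> list of neighbors.
    Returns a stable coloring (node -> integer color).
    """
    # Start with a uniform coloring (or with node labels, if available).
    colors = {v: 0 for v in adjacency}
    for _ in range(num_iters):
        # New color = own color plus the multiset of neighbor colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        # Compress signatures into fresh integer colors.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adjacency}
        if new_colors == colors:  # Stable coloring: no further refinement.
            break
        colors = new_colors
    return colors

# Two graphs can be isomorphic only if their color histograms match.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(sorted(wl_1(triangle).values()), sorted(wl_1(path).values()))
```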

Müller and Morris take this further by:

  • Base case (1-WL): Introducing a transformer capable of simulating the one-dimensional Weisfeiler--Leman algorithm (1-WL) using adjacency-identifying PEs. They prove that such a transformer can simulate a GNN known to be equivalent to the 1-WL.
  • Higher Order: They extend the analysis to higher-dimensional cases, i.e., order-$k$ transformers aligned with $k$-WL. They establish that for each $k$ there is a corresponding pure transformer whose theoretical expressivity grows with $k$ (see the sketch below for an illustrative order-2 tokenization).
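
To illustrate what a higher-order tokenization can look like, the sketch below builds one token per node pair (the $k = 2$ case) from the two endpoint feature vectors plus a one-hot "atomic type" indicating whether the pair is a self-pair, an edge, or a non-edge; the feature layout and the function name `pair_tokens` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pair_tokens(adj: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Illustrative order-2 tokenization: one token per node pair (u, v).

    adj: (n, n) adjacency matrix, x: (n, d) node features.
    Returns an (n * n, 2 * d + 3) array of token features.
    """
    n, d = x.shape
    tokens = []
    for u in range(n):
        for v in range(n):
            # One-hot "atomic type": u == v, u adjacent to v, or neither.
            atomic = [float(u == v),
                      float(adj[u, v] > 0 and u != v),
                      float(adj[u, v] == 0 and u != v)]
            tokens.append(np.concatenate([x[u], x[v], atomic]))
    return np.stack(tokens)  # feed these n^2 tokens into a standard transformer

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.eye(3)  # toy node features
print(pair_tokens(adj, x).shape)  # (9, 9)
```

Self-attention over these $n^k$ tokens costs $O(n^{2k})$ time and memory, which is the prohibitive complexity the abstract points to for larger $k$ and the reason practical optimizations are needed.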

Their work aligns explicit token embeddings with WL's theoretical framework, encoding node and adjacency information directly in the tokens. This step keeps WL-aligned embeddings feasible without incurring prohibitive memory and computational costs.
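
As a concrete, hypothetical reading of the first-order setup, the sketch below adds an adjacency-derived encoding (here simply a linear projection of each node's adjacency row) to projected node features and feeds the resulting tokens to an off-the-shelf PyTorch transformer encoder; the class name `NodeTokenTransformer`, the particular encoding, and the mean-pool readout are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NodeTokenTransformer(nn.Module):
    """Hypothetical sketch: node tokens with adjacency-derived encodings
    fed to a plain transformer encoder."""

    def __init__(self, feat_dim: int, max_nodes: int, d_model: int = 64,
                 nhead: int = 4, num_layers: int = 3):
        super().__init__()
        # Project node features and each node's adjacency row into the model dimension.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.adj_proj = nn.Linear(max_nodes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.readout = nn.Linear(d_model, 1)  # e.g., a graph-level regression target

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, feat_dim) node features; adj: (batch, n, max_nodes) adjacency rows.
        tokens = self.feat_proj(x) + self.adj_proj(adj)
        h = self.encoder(tokens)            # full self-attention over node tokens
        return self.readout(h.mean(dim=1))  # mean-pool nodes for a graph-level output

# Toy usage: a batch of 2 graphs with 5 nodes each.
model = NodeTokenTransformer(feat_dim=8, max_nodes=5)
x = torch.randn(2, 5, 8)
adj = torch.randint(0, 2, (2, 5, 5)).float()
print(model(x, adj).shape)  # torch.Size([2, 1])
```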

Empirical Results

The authors empirically test their findings using the large-scale PCQM4Mv2 dataset and smaller molecular datasets. They observe the following:

  • Competitive Performance: Their experimental results show competitive or superior performance compared to existing pure transformer models and several GNN baselines.
  • Fine-tuning Benefits: The pre-trained models on large datasets demonstrate pronounced benefits when fine-tuned on smaller task-specific datasets, confirming the practical utility of their theoretical constructs.
  • Expressivity Tests: The transformers' capability to distinguish graph structures aligns well with empirical results on the BREC benchmark, affirming theoretical expressivity claims.

Future Directions and Implications

This paper marks significant progress in understanding and improving the alignment of transformers with the Weisfeiler--Leman hierarchy. Its implications are profound:

  • Theoretical Implications: Establishes a hierarchy of expressively powerful pure transformers, laying the groundwork for future research to explore even more optimized architectures.
  • Practical Impacts: Sparks potential for real-world applications where expressivity and practical feasibility must balance effectively, particularly in molecular and materials science.
  • Future Developments: Further exploration could analyze larger datasets, improve fine-tuning techniques, and reduce computational overheads, potentially by incorporating sparsity-aware methods or more efficient positional encoding strategies.

In conclusion, Müller and Morris have presented a compelling case for the synergistic improvement of transformer architectures by aligning them with the Weisfeiler--Leman hierarchy. Their work not only bridges a critical theory-practice divide but also paves the way for future breakthroughs in graph-based learning architectures.
