The Transformer architecture has become a dominant choice in NLP and CV, so why not in graphs?
Solution: Graphormer, which uses special positional encodings designed for graphs.
A problem not emphasized in the paper: the graph "positional encoding" requires computing the shortest-path distance between every pair of nodes.
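To make the preprocessing cost concrete, here is a minimal sketch (assuming an unweighted graph given as an adjacency list) of computing all-pairs shortest-path distances by running BFS from every node, which is the quantity the spatial encoding consumes:

```python
from collections import deque

def all_pairs_shortest_paths(adj):
    """BFS from every node of an unweighted graph (adjacency-list form).
    Returns dist[i][j] = shortest-path hop count, or -1 if unreachable."""
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[src][v] == -1:  # first visit = shortest distance
                    dist[src][v] = dist[src][u] + 1
                    q.append(v)
    return dist

# A 4-node path graph: 0 - 1 - 2 - 3
adj = [[1], [0, 2], [1, 3], [2]]
dist = all_pairs_shortest_paths(adj)
print(dist[0])  # [0, 1, 2, 3]
```

This is O(n·(n+m)) per graph; for the small molecular graphs in the paper's benchmarks that cost is negligible, but it can matter for large graphs.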
Introduction
The Transformer is among the most powerful neural networks for modeling sequential data such as speech and natural language.
It's an open question whether Transformers are suitable to model graphs — Graphormer seems to be an affirmative answer.
Graphormer performs well on most chemistry leaderboards and benchmarks.
Centrality encoding captures node importance in the graph.
Spatial encoding captures the structural relation between nodes: for each node pair, a learnable bias is assigned based on their spatial relation.
Diagram showing the Graphormer architecture with centrality and spatial encoding.
Preliminary
GNNs aim to learn representations of nodes and graphs.
Modern GNNs follow a learning schema that updates the representation of a node by aggregating representations of its neighbors.
Aggregate-combine step:
$$a_i^{(l)} = \mathrm{AGGREGATE}^{(l)}\big(\{h_j^{(l-1)} : v_j \in \mathcal{N}(v_i)\}\big), \qquad h_i^{(l)} = \mathrm{COMBINE}^{(l)}\big(h_i^{(l-1)}, a_i^{(l)}\big)$$
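A minimal numpy sketch of one aggregate-combine step, assuming mean aggregation and a linear-plus-ReLU combine (the paper leaves AGGREGATE/COMBINE abstract; these are illustrative choices):

```python
import numpy as np

def gnn_layer(H, adj, W_self, W_neigh):
    """One aggregate-combine step (sketch).
    AGGREGATE: mean of neighbor representations.
    COMBINE:   ReLU of a linear map of self + aggregated features."""
    agg = np.zeros_like(H)
    for i, neighbors in enumerate(adj):
        if neighbors:
            agg[i] = H[neighbors].mean(axis=0)  # a_i = mean of h_j, v_j in N(v_i)
    return np.maximum(0.0, H @ W_self + agg @ W_neigh)  # h_i^{(l)}

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))               # 4 nodes, 8-dim features
adj = [[1], [0, 2], [1, 3], [2]]          # path graph 0 - 1 - 2 - 3
H1 = gnn_layer(H, adj, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(H1.shape)  # (4, 8)
```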
The task: graph classification.
Centrality Encoding
A learnable embedding is added to each node's features based on its indegree and outdegree.
$$h_i^{(0)} = x_i + z^-_{\deg^-(v_i)} + z^+_{\deg^+(v_i)}$$
where $z^-$ and $z^+$ are learnable embedding vectors indexed by indegree and outdegree, respectively.
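A numpy sketch of the centrality encoding; the embedding tables `Z_in` and `Z_out` stand in for learnable parameters (random placeholders here; in training they would be trained embedding matrices):

```python
import numpy as np

def centrality_encoding(X, indeg, outdeg, Z_in, Z_out):
    """h_i^{(0)} = x_i + z^-_{deg^-(v_i)} + z^+_{deg^+(v_i)}.
    Z_in / Z_out: embedding tables of shape (max_degree, d),
    indexed by each node's in/out degree."""
    return X + Z_in[indeg] + Z_out[outdeg]

rng = np.random.default_rng(0)
n, d, max_deg = 4, 8, 16
X = rng.normal(size=(n, d))
Z_in = rng.normal(size=(max_deg, d))
Z_out = rng.normal(size=(max_deg, d))
indeg = np.array([1, 2, 2, 1])   # degrees of the 4-node path graph
outdeg = np.array([1, 2, 2, 1])  # undirected graph: indegree == outdegree
H0 = centrality_encoding(X, indeg, outdeg, Z_in, Z_out)
```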
Spatial Encoding
A bias term specific to the shortest-path distance between $v_i$ and $v_j$ is added to the attention scores.
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{\top}}{\sqrt{d}} + b_{\phi(v_i, v_j)}$$
where $\phi(v_i, v_j)$ is the shortest-path distance and $b$ is a learnable scalar shared across layers for each distance value.
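A numpy sketch of a single attention head with the spatial-encoding bias added to the logits before softmax; the bias table `b` is a random placeholder for the learnable per-distance scalars:

```python
import numpy as np

def attention_with_spatial_bias(H, W_Q, W_K, spd, b):
    """A_ij = (h_i W_Q)(h_j W_K)^T / sqrt(d) + b[phi(v_i, v_j)],
    followed by a row-wise softmax. spd[i, j] is the shortest-path
    distance; b is a learnable scalar per distance (placeholder here)."""
    d = W_Q.shape[1]
    Q, K = H @ W_Q, H @ W_K
    logits = Q @ K.T / np.sqrt(d) + b[spd]          # add spatial bias
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)          # row-wise softmax

rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.normal(size=(n, d))
spd = np.array([[0, 1, 2, 3],
                [1, 0, 1, 2],
                [2, 1, 0, 1],
                [3, 2, 1, 0]])                       # path graph distances
b = rng.normal(size=(spd.max() + 1,))
attn = attention_with_spatial_bias(H, rng.normal(size=(d, d)),
                                   rng.normal(size=(d, d)), spd, b)
```

Because the bias depends only on $\phi$, every node attends over the whole graph while still receiving a structural signal, unlike message passing, which is restricted to one-hop neighbors per layer.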
Special Node
A special node [VNode] is added to the graph and connected to every other node (distance of 1). The point is that the representation of the entire graph, $h_G$, is read out as the node feature of [VNode].