Apply different weights to different nodes in the neighborhood without prior knowledge of the graph's structure
Spectral vs non-spectral approaches to graph convolutions
Spectral filters are learned in the eigenbasis of the graph Laplacian, hence they depend on the graph structure (a filter learned on one graph does not transfer to a graph with a different structure)
Non-spectral filters define the convolution directly on the graph, operating on spatially close neighbors
One non-spectral strategy: sample a fixed-size neighborhood of each node and aggregate over it (attention assigns the weights; a self-link lets each node attend to its own features)
Attention Layer
Input: a set of node features $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^F$
Output: a new set of node features $\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}$, $\vec{h}'_i \in \mathbb{R}^{F'}$ (potentially of a different dimension)
At least one learnable linear transformation is required to obtain higher-level features; a shared weight matrix $\mathbf{W} \in \mathbb{R}^{F' \times F}$ is applied to every node.
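As a small NumPy sketch of this step (all sizes here are illustrative, not from the notes), the shared transformation is a single matrix product applied to every node at once:

```python
import numpy as np

# Hypothetical sizes: N nodes, F = 3 input features, F' = 2 output features.
N, F_in, F_out = 4, 3, 2
rng = np.random.default_rng(0)

h = rng.normal(size=(N, F_in))      # input node features h_i (one per row)
W = rng.normal(size=(F_out, F_in))  # shared weight matrix W

# Higher-level features W h_i for every node in one matrix product.
Wh = h @ W.T                        # shape (N, F_out)
```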
Diagram showing the attention coefficient computation: $e_{ij} = a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_j)$
Diagram showing the softmax normalization of attention coefficients: $\alpha_{ij} = \mathrm{softmax}_j(e_{ij})$
Where $e_{ij}$ is the importance of node $j$'s features to node $i$
We only compute it for nodes $j$ in $i$'s neighborhood.
The mechanism $a$ is a single-layer feedforward network with a LeakyReLU nonlinearity: $e_{ij} = \mathrm{LeakyReLU}\big(\vec{a}^{\,T} [\mathbf{W}\vec{h}_i \,\|\, \mathbf{W}\vec{h}_j]\big)$, where $\|$ is concatenation
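A minimal NumPy sketch of the coefficient computation (sizes and data are made up; splitting $\vec{a}$ into two halves is just a way of evaluating $\vec{a}^{\,T}[\mathbf{W}\vec{h}_i \,\|\, \mathbf{W}\vec{h}_j]$ for all pairs without materializing the concatenations):

```python
import numpy as np

def attention_weights(Wh, a, adj, slope=0.2):
    """alpha_ij = softmax_j(LeakyReLU(a^T [Wh_i || Wh_j])) over i's neighbors.

    Wh  : (N, F') transformed node features
    a   : (2F',) attention vector
    adj : (N, N) boolean adjacency, True where j is in i's neighborhood
          (self-loops included so each node attends to itself)
    """
    Fp = Wh.shape[1]
    # a^T [x || y] = a[:F'].x + a[F':].y, so e_ij splits into two terms.
    s = (Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :]
    e = np.where(s > 0, s, slope * s)     # LeakyReLU
    e = np.where(adj, e, -np.inf)         # only neighbors enter the softmax
    e -= e.max(axis=1, keepdims=True)     # numerical stability
    alpha = np.exp(e)                     # exp(-inf) -> 0 for non-neighbors
    return alpha / alpha.sum(axis=1, keepdims=True)

# Tiny example: 3 nodes, edges 0-1 and 1-2, self-loops included.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=bool)
rng = np.random.default_rng(1)
Wh = rng.normal(size=(3, 2))
alpha = attention_weights(Wh, rng.normal(size=4), adj)
```

Each row of `alpha` sums to 1, and entries for non-neighbors are exactly zero.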
To stabilize the self-attention process, the single mechanism is extended to multi-head attention: $K$ independent attention mechanisms are applied and their outputs combined
Diagram showing the multi-head concatenation formula: $\vec{h}'_i = \big\Vert_{k=1}^{K} \sigma\big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} \mathbf{W}^{k} \vec{h}_j\big)$
Diagram showing the multi-head averaging formula for the final layer: $\vec{h}'_i = \sigma\big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} \mathbf{W}^{k} \vec{h}_j\big)$
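The two combination rules (concatenate in hidden layers, average in the final layer) can be sketched as follows. This is a NumPy toy with illustrative sizes; $\sigma$ is taken as ELU throughout for simplicity (a classification output layer would typically use softmax instead):

```python
import numpy as np

def head(h, W, a, adj, slope=0.2):
    """One attention head: sum_j alpha_ij W h_j (sigma applied by the caller)."""
    Wh = h @ W.T
    Fp = Wh.shape[1]
    s = (Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :]
    e = np.where(adj, np.where(s > 0, s, slope * s), -np.inf)
    e -= e.max(axis=1, keepdims=True)
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ Wh

def elu(x):
    return np.where(x > 0, x, np.expm1(x))

def multi_head(h, heads, adj, final_layer=False):
    outs = [head(h, W, a, adj) for W, a in heads]
    if final_layer:
        # Final layer: average the K heads, then apply sigma once.
        return elu(np.mean(outs, axis=0))
    # Hidden layers: apply sigma per head, then concatenate along features.
    return np.concatenate([elu(o) for o in outs], axis=1)

# Illustrative sizes: 4 nodes, F = 3, F' = 2, K = 3 heads.
rng = np.random.default_rng(2)
adj = np.eye(4, dtype=bool) | (rng.random((4, 4)) < 0.5)
h = rng.normal(size=(4, 3))
heads = [(rng.normal(size=(2, 3)), rng.normal(size=4)) for _ in range(3)]
hidden = multi_head(h, heads, adj)                  # shape (4, K * F') = (4, 6)
out = multi_head(h, heads, adj, final_layer=True)   # shape (4, F') = (4, 2)
```

Concatenation multiplies the output width by $K$, which is why the final layer averages instead.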
Highly efficient: the attention computation can be parallelized across edges, and a single head costs $O(|V|FF' + |E|F')$, where $F$ and $F'$ are the input and output feature dimensions.