How Attentive Are GATs? (GAT v2)
Key Ideas
- The ranking of attention scores in GATv1 is unconditioned on the query node: every query induces the same ordering over the keys.
- Because of this, some simple graph problems cannot be expressed by GAT.
Introduction
- Proves that GAT computes only a static form of attention, not a dynamic one.
- Static means that for every query node the scoring is monotonic in a fixed ranking of the keys, so the same key attains the maximum score regardless of the query.
Preliminaries
Generic GNN layer: $h_i' = f\left(h_i,\ \mathrm{AGGREGATE}\left(\{\, h_j \mid j \in \mathcal{N}_i \,\}\right)\right)$
GAT attention scoring: $e(h_i, h_j) = \mathrm{LeakyReLU}\left(a^\top \cdot \left[W h_i \,\|\, W h_j\right]\right)$
GAT attention function: $\alpha_{ij} = \mathrm{softmax}_j\left(e(h_i, h_j)\right) = \frac{\exp\left(e(h_i, h_j)\right)}{\sum_{j' \in \mathcal{N}_i} \exp\left(e(h_i, h_{j'})\right)}$
GAT layer: $h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \cdot W h_j\right)$
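The equations above can be sketched in plain numpy. This is a minimal single-head GATv1 forward pass with toy sizes and random weights (all hypothetical names: `gat_layer`, `neighbors`, the dimensions), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (assumed): 4 nodes, input dim 3, output dim 2.
n, d_in, d_out = 4, 3, 2
H = rng.normal(size=(n, d_in))        # node features h_i
W = rng.normal(size=(d_in, d_out))    # shared linear transform
a = rng.normal(size=(2 * d_out,))     # learned attention vector

def gat_layer(H, W, a, neighbors):
    """One GATv1 layer (single head, no bias, output nonlinearity omitted)."""
    Z = H @ W
    out = np.zeros_like(Z)
    for i, nbrs in enumerate(neighbors):
        # e(h_i, h_j) = LeakyReLU(a^T [W h_i || W h_j])
        scores = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                           for j in nbrs])
        alpha = softmax(scores)       # alpha_ij: normalize over j in N_i
        out[i] = sum(w * Z[j] for w, j in zip(alpha, nbrs))
    return out

neighbors = [[0, 1, 2], [0, 1, 3], [1, 2, 3], [0, 2, 3]]
H_out = gat_layer(H, W, a, neighbors)
```

The per-node loop keeps the correspondence with the equations explicit; a real implementation would batch the score computation.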
Static vs Dynamic & Limited Expressivity of GAT
Static Attention
- A family of functions scoring key vectors against query vectors computes static attention if, for every $f$ in the family, there is a single key that attains the highest score for every query.
- This is limiting: regardless of the query, every such function always selects the same key.
- BUT in real problems different keys have different relevance to different queries. How can this be expressed?
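The static limitation can be demonstrated numerically. In GATv1 the pre-activation score splits into $a_1^\top W h_i + a_2^\top W h_j$, so the key term never interacts with the query; the sketch below (random toy data, hypothetical names) checks that every query then picks the same key:

```python
import numpy as np

rng = np.random.default_rng(1)

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

n, d = 6, 4
Z = rng.normal(size=(n, d))   # W h_j for each node (already transformed)
a1 = rng.normal(size=d)       # first half of the attention vector a
a2 = rng.normal(size=d)       # second half

# GATv1 score: LeakyReLU(a1 . Wh_i + a2 . Wh_j).
# scores[i, j] is the score of key j for query i.
scores = leaky_relu((Z @ a1)[:, None] + (Z @ a2)[None, :])

# LeakyReLU is strictly increasing and the query term a1 . Wh_i is
# constant across keys for a fixed query, so every row has the same argmax.
best_key = scores.argmax(axis=1)
assert len(set(best_key.tolist())) == 1
```

This holds for any choice of weights, which is exactly the static-attention property.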
Dynamic Attention
- Every key can receive a different score depending on the query node. Note that a family computing dynamic attention can realize static attention as a special case (a constant query-to-key mapping), so static families are strictly less expressive.
Need for a New Scoring Function
- Inside the LeakyReLU, the GAT score decomposes as $a_1^\top W h_i + a_2^\top W h_j$, so there exists a $j_{\max}$ for which $a_2^\top W h_j$ is maximal over all $j \in \mathcal{V}$.
- In that case, due to the monotonicity of LeakyReLU (and of softmax), the node $j_{\max}$ attains the maximal attention weight for every query node $i$.
- To avoid this, GATv2 applies the learned attention vector after the LeakyReLU non-linearity: $e(h_i, h_j) = a^\top \mathrm{LeakyReLU}\left(W \cdot \left[h_i \,\|\, h_j\right]\right)$.
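The reordered scoring can be sketched the same way (toy sizes and random weights, hypothetical names). With the non-linearity between $W$ and $a$, the score no longer separates into a query-only plus a key-only term, so the top key can vary with the query:

```python
import numpy as np

rng = np.random.default_rng(2)

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

n, d_in, d = 6, 4, 5
H = rng.normal(size=(n, d_in))
W = rng.normal(size=(2 * d_in, d))   # applied to the concatenated pair
a = rng.normal(size=d)

def gatv2_score(hi, hj):
    # GATv2: e(h_i, h_j) = a^T LeakyReLU(W [h_i || h_j]).
    # The non-linearity sits between W and a, so the key term is
    # entangled with the query term rather than added to it.
    return a @ leaky_relu(np.concatenate([hi, hj]) @ W)

scores = np.array([[gatv2_score(H[i], H[j]) for j in range(n)]
                   for i in range(n)])
# Unlike GATv1, different rows (queries) are free to have different argmax
# keys; nothing in the functional form forces a single global winner.
best_key = scores.argmax(axis=1)
```

No assertion on `best_key` varying is made here: a particular random draw may still yield a shared winner; the point is that the form no longer forces one.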
Evaluation
- Certain problems, such as the dictionary lookup problem, cannot be learned with static attention (proven in the paper's appendix).
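A toy sketch of why (not the paper's proof): in a lookup task each query must attend to its own key, but static hard attention returns the same key to every query, capping accuracy at $1/n$. All names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 5
target = np.arange(n)               # query i must retrieve key i

# Static attention fixes one query-independent key ranking, hence one
# top key j_max that every query selects under hard attention.
key_scores = rng.normal(size=n)
j_max = key_scores.argmax()
retrieved = np.full(n, j_max)       # every query gets the same key

# Only the single query whose target happens to be j_max is correct.
accuracy = (retrieved == target).mean()
```

Dynamic attention, by contrast, can realize the identity mapping from queries to keys and solve the task exactly.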