How Attentive Are GATs? (GAT v2)
Key Ideas
- The ranking of attention scores in GATv1 is unconditioned on the query node: every query induces the same ordering over the keys.
- Because of this, some simple graph problems cannot be expressed by GAT.
Introduction
- Proves that GAT computes only a static form of attention, not a dynamic one.
- Static means that for every query node the scoring is monotonic in a fixed ranking of the keys, so the same key attains the maximum score regardless of the query.
Preliminaries
Generic GNN layer: $h_i' = f\left(h_i,\ \mathrm{AGGREGATE}\left(\{\, h_j \mid j \in \mathcal{N}_i \,\}\right)\right)$
GAT attention scoring: $e(h_i, h_j) = \mathrm{LeakyReLU}\left(a^\top \cdot \left[W h_i \,\|\, W h_j\right]\right)$
GAT attention function: $\alpha_{ij} = \mathrm{softmax}_j\left(e(h_i, h_j)\right) = \frac{\exp\left(e(h_i, h_j)\right)}{\sum_{j' \in \mathcal{N}_i} \exp\left(e(h_i, h_{j'})\right)}$
GAT layer: $h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \cdot W h_j\right)$
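The equations above can be sketched in plain numpy. This is a minimal single-head GATv1 forward pass with toy sizes and random weights (all hypothetical names: `gat_layer`, `neighbors`, the dimensions), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (assumed): 4 nodes, input dim 3, output dim 2.
n, d_in, d_out = 4, 3, 2
H = rng.normal(size=(n, d_in))        # node features h_i
W = rng.normal(size=(d_in, d_out))    # shared linear transform
a = rng.normal(size=(2 * d_out,))     # learned attention vector

def gat_layer(H, W, a, neighbors):
    """One GATv1 layer (single head, no bias, output nonlinearity omitted)."""
    Z = H @ W
    out = np.zeros_like(Z)
    for i, nbrs in enumerate(neighbors):
        # e(h_i, h_j) = LeakyReLU(a^T [W h_i || W h_j])
        scores = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                           for j in nbrs])
        alpha = softmax(scores)       # alpha_ij: normalize over j in N_i
        out[i] = sum(w * Z[j] for w, j in zip(alpha, nbrs))
    return out

neighbors = [[0, 1, 2], [0, 1, 3], [1, 2, 3], [0, 2, 3]]
H_out = gat_layer(H, W, a, neighbors)
```

The per-node loop keeps the correspondence with the equations explicit; a real implementation would batch the score computation.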
Static vs Dynamic & Limited Expressivity of GAT
Static Attention
- A family of functions scoring key vectors against query vectors computes static attention if, for every $f$ in the family, there is a single key that attains the highest score for every query.
- This is limiting: regardless of the query, every such function always selects the same key.
- BUT in real problems different keys have different relevance to different queries. How can this be expressed?
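The static limitation can be demonstrated numerically. In GATv1 the pre-activation score splits into $a_1^\top W h_i + a_2^\top W h_j$, so the key term never interacts with the query; the sketch below (random toy data, hypothetical names) checks that every query then picks the same key:

```python
import numpy as np

rng = np.random.default_rng(1)

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

n, d = 6, 4
Z = rng.normal(size=(n, d))   # W h_j for each node (already transformed)
a1 = rng.normal(size=d)       # first half of the attention vector a
a2 = rng.normal(size=d)       # second half

# GATv1 score: LeakyReLU(a1 . Wh_i + a2 . Wh_j).
# scores[i, j] is the score of key j for query i.
scores = leaky_relu((Z @ a1)[:, None] + (Z @ a2)[None, :])

# LeakyReLU is strictly increasing and the query term a1 . Wh_i is
# constant across keys for a fixed query, so every row has the same argmax.
best_key = scores.argmax(axis=1)
assert len(set(best_key.tolist())) == 1
```

This holds for any choice of weights, which is exactly the static-attention property.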
Dynamic Attention
- Every key can receive a different score depending on the query node. Note that a family computing dynamic attention can realize static attention as a special case (a constant query-to-key mapping), so static families are strictly less expressive.
Need for a New Scoring Function
- Inside the LeakyReLU, the GAT score decomposes as $a_1^\top W h_i + a_2^\top W h_j$, so there exists a $j_{\max}$ for which $a_2^\top W h_j$ is maximal over all $j \in \mathcal{V}$.
- In that case, due to the monotonicity of LeakyReLU (and of softmax), the node $j_{\max}$ attains the maximal attention weight for every query node $i$.
- To avoid this, GATv2 applies the learned attention vector after the LeakyReLU non-linearity: $e(h_i, h_j) = a^\top \mathrm{LeakyReLU}\left(W \cdot \left[h_i \,\|\, h_j\right]\right)$.
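The reordered scoring can be sketched the same way (toy sizes and random weights, hypothetical names). With the non-linearity between $W$ and $a$, the score no longer separates into a query-only plus a key-only term, so the top key can vary with the query:

```python
import numpy as np

rng = np.random.default_rng(2)

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

n, d_in, d = 6, 4, 5
H = rng.normal(size=(n, d_in))
W = rng.normal(size=(2 * d_in, d))   # applied to the concatenated pair
a = rng.normal(size=d)

def gatv2_score(hi, hj):
    # GATv2: e(h_i, h_j) = a^T LeakyReLU(W [h_i || h_j]).
    # The non-linearity sits between W and a, so the key term is
    # entangled with the query term rather than added to it.
    return a @ leaky_relu(np.concatenate([hi, hj]) @ W)

scores = np.array([[gatv2_score(H[i], H[j]) for j in range(n)]
                   for i in range(n)])
# Unlike GATv1, different rows (queries) are free to have different argmax
# keys; nothing in the functional form forces a single global winner.
best_key = scores.argmax(axis=1)
```

No assertion on `best_key` varying is made here: a particular random draw may still yield a shared winner; the point is that the form no longer forces one.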
Evaluation
- Certain problems, such as the dictionary lookup problem, cannot be learned with static attention (proven in the paper's appendix).
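A toy sketch of why (not the paper's proof): in a lookup task each query must attend to its own key, but static hard attention returns the same key to every query, capping accuracy at $1/n$. All names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 5
target = np.arange(n)               # query i must retrieve key i

# Static attention fixes one query-independent key ranking, hence one
# top key j_max that every query selects under hard attention.
key_scores = rng.normal(size=n)
j_max = key_scores.argmax()
retrieved = np.full(n, j_max)       # every query gets the same key

# Only the single query whose target happens to be j_max is correct.
accuracy = (retrieved == target).mean()
```

Dynamic attention, by contrast, can realize the identity mapping from queries to keys and solve the task exactly.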