DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs
Key Ideas
- Based on the DGL (Deep Graph Library).
- Synchronous training; mini-batches are formed from sampled ego-networks (a target vertex plus its sampled multi-hop neighborhood).
- Min-cut partitioning algorithm.
Introduction
- Training GNNs is challenging because of the inherent dependencies between nodes (vertices): a node's neighbors may reside on a completely different machine.
- Each mini-batch must incorporate the samples a target depends on (1-hop, 2-hop, etc. neighbors), so the sampling algorithm is important.
- As opposed to regular distributed NN training, where the majority of network traffic comes from exchanging gradients, distributed GNN traffic comes from fetching the features of hundreds of neighbor vertices per mini-batch.
Background: Mini-Batch Training
- Sample a set of $N$ vertices, uniformly at random from the training set.
- Pick at most $K$ neighbor vertices (fan-out) for each target.
- Compute the target vertex representations by gathering messages.
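The three steps above can be sketched in plain Python. This is an illustrative stand-in for DGL's neighbor-sampling dataloader, not DistDGL's actual implementation; `adj`, `fanouts`, and the dict-of-lists graph representation are assumptions made for the sketch.

```python
import random

def sample_minibatch(adj, train_nodes, batch_size, fanouts, seed=0):
    """Sample a mini-batch: N target vertices drawn uniformly from the
    training set, plus at most K neighbors per vertex at each hop
    (fanouts[i] is the fan-out K for hop i).

    adj: dict mapping vertex id -> list of neighbor ids (assumed layout).
    Returns the targets and one {vertex: sampled neighbors} dict per hop.
    """
    rng = random.Random(seed)
    targets = rng.sample(train_nodes, batch_size)  # N targets, uniform
    frontier = list(targets)
    blocks = []
    for k in fanouts:
        hop = {}
        for v in frontier:
            nbrs = adj.get(v, [])
            # pick at most K neighbors (the fan-out cap)
            hop[v] = list(nbrs) if len(nbrs) <= k else rng.sample(nbrs, k)
        blocks.append(hop)
        # the sampled neighbors become the next hop's frontier
        frontier = sorted({u for ns in hop.values() for u in ns})
    return targets, blocks
```

Message gathering (the actual GNN aggregation) then runs only over the vertices collected in `blocks`, which is what keeps the per-batch working set small.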
System Design: DistDGL
Samplers
Sample the mini-batch graph structures from the input graph.
KVStore
Distributed store for all vertex and edge data, from which trainers fetch the features needed by each mini-batch.
- A custom KVStore instead of Redis or similar, because node/edge features must be co-located with the graph partitions.
- Efficient updates for sparse embeddings.
- Flexible partition policies to map data to different machines.
- Supports sparse embedding for training transductive models with learnable vertex embeddings.
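The co-location idea can be sketched as a partition-aware pull: resolve each vertex id to its owning partition, read locally when possible, and batch the rest into one request per remote machine. The class and the `remote_fetch` hook are illustrative names for this note, not DistDGL's API.

```python
class PartitionedKVStore:
    """Sketch of a partition-aware feature store: each machine holds the
    features of its own partition's vertices; pull() only goes over the
    network for vertices owned by other partitions."""

    def __init__(self, part_of, local_part, local_feats, remote_fetch):
        self.part_of = part_of            # vertex id -> partition id (the partition policy)
        self.local_part = local_part      # partition hosted on this machine
        self.local_feats = local_feats    # vertex id -> feature, local shard only
        self.remote_fetch = remote_fetch  # callable(part, ids) -> {id: feature}

    def pull(self, vertex_ids):
        out, remote = {}, {}
        for v in vertex_ids:
            p = self.part_of[v]
            if p == self.local_part:
                out[v] = self.local_feats[v]        # fast local read, no network
            else:
                remote.setdefault(p, []).append(v)  # group by owning machine
        for p, ids in remote.items():
            out.update(self.remote_fetch(p, ids))   # one batched request per machine
        return out
```

The better the partitioning co-locates a mini-batch's vertices with the trainer, the smaller the `remote` dict gets, which is exactly why a generic store like Redis (which hashes keys without regard to the graph partitioning) is a poor fit here.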
Trainers
- Fetch the mini-batch graphs from the samplers and corresponding vertex/edge features from KVStore.
- Run forward/backward in parallel.
- Dense model parameters are updated synchronously: gradients are averaged across trainers before each update.
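Synchronous dense updates amount to an all-reduce that averages gradients so every trainer applies the identical step. A pure-Python stand-in (in practice this would be something like `torch.distributed.all_reduce` over a process group, which is an assumption about the stack, not stated in these notes):

```python
def allreduce_mean(grads_per_trainer):
    """Average dense-model gradients element-wise across trainers.
    grads_per_trainer: list of equal-length gradient vectors, one per trainer.
    Returns the averaged gradient every trainer would apply."""
    n = len(grads_per_trainer)
    dim = len(grads_per_trainer[0])
    return [sum(g[i] for g in grads_per_trainer) / n for i in range(dim)]
```

Sparse (embedding) gradients are the exception: only the rows touched by the mini-batch change, so they are pushed to the KVStore instead of all-reduced, which is why the KVStore needs efficient sparse updates.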
Graph Partitioning
Uses METIS:
- Densely connected vertices are assigned to the same partition to reduce the number of edge cuts.
- Minimizing the edge cut improves GNN training performance because fewer neighbor lookups have to go to remote KVStore servers.
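The objective METIS approximately minimizes is easy to state: the fraction of edges whose endpoints land in different partitions. A small helper to measure it for a given assignment (edge list and partition map are assumed representations for this sketch):

```python
def edge_cut(edges, part_of):
    """Fraction of edges crossing partition boundaries.
    edges: iterable of (u, v) pairs; part_of: vertex id -> partition id.
    Every cut edge forces a remote fetch during neighborhood sampling."""
    edges = list(edges)
    cut = sum(1 for u, v in edges if part_of[u] != part_of[v])
    return cut / len(edges)
```

Comparing this number across candidate partitionings gives a quick proxy for how much cross-machine traffic mini-batch sampling will generate.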