
DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs


Key Ideas

Introduction

Background: Mini-Batch Training

Diagram of mini-batch training: sampling target nodes and their K-hop neighborhoods for GNN message passing.
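To make the sampling step concrete, here is a toy sketch (plain Python, not DGL's actual API) of K-hop neighborhood sampling with a per-hop fanout limit; the graph, function name, and parameters are illustrative assumptions:

```python
import random

# Toy adjacency list standing in for the full input graph (hypothetical data).
graph = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 4],
    3: [0, 4],
    4: [2, 3],
}

def sample_k_hop(graph, seeds, k, fanout, rng=random.Random(0)):
    """Sample a K-hop neighborhood around the seed (target) nodes.

    At each hop, keep at most `fanout` randomly chosen neighbors per
    frontier node, which bounds mini-batch size regardless of degree.
    """
    frontier = set(seeds)
    sampled = set(seeds)
    for _ in range(k):
        next_frontier = set()
        for node in frontier:
            neighbors = graph[node]
            chosen = rng.sample(neighbors, min(fanout, len(neighbors)))
            next_frontier.update(chosen)
        sampled |= next_frontier
        frontier = next_frontier
    return sampled

block = sample_k_hop(graph, seeds=[0], k=2, fanout=2)
```

The fanout cap is what keeps mini-batches tractable: without it, a 2-hop neighborhood of a high-degree node could pull in a large fraction of the graph.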

System Design: DistDGL

Samplers

Sample the mini-batch graph structures (seed nodes plus their sampled neighborhoods) from the input graph.

KVStore

Distributed key-value store holding all vertex and edge features, so a trainer can fetch the data for any node or edge in its mini-batch regardless of which machine owns it.
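A minimal single-process sketch of the idea (not DistDGL's real KVStore interface; class and method names are assumptions): features are sharded by a partition map, and one `pull` call hides whether the data is local or remote.

```python
class PartitionedKVStore:
    """Toy sketch: vertex/edge features sharded across machines by a
    partition map, behind a single location-transparent lookup API."""

    def __init__(self, num_parts, partition_of):
        self.shards = [dict() for _ in range(num_parts)]
        self.partition_of = partition_of  # node id -> partition id

    def push(self, node_id, feature):
        # Store the feature on the shard (machine) that owns the node.
        self.shards[self.partition_of[node_id]][node_id] = feature

    def pull(self, node_ids):
        # Gather features for a mini-batch; in the real system, shards
        # on other machines would be fetched over the network.
        return {n: self.shards[self.partition_of[n]][n] for n in node_ids}

store = PartitionedKVStore(num_parts=2, partition_of={0: 0, 1: 0, 2: 1, 3: 1})
store.push(0, [0.1, 0.2])
store.push(2, [0.3, 0.4])
batch_feats = store.pull([0, 2])
```

Because the partition map co-locates each node's features with the machine that owns its graph structure, most `pull` calls resolve locally when the graph is partitioned well.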

Trainers

  1. Fetch the mini-batch graphs from the samplers and corresponding vertex/edge features from KVStore.
  2. Run forward/backward in parallel.
  3. Synchronize the dense model parameters across trainers after each backward pass (e.g., via gradient all-reduce).

DistDGL system architecture: Samplers, KVStore, and Trainers working together across distributed machines.
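The trainer loop above can be sketched as follows. This is a pure-Python stand-in for what a gradient all-reduce computes, with a hypothetical `grad_fn` and learning rate; it is not DistDGL's training code:

```python
def allreduce_mean(grads_per_trainer):
    """Average gradients element-wise across trainers: the quantity an
    all-reduce produces for the synchronized dense-model update."""
    n = len(grads_per_trainer)
    return [sum(g) / n for g in zip(*grads_per_trainer)]

def train_step(weights, local_batches, grad_fn, lr=0.1):
    # Step 2: each trainer runs forward/backward on its own mini-batch.
    local_grads = [grad_fn(weights, batch) for batch in local_batches]
    # Step 3: dense parameters are updated with the averaged gradient,
    # so every trainer ends the step with identical weights.
    avg = allreduce_mean(local_grads)
    return [w - lr * g for w, g in zip(weights, avg)]

# Toy model: scalar weight, squared-error loss (w - batch)^2 per trainer.
grad_fn = lambda w, batch: [2 * (w[0] - batch)]
new_weights = train_step([1.0], local_batches=[0.0, 2.0], grad_fn=grad_fn)
```

With symmetric batches (0.0 and 2.0 around the weight 1.0), the averaged gradient is zero and the weights are unchanged, which illustrates that each trainer applies the same global update rather than its local one.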

Graph Partitioning

DistDGL partitions the input graph with METIS:

Graph partitioning with METIS: densely connected subgraphs assigned to the same machine partition to minimize cross-machine communication.
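The quantity METIS minimizes is the edge cut: the number of edges crossing partition boundaries, each of which forces cross-machine communication during sampling and feature fetches. A toy illustration (the graph and partition maps are made up; real use would call METIS itself):

```python
def edge_cut(edges, part_of):
    """Count edges whose endpoints land in different partitions; a
    min-cut partitioner like METIS drives this number down so most
    neighbor lookups stay on the local machine."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])

# Two dense triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Keeping each triangle on one machine cuts only the bridge edge...
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
# ...while an arbitrary split cuts many edges.
bad = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
```

Here `edge_cut(edges, good)` is 1 versus 5 for the arbitrary split, which is exactly why densely connected subgraphs are assigned to the same partition.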