Training Graph Neural Networks on Large Graphs
Graph neural networks (GNNs) are a powerful tool for applying machine learning techniques to graph data. They allow us to generate or transform features on vertices and edges using deep neural networks to make them useful for a wide range of applications including fraud detection, product recommendation, and drug discovery.
One of the challenges with using GNNs is that real-world graphs such as social networks and financial transaction graphs can become very large, with billions of vertices and trillions of edges. These graphs may not fit on a single machine, as they can easily reach terabyte scale. Training GNNs on such graphs in a reasonable amount of time requires high-performance processors and distributed systems.
To address this challenge, Katana Graph and Intel have partnered to create the world's fastest distributed CPU-based GNN system by utilizing Katana Graph's scale-out graph computation engine and Intel's 4th-generation Xeon CPUs. This system delivers record-breaking performance on very large graphs. In this article, we describe the system’s major components and their contribution to overall performance.
Streaming Partitioner
Training GNNs on large graphs requires partitioning the dataset across a cluster of machines. Many existing systems use offline graph partitioners, a category that includes hierarchical graph partitioners like METIS and spectral graph partitioners that solve eigenvector problems to find the partitions. These partitioners attempt to optimize communication and load balance for sparse matrix-vector products, but the resulting partitions are not necessarily optimized for GNN workloads. Moreover, the overhead of computing the partitions can be substantial.
Katana Graph offers a seamless and efficient partitioner called the Customizable Streaming Partitioner (CuSP) [1], which partitions the graph quickly and flexibly in a streaming manner. Unlike offline partitioners such as METIS, CuSP does not continually refine the owner assignments of vertices and edges, which is a time-consuming process. The resulting partitions may be of lower quality, but in exchange partitioning is significantly faster, especially for very large graphs. CuSP supports arbitrary partitioning policies, including edge-cuts and vertex-cuts, giving users the flexibility to choose the best policy for their particular GNN. Katana Graph also augments CuSP's efficient partitioning with efficient communication of feature-vector properties, which is essential because graphs used with GNNs often have large input feature vectors.
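To give a sense of how a streaming, policy-driven partitioner works, here is a minimal sketch in Python. It is purely illustrative and assumes nothing about CuSP's actual API: vertices are greedily assigned to the least-loaded host as they are first encountered, and a pluggable policy decides which host owns each edge.

from collections import defaultdict

# Hypothetical sketch of streaming, policy-driven partitioning (not CuSP's API).
def edge_cut_policy(src, dst, owner):
    # Edge-cut: an edge is placed with the host that owns its source vertex.
    return owner[src]

def stream_partition(edge_stream, num_hosts, policy=edge_cut_policy):
    # Assign vertices and edges to hosts in a single pass over the edge stream,
    # with no iterative refinement of earlier assignments.
    owner = {}                    # vertex -> owning host
    load = [0] * num_hosts        # number of edges assigned to each host so far
    parts = defaultdict(list)     # host -> edges placed on that host

    for src, dst in edge_stream:
        for v in (src, dst):
            if v not in owner:    # first time we see this vertex: pick the least-loaded host
                owner[v] = min(range(num_hosts), key=lambda h: load[h])
        host = policy(src, dst, owner)
        parts[host].append((src, dst))
        load[host] += 1
    return owner, parts

# Example: partition a tiny edge list across 2 hosts with the edge-cut policy.
owner, parts = stream_partition([(0, 1), (1, 2), (2, 0), (3, 4)], num_hosts=2)

A vertex-cut policy could be swapped in simply by choosing the owning host based on the edge itself rather than on its source vertex's owner.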
Multi-Minibatch Sampling and Feature Fetching
Existing systems for training GNNs on large graphs often sample one minibatch at a time, which misses opportunities for performance optimization through data reuse across minibatches. For example, multiple minibatches may include the same vertices, yet those vertices' features are fetched separately for each minibatch.
Katana Graph addresses this issue with its multi-minibatch sampler, which reduces sampling and property fetching overhead by merging the communication that would occur over multiple minibatches into a single multi-minibatch. This is made possible by leveraging excess memory on CPU machines to fetch the sampled subgraph and features for many minibatches at once instead of just one minibatch at a time.
As an example, consider an epoch with M minibatches. Normally, M minibatches would require M communication relays (i.e., back-and-forth requests for sampling for each GNN layer), in which the L-hop sampled subgraph and its features are requested and fetched from other devices. These relays cause the process responsible for communication to block for request responses and limit the overall execution speed. Instead of performing M relays for M minibatches, Katana Graph does one relay for B minibatches at once, reducing the number of communication relays by a factor of B. This has two main benefits: (1) the number of times the sampler process needs to block waiting for messages is reduced by a factor of B, and (2) vertex properties that would have been redundantly fetched in different minibatches are fetched once per B minibatches. The first benefit follows from the reduction in the number of relays, and the second benefit follows from the design of Katana Graph's feature fetching, which fetches features only after the vertices of all B subgraphs are known. This process is illustrated in the figure below:

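The following short sketch illustrates the core idea in Python. It is a hypothetical illustration, not Katana Graph's sampler API: the vertices of B sampled subgraphs are merged and deduplicated so that a single communication relay fetches each vertex's features exactly once.

# Hypothetical sketch of multi-minibatch feature fetching (not Katana Graph's API).
def fetch_per_minibatch(minibatch_vertices, fetch_remote):
    # Baseline: one communication relay per minibatch; vertices shared across
    # minibatches have their features fetched repeatedly.
    return [fetch_remote(v_set) for v_set in minibatch_vertices]

def fetch_multi_minibatch(minibatch_vertices, fetch_remote):
    # Multi-minibatch: one relay for all B minibatches; each vertex's features
    # are fetched exactly once and then shared by every minibatch that needs them.
    all_vertices = set().union(*minibatch_vertices)
    features = fetch_remote(all_vertices)          # single communication relay
    return [{v: features[v] for v in v_set} for v_set in minibatch_vertices]

# Example: three minibatches (B = 3) that share vertices 2 and 3.
minibatches = [{0, 1, 2}, {2, 3, 4}, {3, 5}]
fake_remote = lambda vs: {v: ("features of", v) for v in vs}   # stands in for remote hosts
baseline = fetch_per_minibatch(minibatches, fake_remote)    # 3 relays, 8 vertex fetches
merged = fetch_multi_minibatch(minibatches, fake_remote)    # 1 relay, 6 vertex fetches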
Reducing Overhead with Native GNN Layers
Katana Graph's GNN system uses PyTorch for training neural networks; PyTorch relies on its Autograd system to compute the gradients used to update the GNN model. However, Autograd does not support in-place operations in the GNN computation, which can cause additional memory allocation and slow down execution. In addition, allocating intermediate memory buffers for the backward computation graph adds overhead to end-to-end training time.
To address this issue, Katana Graph has implemented native GNN layers for GraphSAGE, ReLU, and dropout that: (1) preallocate slightly more memory than is required as buffer space for intermediate tensors in the GNN computation and reuse it for all minibatches, instead of allocating and deallocating it for every minibatch, and (2) manually track the information required to perform the backward phase for GraphSAGE, ReLU, and dropout, allowing the use of in-place operations that Autograd cannot handle natively. These layers are integrated directly into Autograd so that the backward phase composes seamlessly with other PyTorch modules. This approach minimizes memory allocation overhead by reusing existing memory and ensures that reallocation occurs only when a minibatch larger than any previously seen is processed; once buffers have been sized for the largest possible minibatch, no further reallocation is necessary.
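The sketch below shows the general pattern using a dropout layer: a custom torch.autograd.Function fills a preallocated mask buffer in place and grows that buffer only when a larger minibatch arrives. This is a simplified illustration of the technique, not Katana Graph's actual layer code.

import torch

# Simplified illustration of a native layer with a reusable buffer
# (not Katana Graph's implementation).
class BufferedDropout(torch.autograd.Function):
    _mask = torch.empty(0)   # shared buffer, reused across minibatches

    @staticmethod
    def forward(ctx, x, p=0.5):
        # Grow the buffer only if this minibatch is larger than any seen before.
        if BufferedDropout._mask.numel() < x.numel():
            BufferedDropout._mask = torch.empty(x.numel(), dtype=x.dtype)
        mask = BufferedDropout._mask[: x.numel()].view_as(x)
        mask.bernoulli_(1.0 - p).div_(1.0 - p)   # filled in place, no per-minibatch allocation
        ctx.save_for_backward(mask)              # manually track what backward needs
        return x * mask

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask, None             # gradient w.r.t. x; None for p

# The layer plugs into Autograd like any other PyTorch operation.
x = torch.randn(1000, 256, requires_grad=True)
y = BufferedDropout.apply(x, 0.5)
y.sum().backward()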
Experimental Results
We compared Katana Graph's GNN system against DistDGL v0.9.0 [2] in a distributed setting on 8 Intel 4th-generation Xeon machines, each with 2 CPU sockets. We ran 1 training rank on each socket, so we show results for up to 16 ranks. We trained a 3-layer GraphSAGE model with fanouts of 15, 10, and 5 (15 being the fanout of the seed nodes) and a hidden layer size of 256. The minibatch size was 1000.
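For reference, a model of the same shape can be expressed with DGL's SAGEConv roughly as follows. This is an illustrative stand-in for the configuration described above, not the code used in the benchmark; the class and argument names are our own.

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv

# Illustrative 3-layer GraphSAGE model with hidden size 256
# (a stand-in for the benchmarked configuration, not the benchmarked code).
class SAGE(nn.Module):
    def __init__(self, in_feats, num_classes, hidden=256, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([SAGEConv(in_feats, hidden, "mean")])
        for _ in range(num_layers - 2):
            self.layers.append(SAGEConv(hidden, hidden, "mean"))
        self.layers.append(SAGEConv(hidden, num_classes, "mean"))

    def forward(self, blocks, x):
        # blocks: one sampled message-flow graph per layer, produced by a
        # neighbor sampler with per-layer fanouts (15, 10, and 5 in our runs).
        h = x
        for i, (layer, block) in enumerate(zip(self.layers, blocks)):
            h = layer(block, h)
            if i != len(self.layers) - 1:
                h = F.relu(h)
        return h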
On a symmetric version of the OGBN-papers100M dataset, Katana runs a training epoch 3.8x faster than DistDGL on 16 ranks (each rank runs on one socket). The figure below shows the speedup of Katana over DistDGL on 4 to 16 ranks (we were unable to run DistDGL on 1 and 2 ranks): Katana significantly outperforms DistDGL at every rank count.

Katana's performance gains can mainly be attributed to multi-minibatch sampling. The speedups of Katana over DistDGL in the various phases of an epoch are shown below. Preparation, which includes sampling, feature fetching, and sampled-subgraph construction, is 4.7x faster in Katana than in DistDGL. We also observe an improvement in forward compute time, which can be attributed to our native layer implementations. The speedup in backward compute time can be attributed to imbalance in preparation time among hosts: backward compute has a global distributed barrier, so the slowest rank bottlenecks all other ranks. Because DistDGL's preparation time is greater than Katana's, imbalance among ranks has a larger negative effect on its measured backward time.

We also integrated the Katana sampler with an Intel-developed GraphSAGE layer that uses bfloat16 half-precision floats (BF16) and compared its runtime to that of the Katana sampler with the native FP32 layer. The speedup of BF16 training is shown below. The speedup comes from communicating half-precision features (i.e., half the amount of data) and from faster computation with half-precision numbers using the Intel Xeon's half-precision compute units.

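As a rough sketch of how BF16 can be used on the CPU, PyTorch's autocast API runs eligible operations in bfloat16 while keeping model weights in FP32. The snippet below is a generic illustration with a stand-in layer, not the Intel-developed GraphSAGE layer.

import torch
import torch.nn.functional as F

# Generic illustration of BF16 training on CPU with PyTorch autocast
# (not the Intel-developed GraphSAGE layer).
model = torch.nn.Linear(256, 256)            # stands in for a GraphSAGE layer
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(1000, 256)
target = torch.randn(1000, 256)

# Features could also be sent between hosts as bfloat16 (x.to(torch.bfloat16))
# to halve the number of bytes communicated.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                            # matrix multiplies run in bfloat16
    loss = F.mse_loss(out.float(), target)    # loss computed in FP32 for stability

loss.backward()                               # weights and optimizer state remain FP32
opt.step()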
The improved epoch training time due to native layers and multi-minibatch sampling leads to a significant improvement over DistDGL. Coupled with faster partitioning from CuSP, this improves end-to-end performance, which reduces the cost of training GNNs. It also lets data scientists iterate through the GNN design space more quickly, making it possible to find better models faster and at lower expense.
Disclaimers
DGL has since released v0.9.1. DistDGL was configured with 1 sampler per rank. We used the epoch times of the 3rd epoch for the reported numbers.
Citations
[1] Loc Hoang, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics. IPDPS 2019.
[2] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs. 2021. https://arxiv.org/abs/2010.05337