Optimizing Large-Scale Distributed Graph Neural Networks on Intel CPUs

By: Katana Graph

January 23, 2023


Training Graph Neural Networks on Large Graphs

Graph neural networks (GNNs) are a powerful tool for applying machine learning techniques to graph data. They use deep neural networks to generate or transform features on vertices and edges, making those features useful for a wide range of applications including fraud detection, product recommendation, and drug discovery.

One of the challenges with using GNNs is that real-world graphs such as social networks and financial transaction graphs can become very large with billions of vertices and trillions of edges. These graphs may not fit on a single machine as they can easily reach terabyte scale. Training such GNNs in reasonable time requires the use of high-performance processors and distributed systems.

To address this challenge, Katana Graph and Intel have partnered to create the world's fastest distributed CPU-based GNN system by utilizing Katana Graph's scale-out graph computation engine and Intel's 4th-generation Xeon CPUs. This system delivers record-breaking performance on very large graphs. In this article, we describe the system’s major components and their contribution to overall performance.

Streaming Partitioner

Training GNNs on large graphs requires partitioning the dataset across a cluster of machines. Many existing systems use offline graph partitioners, a category that includes hierarchical graph partitioners like METIS and spectral graph partitioners that solve eigenvector problems to find the partitions. These partitioners attempt to optimize communication and load balance for sparse matrix-vector products, but the resulting partitions are not necessarily optimized for GNN workloads. Moreover, the overhead of computing the partitions can be substantial.

Katana Graph offers a seamless and efficient partitioner called the Customizable Streaming Partitioner (CuSP) [1], which partitions the graph quickly and flexibly in a streaming manner. Unlike offline partitioners such as METIS, CuSP does not continually refine owner assignments of vertices and edges, which is a time-consuming process. This can yield lower-quality partitions, but in exchange, partitioning is significantly faster, especially for very large graphs. CuSP supports arbitrary partitioning policies, including edge-cuts and vertex-cuts, giving users the flexibility to choose the best policy for their particular GNN. Katana Graph also augments CuSP's efficient partitioning with efficient communication of feature vector properties, which is essential because graphs used with GNNs often have large input feature vectors.
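To make the streaming idea concrete, here is a minimal Python sketch (not CuSP's actual API): each edge is assigned to a partition in a single pass by a pluggable policy function, mirroring CuSP's customizable policies. The `source_hash_policy` below is a hypothetical edge-cut-style policy.

```python
from collections import defaultdict

def stream_partition(edges, num_parts, policy):
    """Assign each edge to a partition in one streaming pass.

    `policy` is a pluggable function (src, dst, num_parts) -> partition id,
    so edge-cut and vertex-cut style policies can be swapped in freely.
    """
    parts = defaultdict(list)
    for src, dst in edges:
        parts[policy(src, dst, num_parts)].append((src, dst))
    return parts

def source_hash_policy(src, dst, num_parts):
    # Edge-cut-style policy: all out-edges of a vertex land together.
    return hash(src) % num_parts

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]
parts = stream_partition(edges, 2, source_hash_policy)
```

Because each edge is placed as it streams by, partitioning time is linear in the number of edges. That is the tradeoff described above: speed over iterative refinement.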

Multi-Minibatch Sampling and Feature Fetching

Existing systems for training GNNs on large graphs often sample one minibatch at a time, which can lead to missed opportunities for performance optimization through data reuse across minibatches. For example, multiple minibatches may include the same nodes, but their features may get fetched multiple times across different minibatches.

Katana Graph addresses this issue with its multi-minibatch sampler, which reduces sampling and property fetching overhead by merging the communication that would occur over multiple minibatches into a single multi-minibatch. This is made possible by leveraging excess memory on CPU machines to fetch the sampled subgraph and features for many minibatches at once instead of just one minibatch at a time.

As an example, consider an epoch with M minibatches. Normally, M minibatches would require M communication relays (i.e., back-and-forth requests for sampling for each GNN layer), in which the L-hop sampled subgraph and its features are requested and fetched from other devices. These relays cause the process responsible for communication to block for request responses and limit the overall execution speed. Instead of performing M relays for M minibatches, Katana Graph does one relay for B minibatches at once, reducing the number of communication relays by a factor of B. This has two main benefits: (1) the number of times the sampler process needs to block waiting for messages is reduced by a factor of B, and (2) vertex properties that would have been redundantly fetched in different minibatches are fetched once per B minibatches. The first benefit follows from the reduction in the number of relays, and the second benefit follows from the design of Katana Graph's feature fetching, which fetches features only after the vertices of all B subgraphs are known. This process is illustrated in the figure below:


Figure: Training M Minibatches in an Epoch
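The deduplicated feature fetching described above can be sketched in a few lines of Python (a simplified model, not Katana's actual API): the vertex sets of B minibatches are unioned first, and features are fetched once for the union rather than once per minibatch.

```python
def fetch_features(vertex_ids, feature_store):
    # Stand-in for a remote feature fetch; counts one call per relay.
    fetch_features.calls += 1
    return {v: feature_store[v] for v in vertex_ids}
fetch_features.calls = 0

def multi_minibatch_fetch(minibatches, feature_store):
    """Fetch features for B minibatches with a single relay.

    Vertices shared across minibatches are fetched exactly once.
    """
    unique_vertices = set().union(*minibatches)
    cache = fetch_features(unique_vertices, feature_store)
    return [{v: cache[v] for v in mb} for mb in minibatches]

feature_store = {v: [float(v)] * 4 for v in range(10)}
minibatches = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}]
batch_features = multi_minibatch_fetch(minibatches, feature_store)
```

With B = 3 here, one fetch covers 9 vertex requests, only 5 of which are distinct, so both the number of relays and the bytes transferred shrink.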

Reducing Overhead with Native GNN Layers

Katana Graph's GNN system uses PyTorch for training neural networks, which relies on a system called Autograd to compute gradients used to update the GNN model. However, Autograd does not support in-place operations in the GNN computation, which can cause additional memory allocation and slow down execution. In addition, the allocation of intermediate memory buffers for use in the backward computation graph may add overhead to end-to-end training time.

To address this issue, Katana Graph has implemented native GNN layers for GraphSAGE, ReLU and dropout that: (1) preallocate slightly more memory than is required as buffer space for intermediate tensors in the GNN computation and reuse it for all minibatches instead of allocating and deallocating it for every minibatch, and (2) manually track the information required to perform the backward phase for GraphSAGE, ReLU, and dropout in order to allow for the use of in-place operations that Autograd cannot handle natively. These layers are integrated directly into Autograd so that the backward phase can seamlessly occur with other PyTorch modules. This approach minimizes memory allocation overhead via reuse of existing memory and ensures that reallocation only occurs when a larger minibatch than has been previously seen is processed. Once the largest possible minibatch has been allocated for, no further reallocation is necessary.
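The buffer-reuse strategy can be illustrated with a small sketch (hypothetical code, not Katana's actual implementation): an allocation grows only when a larger minibatch than any previously seen arrives, with a small slack factor so minor size fluctuations do not trigger reallocation.

```python
class ReusableBuffer:
    """Grow-only scratch space for intermediate tensors."""

    def __init__(self, slack=1.1):
        self.slack = slack        # preallocate slightly more than needed
        self.data = []
        self.allocations = 0      # track how often we actually reallocate

    def get(self, size):
        if size > len(self.data):
            # Reallocate only when the request exceeds current capacity.
            self.data = [0.0] * int(size * self.slack)
            self.allocations += 1
        return self.data          # reused across minibatches

buf = ReusableBuffer()
for batch_size in [800, 1000, 950, 1000, 600]:
    scratch = buf.get(batch_size)   # only the first two sizes allocate
```

Once the largest minibatch has been seen, every later `get` call is allocation-free, which is the steady state the native layers reach during training.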

Experimental Results

We compared Katana Graph’s GNN system against DistDGL v0.9.0 [2] in a distributed setting on 8 Intel 4th-generation Xeon machines, each with 2 CPU sockets. We ran 1 training rank on each socket, so we show results for up to 16 ranks. We ran a 3-layer GraphSAGE model with fanouts of 15, 10, and 5 (15 being the fanout of the seed nodes) and a hidden layer size of 256. The minibatch size was 1000.
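As a back-of-envelope illustration of what these fanouts imply (assuming no shared neighbors, so this is an upper bound), the maximum number of vertices sampled per seed can be computed as follows:

```python
def max_sampled_vertices(fanouts):
    """Upper bound on vertices sampled per seed for the given per-hop fanouts."""
    total, frontier = 1, 1          # start with the seed vertex itself
    for fanout in fanouts:
        frontier *= fanout          # each frontier vertex samples `fanout` neighbors
        total += frontier
    return total

# Fanouts 15, 10, 5 as in the experiments above:
bound = max_sampled_vertices([15, 10, 5])   # 1 + 15 + 150 + 750 = 916
```

In practice the sampled subgraph is far smaller because neighbors are shared across seeds, which is exactly the redundancy that multi-minibatch fetching exploits.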

On a symmetric version of the OGBN-papers100M dataset, Katana runs a training epoch 3.8x faster than DistDGL on 16 ranks (one rank per socket). The figure below shows the speedup of Katana over DistDGL on 4 to 16 ranks (we were unable to run DistDGL on 1 and 2 ranks): Katana significantly outperforms DistDGL at all rank counts.


Figure: Speedup of Katana over DistDGL on 4 to 16 ranks


Katana’s performance can mainly be attributed to the benefits of multi-minibatch sampling. The speedup of Katana over DistDGL in the various phases of the epoch is shown below. The preparation time, which includes sampling, feature fetching, and sampled subgraph construction, is 4.7x faster in Katana than in DistDGL. We also observe improvement in the forward compute time, which can be attributed to our native layer implementation. The speedup in backward compute time can be attributed to imbalance in preparation time among hosts: backward compute has a global distributed barrier, so the slowest rank bottlenecks all other ranks. Because DistDGL’s preparation time is greater than Katana’s, imbalance among ranks inflates its measured backward time.


Figure: Speedup of Katana over DistDGL in each phase of the epoch on 16 ranks


We also integrated the Katana sampler with an Intel-developed GraphSAGE layer that uses bfloat16 half-precision floats (BF16) and compared the runtime to the Katana sampler with the native layer (FP32). The speedup of the BF16 training time is shown below. This speedup is the result of communicating half-precision features (i.e., half the amount of data) and doing faster computation with half-precision numbers using the Intel Xeon’s half-precision compute units.


Figure: Speedup of BF16 training over FP32 training
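The communication half of that speedup is easy to quantify. Using hypothetical sizes (the vertex count and feature dimension below are illustrative, not taken from the experiments), switching features from 4-byte FP32 to 2-byte BF16 halves the bytes per relay:

```python
def feature_bytes(num_vertices, feature_dim, bytes_per_value):
    # Total bytes needed to communicate features for a sampled subgraph.
    return num_vertices * feature_dim * bytes_per_value

num_vertices, feature_dim = 100_000, 128   # illustrative sizes
fp32_bytes = feature_bytes(num_vertices, feature_dim, 4)  # float32: 4 bytes/value
bf16_bytes = feature_bytes(num_vertices, feature_dim, 2)  # bfloat16: 2 bytes/value
```

The compute half comes from the Xeon's half-precision units, which process BF16 values at a higher rate than FP32.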


The improved epoch training time due to native layers and multi-minibatch sampling leads to a significant improvement over DistDGL. Coupled with faster partitioning from CuSP, this improves end-to-end performance, which reduces the cost of training GNNs. It also lets data scientists iterate through the GNN design space more quickly, making it possible to find better models faster and at reduced expense.


Notes: DGL has since released v0.9.1. DistDGL uses 1 sampler per rank. Reported numbers use the epoch time of the 3rd epoch.


[1] Loc Hoang, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics. IPDPS 2019.

[2] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs, 2021. https://arxiv.org/abs/2010.05337

