The Katana Graph Engine (KGE) is a scale-out platform for high-speed graph analytics, pattern mining and querying on heterogeneous clusters of CPUs and GPUs, providing unmatched compute capability for processing even the largest graphs such as web-crawl graphs with billions of vertices and trillions of edges. This document focuses on the use of KGE for ultrahigh- performance graph analytics.
Figure 1: Graph Structure represents many real-life applications
What Is Graph Analytics?
In most application areas, large datasets are unstructured and sparse and can be represented as a graph in which vertices represent entities and edges represent relations between entities. Examples include the following:
- Web-search engines perform analytics on the indexable-web graph in which vertices represent webpages and edges represent hyperlinks between pages. The page-rank of vertices in this graph provides a measure of the relative importance of the corresponding webpages.
- Graph neural networks and graph convolutional networks are used extensively in machine learning for natural language processing and predicting the spread of infectious diseases.
In many recommendation systems, the relationship between users and their ratings of items of interest is represented as a bipartite graph with labeled edges in which the vertices represent users and items, and labeled edges represent ratings of items by users.
Katana Graph Engine
The KGE [8, 10] executes graph analytics applications on heterogeneous clusters of CPUs and GPUs, providing industry-leading performance and capability even on the largest graphs. Scale-out provides more aggregate computational power and DRAM for analyzing properties of large graphs than is possible on a single machine. In addition, each machine in the cluster can read a portion of the graph from secondary storage in parallel with other machines, which provides more aggregate bandwidth for loading the graph into memory.
On CPUs, the Katana runtime system optimizes program execution to exploit NUMA locality; for example, it performs NUMA-aware dynamic load balancing to ensure that computational load is spread out evenly between the cores of the CPU. On GPUs, the Katana graph engine incorporates performance optimizations for reducing the overhead of kernel launches and atomic operations.
For execution on clusters, the KGE partitions graphs between the machines in the cluster using a rich variety of graph partitioning policies including edge-cuts and vertex-cuts. Application-specific graph partitioning policies can be easily implemented in the Katana graph partitioner.
Communication is often the performance bottleneck in the distributed-memory execution of graph analytics programs. To avoid this, the KGE has a communication runtime that has been optimized for graph computing.
The Katana core graph libraries provide highly scalable concurrent data structures such as concurrent graph representations and concurrent worklists to implement work-efficient algorithms. These libraries can be used by data scientists to write new applications in Python, leaving it to the underlying system to optimize the program for distributed, heterogeneous execution.
Performance Evaluation
Both real-world and synthetically generated graphs are used in this study. Real-world graphs include social network graphs such as Twitter40, Friendster, and road networks such as Road-Europe. Other studies [8, 9] have also evaluated KGE on web-crawls such as Uk07, Clueweb12, and Wdc12. Clueweb12 and Wdc12 are one of the largest publicly available web crawls.
This study also uses synthetically generated graphs such as the LDBC datasets including Datagen-8 1-fb, Datagen-8 9-fb, and Datagen-9 4-fb which are generated using LDBC Graphalytics [5] generator.
The hardware specifications, applications, and input datasets used in each experiment are listed in their respective figure captions.

Figure 2: Performance of Katana Graph and Neo4j. AWS: 16 CPU cores; 128GB. Benchmarks: Page Rank (pr) (10 iterations) and Weakly Connected Components (WCC). Datasets: Social networks: Twitter40 (41.6 nodes; 1.5B edges) and Friendster (66M nodes; 1.8B edges).
Single-machine Results
Figure 2 and Figure 3 compare the performance of the KGE with that of the industry leader, Neo4j [1]. The KGE is 2-3 orders of magnitude faster for both real-world social networks (Twitter40 and Friendster) as well as LDBC Graphalytics [5] datasets. In addition, Katana Graph’s efficient in-memory representation enables it to handle much bigger graphs (Datagen-8_9-fb) on a single machine.

Figure 3: Performance of Katana Graph Engine, and Neo4j. AWS: 8 CPU cores; 64GB. Benchmarks: Page Rank (pr) (10 iterations) and Weakly Connected Components (WCC). Datasets: LDBC Graphalytics Datasets: Datagen-8 1-fb (2.1M nodes; 134M edges), Datagen-8 9-fb (10.5M nodes; 849M edges) and Datagen-9 4-fb (29.3M nodes; 2.3B edges). OOM (Out Of Memory).
Clusters
Figure 4 compares the performance of KGE with that of GraphX [7] (based on Spark [4]). The KGE is faster than GraphX by 21x (geomean speedup) on the same hardware platform.

Figure 4: Katana Graph Engine vs. Spark GraphX. GPU: AWS 16 machine cluster; 8 CPU Cores, 64GB DRAM each. Benchmarks: PageRank (PR) (50 iterations). Datasets: Road network: Road-Europe (173M nodes; 348M edges) and Social network: Twitter40 (41.6 nodes; 1.5B edges).
Conclusion
The KGE’s scalability on distributed platforms arises from its efficient compute engine, its support for a rich variety of graph partitioning policies, and an optimizing communication runtime for efficient cluster communication.
References
- Neo4j 4.0: Graph platform, 2020: https://neo4j.com
- P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. WWW’11
- T. L. Project. The ClueWeb12 Dataset, 2013.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. NSDI’12.
- GraphAnlytics: https://graphalytics.org
- SNAP Datasets: https://snap.stanford.edu/data/com-Friendster.html
- GraphX: https://spark.apache.org/graphx
- R. Dathathri, G. Gill, L. Hoang, H. Dang, A. Brooks, N. Dryden, M. Snir, and K. Pingali. 2018. Gluon: a communication-optimizing substrate for distributed heterogeneous graph analytics. PLDI’18
- D. Nguyen, A. Lenharth, and K. Pingali. 2013. A lightweight infrastructure for graph analytics. SOSP '13
- Gurbinder Gill, Roshan Dathathri, Loc Hoang, Ramesh Peri, and Keshav Pingali. Single machine graph analytics on massive datasets using Intel optane DC persistent memory. VLDB 2020