Graph approaches are widely recognized for their ability to provide a contextualized understanding of the relationships among the diverse data they process. However, many options in this space depend entirely on graph databases for their core functionality. Although such databases deliver real enterprise utility, they're only part of what true graph intelligence encompasses.
We’re on the verge of a new graph era that involves not only graph databases but four pillars of graph computing, which together provide holistic graph intelligence for any use case. Graph querying, graph mining, graph analytics, and graph AI work in concert to deliver timely insights.
The essential characteristics of graph intelligence include the capacity to perform these four functions in a seamless workflow on a single platform for any business problem. Data movement should be reduced to the minimum a workflow requires. High-Performance Computing (HPC) architecture should underpin the entire platform; without it, graph intelligence isn’t possible.
The HPC gains of this approach are based on a graph engine that can scale to hundreds of machines. It consists of three elements: a partitioning engine, runtime system, and communication engine.
By performing the four functions of graph intelligence (querying, mining, analytics, and AI) with the speed and scale of HPC, organizations get enterprise-scale graph computations in a single platform that minimizes data movement. These characteristics are the fundamental differentiators of graph intelligence, which vastly exceeds what other graph databases or platforms can accomplish with any of its core components, individually or collectively.
Graph intelligence hinges on the implementation of each of its four elements in a single framework. It’s insufficient to support only one or even two of these capabilities in a particular construct; organizations must be able to do them all to complete their workflows. Moreover, they require doing so in one place without friction between these various components. For example, if a financial services organization has one system for storage and a completely separate ecosystem for analytics, it’s forced to move data between them — a process that becomes a time-consuming bottleneck.
A medical knowledge graph use case, on the other hand, offers a good example of how graph intelligence eliminates this common problem while hinting at its vast potential. This overarching graph contains numerous data types including diseases, clinical compounds, side effects, various papers or articles about those compounds, and data about those documents like their authors. Researchers might want to, for example, find all the compounds to treat heart disease that don’t produce side effects for patients between 40 and 60 years old, and then, of those compounds, identify the ones most similar to a specific compound in the same cluster. Further analysis might list the top 10 candidates and then predict their solubility or other properties. The complexity and specificity of this query are typical of those required to solve modern business problems at scale.
Answering this query involves searching the full, heterogeneous knowledge graph for instances of a smaller pattern graph, which is the essence of graph querying. Each compound the query returns is a potential treatment that must be scored on its similarity to a given molecule in order to rank the compounds. External applications that interface with the platform compute these scores. Compounds should be clustered according to their structures, and the returned compounds’ properties require accurate predictions so that data scientists can determine their suitability.
This workflow involves graph querying with the OpenCypher graph query language and graph mining techniques to answer the query, graph intelligence interfacing with custom applications, graph analytics approaches to cluster the compounds, and graph AI to predict compound properties. It exemplifies the sophisticated workflows graph intelligence can enable and the fact that having any subset of its four elements is not sufficient.
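As a concrete illustration of the first step of this workflow, the sketch below runs the compound-filtering query against a toy in-memory knowledge graph. The node names, edge types, and tiny dataset are hypothetical stand-ins for the medical graph described above, and the OpenCypher shown in the comment assumes a schema the text does not specify:

```python
# Toy knowledge graph as (source, relationship, target) triples.
# Roughly equivalent OpenCypher (hypothetical labels and properties):
#   MATCH (c:Compound)-[:TREATS]->(:Disease {name: "heart disease"})
#   WHERE NOT (c)-[:CAUSES]->(:SideEffect {age_group: "40-60"})
#   RETURN c
edges = [
    ("aspirin", "TREATS", "heart disease"),
    ("drug_x",  "TREATS", "heart disease"),
    ("drug_x",  "CAUSES", "nausea_40_60"),   # side effect in the 40-60 group
    ("drug_y",  "TREATS", "migraine"),
]

def treats(disease):
    # All compounds with a TREATS edge to the given disease node.
    return {s for s, rel, t in edges if rel == "TREATS" and t == disease}

def causes_side_effect_40_60(compound):
    # True if the compound has a CAUSES edge to a 40-60 side-effect node.
    return any(s == compound and rel == "CAUSES" and t.endswith("_40_60")
               for s, rel, t in edges)

candidates = sorted(c for c in treats("heart disease")
                    if not causes_side_effect_40_60(c))
print(candidates)  # ['aspirin']
```

The candidates that survive this filter would then flow into the scoring, clustering, and property-prediction stages described above.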
Operational Efficiency vs. Data Movement
After workflow integration, the second defining principle of graph intelligence is reducing the need to move data between system components. Decreasing data movement increases operational efficiency while cutting the cost of data replication. Once data scientists have integrated data, moving them again for data mining, for example, only undermines productivity. Most other approaches (like conventional graph databases) require third-party tools for tasks like feature engineering, which necessitates data movement. Minimizing data movement yields quicker time to value, lower costs, and better regulatory compliance and risk management.
When doing machine learning with just a database, users extract model features in the server that must be moved to a client processor and then moved again to a machine learning system for model training. A graph intelligence platform, however, can do all these functions in the server. For fraud detection use cases involving PII, for example, there are strict requirements about where data are; keeping them in-house fulfills these regulations while moving them potentially flouts them.
An End-to-End Platform
Integrating the workflows for graph intelligence in a single platform is the only way to cooperatively perform graph querying, mining, analytics, and AI while decreasing overall data movement. Pipelines throughout this comprehensive platform ensure that each facet isn’t isolated and can easily interact with the others. Once data are imported, users should be able to do everything necessary for any given workflow within the platform. For instance, graph AI involves several steps, including data preparation, data cleansing, selection of model training data via graph query, feature generation, and more.
Completing each of these steps within the same framework produces two critical outcomes. The first is that it eliminates bottlenecks, inefficiencies, and wasted time and energy. The second is that it simplifies workflows. For example, some solutions are designed only to create Graph Neural Networks (GNNs) and make them run faster while completely forsaking the initial part of the workflow; their users must rely on third-party tools and more data movement. In a graph intelligence platform, by contrast, all graph AI workloads run inside the platform, which supports pipelines for other data science or machine learning tools like PyTorch.
Processing graph intelligence’s four elements in a single platform at enterprise speed and scale — in distributed settings — requires a graph engine based on HPC architecture. Alternative approaches, particularly conventional graph databases, lack such architecture, which involves three key modules: a partitioning engine, runtime system, and communication engine. Ideally, firms should do computations on multiple machines to reduce the latency and bottlenecks that otherwise occur when implementing graph intelligence’s four components on the same machine.
These shortfalls are the primary limitation of other graph options. Some of the best-known graph vendors restrict organizations to a single machine for all their computations, requiring them to buy huge machines that process several terabytes of data and still aren’t big enough for comprehensive graph intelligence. In contrast, the partitioning engine is directly responsible for the scale-out capability of the HPC approach: it lets companies swiftly split any graph into portions that individual machines process at enterprise scale and speed.
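The partitioning idea can be sketched minimally as follows. This is an illustrative edge partitioning by node ID modulo the machine count; a real partitioning engine also balances load and minimizes cut edges, and the machine count and toy graph here are assumptions, not details from the source:

```python
NUM_MACHINES = 4

def owner(node):
    # Deterministic ownership rule: every machine can compute it locally
    # without coordination. Real engines use smarter placement.
    return node % NUM_MACHINES

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 5), (5, 6)]

# Store each edge on the machine that owns its source node.
partitions = {m: [] for m in range(NUM_MACHINES)}
for u, v in edges:
    partitions[owner(u)].append((u, v))

# Edges whose endpoints live on different machines will require
# cross-machine communication at computation time.
cut_edges = [(u, v) for u, v in edges if owner(u) != owner(v)]
print(len(cut_edges))  # 5
```

Minimizing the number of such cut edges is what distinguishes a production partitioner from this naive modulo rule.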
The runtime system schedules operations on the local machines. Both it and its data structures are implemented in C++ and optimized for multi-core processing in the graph engine. This accounts for part of the engine’s scalability and robust performance, which surpasses that of other graph platforms. Popular graph options rely on Java for runtime and data-structure operations, and Java isn’t designed for HPC the way C++ is, which limits their effectiveness. After a graph has been apportioned across machines, the runtime system also reorganizes its data structures for more efficient computation.
The communication engine controls activity at the boundaries of the graph’s partitions once they’ve been sharded across the various machines. It synchronizes the actions of these machines so that they work in concert and computations yield the desired results. This module was expressly designed for graph computing and produces a number of favorable outcomes. For example, once a graph has been sharded, the neighbor of a specific node may reside on a different machine, and the communication engine can cache it locally to reduce data movement. Parallelism is thus maximized, and the graph intelligence platform taps the full computational throughput of the machines.
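The boundary-caching idea described above is often implemented with "ghost" (mirrored) copies of remote neighbors. The sketch below, using an assumed modulo ownership rule and a made-up graph, shows how each machine would identify which remote nodes it needs to mirror locally:

```python
NUM_MACHINES = 2

def owner(node):
    # Illustrative ownership rule; a real engine uses its partitioner.
    return node % NUM_MACHINES

edges = [(0, 2), (2, 4), (4, 1), (1, 3)]

# For each machine, collect the remote neighbors it must cache ("ghosts")
# so local computation avoids a network round-trip per access.
ghosts = {m: set() for m in range(NUM_MACHINES)}
for u, v in edges:
    for local, remote in ((u, v), (v, u)):
        if owner(local) != owner(remote):
            ghosts[owner(local)].add(remote)

print(ghosts)  # {0: {1}, 1: {4}}
```

The communication engine’s job at runtime is then to keep these cached copies synchronized with their owners as computations update node state.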
Graph Query and Graph Mining
The HPC graph query engine architecture behind graph intelligence is the interface supporting all graph applications in distributed settings. Each of those applications involves one of the four facets of graph intelligence, which often begins with graph querying. Users invoke graph query when they’re looking for explicit patterns in the full heterogeneous graph of all their data — like the aforesaid medical knowledge graph.
Graph query involves evaluating any number of properties, edges, nodes, and varying node types, as well as filtered graphs or sub-graphs, against specific user requirements. For example, the query process might entail analyzing part of the data with traditional analytics or preparing the data for machine learning model training. Since it usually sits at the front of workloads, graph query is an important part of pre-processing pipelines; OpenCypher is graph intelligence’s query language.
Graph mining is the component of graph intelligence that applies advanced graph mining algorithms at enterprise speed and scale. These algorithms were devised by experts and refined for graph settings. Combined with the computational advantages described earlier, they make graph mining several orders of magnitude faster than competing approaches, producing rapid answers that accelerate graph intelligence.
Graph Analytics, Graph AI
Graph analytics enables users to compute global properties over the full graph, rather than matching local patterns as querying does. Examples include centrality algorithms, which determine the importance of a node in a graph network, and PageRank, which Google’s search engine famously uses to rank results by importance.
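To make the global-property idea concrete, here is a minimal power-iteration PageRank on a toy directed graph, using only the standard library. The damping factor of 0.85 is the conventional choice, not a detail from the text:

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Every node keeps a base share of (1 - damping) / n.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in adj.items():
            if outs:
                # Distribute this node's rank evenly over its out-links.
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(adj)
print(max(ranks, key=ranks.get))  # c (it collects links from both a and b)
```

A production analytics engine runs the same iteration in parallel across partitions, with the communication engine exchanging boundary ranks between machines.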
Complete graph intelligence also includes a library of out-of-the-box data science algorithms that are easily deployable for processing large graphs, as well as capabilities for writing customized algorithms in Python as UDFs (user-defined functions) or custom routines. This way, data scientists can use tools they already know instead of learning proprietary ones. There are also domain-specific analytics routines for rapidly generating insight in common use cases; for example, users might deploy compound similarity or unstructured search on a medical knowledge graph for pharmaceuticals.
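A hypothetical custom routine of the compound-similarity kind mentioned above might be a plain Python UDF computing Jaccard similarity over compound "fingerprints" (sets of structural features). The fingerprints below are made up for illustration; a real pipeline would derive them from molecular structure:

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

query_compound = {"ring", "hydroxyl", "amine"}
candidates = {
    "c1": {"ring", "hydroxyl"},
    "c2": {"ring", "amine", "carboxyl"},
    "c3": {"ester"},
}

scores = {name: jaccard(query_compound, fp) for name, fp in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # c1
```

Registered as a UDF, a function like this lets the platform rank query results by similarity without exporting the data to an external tool.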
Graph AI has several core functions. It supports scalable training for machine learning models, allowing users to create GNNs, for example, on tens or hundreds of servers. This flexible architecture decouples storage from compute for this purpose. There’s also a tight integration of analytics, querying, and machine learning. Doing machine learning on graph data requires querying to filter data and analytics for feature extraction. The interface with Python is flexible enough to readily compose applications for analytics, querying, and machine learning, while the platform also has advanced graph machine learning models leveraging graph data connections for superior accuracy.
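The feature-extraction step that graph AI relies on can be sketched with the standard library. The toy graph and the choice of features (degree and local clustering coefficient) are illustrative assumptions about what such a pipeline might compute before model training:

```python
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Build an undirected adjacency structure.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

def node_features(v):
    """Simple per-node structural features for a downstream ML model."""
    deg = len(neighbors[v])
    nbrs = list(neighbors[v])
    # Count neighbor pairs that are themselves connected
    # (local clustering coefficient numerator).
    links = sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                if nbrs[j] in neighbors[nbrs[i]])
    possible = deg * (deg - 1) / 2
    return {"degree": deg,
            "clustering": links / possible if possible else 0.0}

features = {v: node_features(v) for v in neighbors}
```

In the workflow the text describes, a graph query would first select the training nodes, features like these would be generated in place, and the resulting table would feed a model in a tool such as PyTorch, all without leaving the platform.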
Graph Intelligence Defined
Graph intelligence encompasses all aspects of graph computing: graph query, graph mining, graph analytics, and graph AI. To realize its full value, it must be implemented on a scalable HPC platform that performs each of these functions for complete workloads involving more than one of them. This approach minimizes data movement, maximizes efficiency, and optimizes overall productivity. Other graph frameworks lack this architecture and can only handle some of the four pillars of graph intelligence. Katana Graph is the only platform with the right architecture to consistently deliver graph intelligence and its insights.
This article was originally posted by Chris Rossbach, CTO at Katana Graph, to Medium.