Graph computing, which is much more than simply using a graph database, and graph AI are capable of solving the most difficult data and analytics problems in the world. Ambitious as this claim seems, it’s relatively easy to make and significantly more difficult to prove.
Until now, that is.
Some of the world’s most advanced research institutions (including Harvard and Stanford) and technology companies (including Intel) are either tacitly or overtly saying the same thing. The strongest evidence that graph computing is a formidable approach to scalable computation for machine learning and AI comes from the molecular property prediction competition of Therapeutics Data Commons (TDC).
For the better part of two months, Katana Graph led this international data science contest for life sciences, which drew participants from the foregoing research institutions and others, by combining graph computing and graph AI.
Even a cursory understanding of how this competition works is critical to realizing its importance for enterprise-scale data computations. It reveals why Katana’s graph computing and graph AI innovations were so successful and suggests that this combination of capabilities is the way forward for solving data and analytics problems across the globe.
Molecular Property Predictions
TDC provides a number of resources designed to foster the use of machine learning and AI to create medical solutions for therapeutics. Its molecular property prediction benchmark is a worldwide contest in which various teams compete for the top spot.
Although the molecular property prediction event is just one of TDC’s many contests, it has obvious consequences for devising new pharmaceuticals. The capacity to accurately predict molecular properties speeds up the drug development cycle while reducing its costs.
This event is predicated on a classic machine learning prediction problem in which TDC supplies competitors with training datasets of numerous molecules and molecular structures. Teams also get a particular property to focus on, such as toxicity. In this case, the objective is to determine whether or not the molecules or molecular structures are toxic. For instance, hydrogen cyanide is, whereas sugar is not.
The competition is aligned with the most common pharmaceutical properties: Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET). The challenge is whether, given a new molecule that’s not in the training dataset, teams can accurately predict a property (for example, whether the molecule is toxic) based on what their AI models learned from the training dataset. Success is critical for shortening time to market for new pharmaceuticals and boosting their effectiveness.
Graph Computing’s Adeptness
Because teams come from universities and corporate sponsors around the world, the machine learning techniques used are as varied as the competition is intense. Ensemble modeling and gradient boosting are just two of the many techniques involved.
Because these methods combine the predictive power of multiple models, they’re regarded as some of the most accurate machine learning techniques. It’s therefore a testament to graph computing’s viability that, for approximately two months, Katana’s graph AI approach ranked first in eight of the contest’s 22 categories and second in two more. This methodology was either first or second in nearly half the categories, on a problem that people don’t immediately associate with graph.
Katana led the contest by creating a similarity graph and building a Graph Convolutional Network (GCN), termed SIM GNN, using a hyper-parameter-based threshold approach, features from chemical informatics libraries, and K-Nearest Neighbor. SIM GNN was one of the most accurate models for predicting molecular properties in this competition. A GCN is a type of Graph Neural Network (GNN) that adapts the convolution operation of Convolutional Neural Networks (CNNs) to graph-structured data, which makes it excel at identifying relationships between entities.
The threshold-based approach connects two nodes (each containing information about a molecule) when their similarity meets or exceeds a threshold set as a hyper-parameter. RDKit is an example of a chemical informatics library. Such libraries provided two features for SIM GNN: similarity scores between two molecules or drugs, and machine learning-ready features that capture the structural properties of specific drugs or molecules. K-Nearest Neighbor was used to identify the molecules that are the most similar based on the features supplied by the chemical informatics library.
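To make the graph-building step concrete, here is a minimal Python sketch. It is not Katana’s implementation. It assumes each molecule has already been reduced to a set of "on" fingerprint bits (the kind of structural feature a chemical informatics library can produce) and uses Tanimoto similarity as the score. Because the article does not say how the threshold and K-Nearest Neighbor steps were combined, they are shown as two separate edge-selection functions. The molecule names, fingerprint values, and function names are all hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(fingerprints):
    """Score every unordered pair; keys are (u, v) tuples with u < v."""
    names = sorted(fingerprints)
    return {
        (u, v): tanimoto(fingerprints[u], fingerprints[v])
        for u, v in combinations(names, 2)
    }


def threshold_edges(sim, threshold):
    """Keep the pairs whose similarity meets or exceeds the threshold."""
    return {pair for pair, score in sim.items() if score >= threshold}


def knn_edges(sim, k):
    """Link each molecule to its k most similar neighbors (undirected)."""
    neighbors = {}
    for (u, v), score in sim.items():
        neighbors.setdefault(u, []).append((score, v))
        neighbors.setdefault(v, []).append((score, u))
    edges = set()
    for u, scored in neighbors.items():
        # Highest score first; ties broken by name for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        for _, v in scored[:k]:
            edges.add((u, v) if u < v else (v, u))
    return edges


# Hypothetical fingerprints: each set lists the bits that are switched on.
fingerprints = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),
    "mol_c": frozenset({7, 8, 9}),
    "mol_d": frozenset({3, 4, 8, 9}),
}

sim = pairwise_similarity(fingerprints)
edges_by_threshold = threshold_edges(sim, threshold=0.5)
edges_by_knn = knn_edges(sim, k=1)
```

With these toy values, only mol_a and mol_b clear the 0.5 threshold, while the K-Nearest Neighbor variant also links mol_c and mol_d. The threshold and k are exactly the kind of hyper-parameters that would be tuned per dataset.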
While Katana led the contest, SIM GNN was one of the few models capitalizing on the intrinsic relationships between molecules and between molecular structures. Discerning those relationships among the different drugs and their features, then leveraging them for predictive accuracy, made the decisive difference. This method builds on the convolution technique behind CNNs, which have a lengthy history in image recognition and computer vision. Katana applied that technique to predictions about the relationships among drugs, molecules, and their features.
One of the strengths of graph AI is that graph computing, and GCNs in particular, is ideal for both determining the relationships between different entities and exploiting them. This capability is based on graph structures, which may not be as easily identifiable as the regular grids CNNs operate on in image recognition use cases. However, graph approaches still pinpoint definite structures in datasets based on the relationships in them. GNNs excel at any use case in which firms are looking for relationships and don’t know what their structure or shape is.
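As an illustration of how a GCN exploits relationships, the sketch below implements a single graph convolution layer in NumPy using the widely used normalized-adjacency formulation: each node mixes its own features with those of its neighbors, weighted by node degree, and then applies a linear map followed by ReLU. Whether SIM GNN uses this exact variant is not stated in the article. The tiny graph, the feature values, and the weights are made up for the example, and in practice the weights would be learned during training.

```python
import numpy as np


def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2), the self-loop-augmented adjacency."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # degree of each node (>= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution: aggregate neighbors, transform, apply ReLU."""
    return np.maximum(adj_norm @ features @ weights, 0.0)


# Hypothetical graph: molecules 0 and 1 are similar, molecule 2 is isolated.
adj = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])
features = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
weights = np.eye(2)  # identity weights so the aggregation is easy to see

hidden = gcn_layer(normalize_adjacency(adj), features, weights)
```

Nodes 0 and 1 end up with identical, blended representations because they are connected, while the isolated node 2 keeps its own features. That blending is how information about similar molecules reaches a given molecule’s prediction.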
Distributed Graph Computing
Katana’s world-leading methodology in TDC’s competition has several implications for applying graph computing to machine learning. The most immediate is that a scale-out, High-Performance Computing architecture is practically necessary for large data science datasets. This architecture facilitates intuitive data storage at scale, while enabling organizations to readily manipulate that data for advanced predictive models. Were companies relying on a single machine for this task, they’d face rising costs for storage and for additional CPUs or GPUs, along with slower time to value.
The distributed approach is faster, more scalable, and lets users begin working on the data immediately. It also supports distributed training on the graph across different machines, so users can try different models on, for example, a similarity graph, a molecular graph, or directly on the features themselves. Alternatively, users can create a graph of just one drug to learn about it, and also work with classic machine learning models like Random Forest and Extreme Gradient Boosting for ensemble modeling.
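One simple form of ensemble modeling is soft voting, in which the predicted probabilities of several models are averaged. The sketch below is written in plain Python and assumes each model has already produced a toxicity probability per molecule. The model names and probability values are hypothetical, and the article does not say which ensembling scheme, if any, Katana used.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model probabilities for each sample."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]


def to_labels(scores, cutoff=0.5):
    """Turn blended probabilities into 0/1 class labels."""
    return [1 if score >= cutoff else 0 for score in scores]


# Hypothetical per-molecule toxicity probabilities from three models.
gnn_probs = [0.9, 0.2, 0.6]
forest_probs = [0.7, 0.4, 0.3]
boosted_probs = [0.8, 0.1, 0.45]

blended = soft_vote([gnn_probs, forest_probs, boosted_probs])
labels = to_labels(blended)
```

Passing a weights list lets a stronger model count for more in the blend. Equal weights are used here only to keep the arithmetic easy to follow.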
Additional Data Science Capabilities
Moreover, organizations can use this single data flow graph framework with languages like Python, allowing data scientists to build new models in this ubiquitous language without moving data. Katana’s graph intelligence solution also has native integrations with the most widely used libraries for crafting GNNs, such as Deep Graph Library and PyTorch Geometric. Additionally, the platform contains out-of-the-box machine learning models that can be refined for specific use cases. These functions and others helped Katana’s team get ahead in TDC’s molecular property prediction competition.
Graph computing is designed to handle the scalable analytics, AI, and data problems of our day. In this respect, Katana’s showing at TDC’s molecular property prediction competition offers compelling evidence of graph’s utility. The key takeaway, of course, is that this life sciences application is just an indicator of what graph can do in any other vertical, too. Graph can deliver the same sort of innovative solutions there by employing all elements of graph computing, which include graph query, graph mining, graph analytics, and graph AI. Katana’s performance at TDC’s international event, therefore, is simply a precursor to a new, better era of computing: graph computing.