ETL with the Katana Graph Library

By: Katana Graph

June 09, 2022

ETL (extract, transform, and load) refers to the process data engineers use to pull data from diverse sources, transform it into a usable form for the desired process, and then load it into a system where data scientists and others can use it — for example, in predictive analytics. Much data analysis relies on ETL, but with the exponential increase in data volume and number of data sources, ETL is often a painful but necessary aspect of data science.
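To make the three stages concrete, here is a minimal sketch using only the Python standard library. The CSV content, the column names `user`, `follows`, and `since`, and the helper functions are all hypothetical and chosen for illustration; the `load` step simply builds an in-memory adjacency dictionary as a stand-in for a real graph engine.

```python
import csv
import io

# Hypothetical raw input: who follows whom, and since which year.
RAW_CSV = """user,follows,since
alice,bob,2019
bob,carol,2020
carol,alice,2021
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: map names to integer node IDs and build edge-list columns."""
    ids = {}
    edges = {"source": [], "destination": [], "prop": []}
    for row in rows:
        src = ids.setdefault(row["user"], len(ids))
        dst = ids.setdefault(row["follows"], len(ids))
        edges["source"].append(src)
        edges["destination"].append(dst)
        edges["prop"].append(int(row["since"]))
    return edges, ids

def load(edges):
    """Load: build an adjacency dictionary (a stand-in for a graph engine)."""
    adjacency = {}
    for src, dst, prop in zip(edges["source"], edges["destination"], edges["prop"]):
        adjacency.setdefault(src, []).append((dst, prop))
    return adjacency

edges, ids = transform(extract(RAW_CSV))
graph = load(edges)
print(edges)
print(graph)
```

The `source`, `destination`, and `prop` columns produced by the transform step have the same shape as the edge lists used in the import examples later in this post.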

Katana Graph provides an open-source Python analytics library, freely available under the 3-Clause BSD license. You can find it on GitHub or Anaconda.org. The library is interoperable with scikit-learn, pandas, and Apache Arrow. Installation is easy with the open-source conda Python environment manager:

$ conda install -c katanagraph/label/dev -c conda-forge katana-python

Katana's Python library supports a variety of data formats for ETL, including NumPy arrays, adjacency matrices, pandas DataFrames, edge lists, GraphML, and NetworkX. We'll show a few simple examples here. First, a Python program must import the libraries installed by the command above and some that they depend on:

import numpy as np
import pandas

from katana.local import Graph
from katana.local.import_data import (
    from_adjacency_matrix,
    from_edge_list_arrays,
    from_edge_list_dataframe,
    from_edge_list_matrix,
    from_graphml)

Most import routines take the data as an argument and return a new graph populated with it. Here is an example with a pandas DataFrame:

katana_graph = from_edge_list_dataframe(
    pandas.DataFrame(dict(source=[10, 20, 30],
                          destination=[1, 2, 0],
                          prop=[1, 2, 3])))

Sample input from an adjacency matrix:

katana_graph = from_adjacency_matrix(
    np.array([[0, 1, 0], [0, 0, 2], [3, 0, 0]]))
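To see how an adjacency matrix relates to an edge list, the sketch below converts the same 3x3 matrix into source, destination, and weight columns. It assumes that each non-zero entry at row i, column j stands for an edge from node i to node j with that value as its weight. This is the usual convention, but the post does not spell out how `from_adjacency_matrix` interprets its input, so treat it as an assumption. The helper `adjacency_to_edge_list` is hypothetical and uses plain Python lists so that it runs without any dependencies.

```python
def adjacency_to_edge_list(matrix):
    """Return (sources, destinations, weights) for every non-zero matrix entry."""
    sources, destinations, weights = [], [], []
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if weight != 0:  # assumption: zero means "no edge"
                sources.append(i)
                destinations.append(j)
                weights.append(weight)
    return sources, destinations, weights

matrix = [[0, 1, 0],
          [0, 0, 2],
          [3, 0, 0]]

src, dst, w = adjacency_to_edge_list(matrix)
print(src, dst, w)  # prints [0, 1, 2] [1, 2, 0] [1, 2, 3]
```

Under that assumption, the matrix describes a three-node cycle 0 → 1 → 2 → 0 with edge weights 1, 2, and 3.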

Sample input from an edge list:

katana_graph = from_edge_list_arrays(
    np.array([0, 1, 10]), np.array([1, 2, 0]),
    prop=np.array([1, 2, 3]))

Sample input from a pandas DataFrame:

katana_graph = from_edge_list_dataframe(
    pandas.DataFrame(dict(source=[0, 1, 10],
                          destination=[1, 2, 0],
                          prop=[1, 2, 3])))

Sample input from GraphML, in which input_file stores a file in the GraphML file format:

katana_graph = from_graphml(input_file)
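If you do not have a GraphML file at hand, the following sketch writes a minimal one with the Python standard library. The element and attribute names (`graphml`, `graph`, `node`, `edge`, `edgedefault`, `source`, `target`) and the namespace come from the GraphML format itself. The helper `write_graphml`, the `n0`-style node IDs, and the temporary file location are illustrative assumptions, and whether Katana's importer expects additional keys or attributes is not covered here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def write_graphml(path, edges):
    """Write a directed graph, given as (source, target) pairs, to a GraphML file."""
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    graph = ET.SubElement(root, f"{{{GRAPHML_NS}}}graph",
                          id="G", edgedefault="directed")
    # Declare every node that appears in at least one edge.
    for node in sorted({n for edge in edges for n in edge}):
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", id=f"n{node}")
    for source, target in edges:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}edge",
                      source=f"n{source}", target=f"n{target}")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# A three-node cycle, written to a temporary file.
input_file = os.path.join(tempfile.mkdtemp(), "example.graphml")
write_graphml(input_file, [(0, 1), (1, 2), (2, 0)])
print(input_file)
```

The resulting path plays the role of `input_file` in the call above.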

Once the data is loaded, you can run graph analytics algorithms on it. For example, the betweenness centrality of the input graph can be computed as follows:

import katana.local
from katana.example_utils import get_input
from katana.property_graph import PropertyGraph
from katana.analytics import (
    betweenness_centrality,
    BetweennessCentralityPlan,
    BetweennessCentralityStatistics)

katana.local.initialize()

# Name of the node property that will hold the computed values.
property_name = "betweenness_centrality"

# Compute betweenness centrality and store it in the named property.
betweenness_centrality(katana_graph, property_name, 16,
                       BetweennessCentralityPlan.outer())

# Summarize the values stored in that property.
stats = BetweennessCentralityStatistics(katana_graph, property_name)
print("Minimum Centrality:", stats.min_centrality)
print("Maximum Centrality:", stats.max_centrality)
print("Average Centrality:", stats.average_centrality)
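For readers unfamiliar with the metric, betweenness centrality scores a node by how many shortest paths between other nodes pass through it. The sketch below is a small pure-Python version of Brandes' algorithm for unweighted directed graphs, run on the three-node cycle from the adjacency matrix example. It is only an illustration of what is being measured, not Katana's implementation: the function `betweenness` and the dictionary-based graph are hypothetical, the scores are unnormalized, and Katana's values may differ depending on normalization and on how many source nodes are sampled.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalized betweenness centrality via Brandes' algorithm (unweighted)."""
    scores = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths to each node.
        path_count = {v: 0 for v in adjacency}
        path_count[s] = 1
        distance = {s: 0}
        predecessors = {v: [] for v in adjacency}
        visit_order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(visit_order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != s:
                scores[w] += dependency[w]
    return scores

# The cycle 0 -> 1 -> 2 -> 0 from the adjacency matrix example.
cycle = {0: [1], 1: [2], 2: [0]}
scores = betweenness(cycle)
print("Minimum Centrality:", min(scores.values()))
print("Maximum Centrality:", max(scores.values()))
print("Average Centrality:", sum(scores.values()) / len(scores))
```

In this cycle every node lies on exactly one shortest path between the other two, so all three statistics are 1.0.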

In addition to the prepackaged routines, data scientists can also write their own graph algorithms using a Python interface that exposes Katana Graph's concurrent data structures and parallel loop constructs via its optimized C++ engine. Future posts will further discuss the Katana Graph API and Metagraph support.

There are two common ways to use the Katana library from GitHub. One is to add the repository to your own CMake project as a git submodule. The other is to install the library outside your project and import it as a CMake package. The Katana CMake package is available through the katana-cpp Conda package, which is a dependency of the katana-python Conda package.

The Katana Graph Intelligence Platform aims to make parallel graph analytics and machine learning algorithms convenient and efficient. Take a look at the complete Katana Graph Python analytics library.
