ETL (extract, transform, and load) refers to the process data engineers use to pull data from diverse sources, transform it into a usable form for the desired process, and then load it into a system where data scientists and others can use it, for example in predictive analytics. Much data analysis relies on ETL, and as data volumes and the number of data sources grow, ETL is often a painful but necessary part of data science.
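To make the three stages concrete, here is a minimal sketch in plain Python that uses only the standard library. The CSV contents, column names, and helper functions are hypothetical and chosen purely for illustration; they are not part of any Katana API.

```python
import csv
import io

# Hypothetical raw export: one "follows" relationship per row.
RAW_CSV = """user,follows,since
alice,bob,2019
bob,carol,2020
carol,alice,2021
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: map user names to integer node IDs and cast the year to int."""
    ids = {}
    edges = []
    for row in rows:
        src = ids.setdefault(row["user"], len(ids))
        dst = ids.setdefault(row["follows"], len(ids))
        edges.append((src, dst, int(row["since"])))
    return ids, edges


def load(edges):
    """Load: store the edges in a structure an analysis step can consume."""
    adjacency = {}
    for src, dst, since in edges:
        adjacency.setdefault(src, []).append((dst, since))
    return adjacency


node_ids, edge_list = transform(extract(RAW_CSV))
graph = load(edge_list)
print(node_ids)
print(graph)
```

In a real pipeline the load step would hand the edges to a graph system rather than a dictionary, which is the role the import routines described below play.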
Katana Graph provides an open-source Python analytics library, freely available under the 3-Clause BSD license. You can find it on GitHub or Anaconda.org. The library is interoperable with scikit-learn, pandas, and Apache Arrow. Installation is easy with the open-source conda Python environment manager:
$ conda install -c katanagraph/label/dev -c conda-forge katana-python
Katana's Python library supports a variety of data formats for ETL, including NumPy arrays, adjacency matrices, pandas DataFrames, edge lists, GraphML, and NetworkX. We'll show a few simple examples here. First, a Python program must import the libraries installed by the command above and some that they depend on:
import numpy as np
import pandas
from katana.local import Graph
from katana.local.import_data import (
    from_adjacency_matrix,
    from_edge_list_arrays,
    from_edge_list_dataframe,
    from_edge_list_matrix,
    from_graphml,
)
For most import routines, you pass the data to the appropriate function, which returns a new graph populated with that data. Here is an example with a pandas DataFrame:
katana_graph = from_edge_list_dataframe(
    pandas.DataFrame(dict(source=[10, 20, 30],
                          destination=[1, 2, 0],
                          prop=[1, 2, 3])))
Sample input from an adjacency matrix:
katana_graph = from_adjacency_matrix(
    np.array([[0, 1, 0], [0, 0, 2], [3, 0, 0]]))
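To see what edges such a matrix describes, the following plain-Python sketch converts it to an edge list. It assumes the common convention that a non-zero entry at row i, column j denotes an edge from node i to node j carrying that value; the helper `adjacency_to_edges` is hypothetical and is not how Katana implements `from_adjacency_matrix`.

```python
def adjacency_to_edges(matrix):
    """Return parallel lists (sources, destinations, weights) for non-zero entries.

    Assumption: matrix[i][j] != 0 means an edge i -> j with that value as weight.
    """
    sources, destinations, weights = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                destinations.append(j)
                weights.append(value)
    return sources, destinations, weights


matrix = [[0, 1, 0], [0, 0, 2], [3, 0, 0]]
sources, destinations, weights = adjacency_to_edges(matrix)
print(sources, destinations, weights)
```

Under that convention, the 3x3 matrix above encodes three edges: 0 to 1 with value 1, 1 to 2 with value 2, and 2 to 0 with value 3.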
Sample input from an edge list given as separate source and destination arrays:
katana_graph = from_edge_list_arrays(
    np.array([0, 1, 10]), np.array([1, 2, 0]),
    prop=np.array([1, 2, 3]))
Sample input from a pandas DataFrame, this time with the same node IDs as the edge list above:
katana_graph = from_edge_list_dataframe(
    pandas.DataFrame(dict(source=[0, 1, 10],
                          destination=[1, 2, 0],
                          prop=[1, 2, 3])))
Sample input from GraphML, in which input_file stores a file in the GraphML file format:
katana_graph = from_graphml(input_file)
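For readers unfamiliar with GraphML, the sketch below shows roughly what such a file contains and reads it with Python's standard `xml.etree.ElementTree` module. The sample document, the key id `w`, and the node names are made up for illustration, and this is not how Katana parses GraphML internally.

```python
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical minimal GraphML document: three nodes, three weighted edges.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="w" for="edge" attr.name="weight" attr.type="int"/>
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n0" target="n1"><data key="w">1</data></edge>
    <edge source="n1" target="n2"><data key="w">2</data></edge>
    <edge source="n2" target="n0"><data key="w">3</data></edge>
  </graph>
</graphml>
"""

root = ET.fromstring(SAMPLE)

# Collect node IDs in document order.
nodes = [n.get("id") for n in root.findall("g:graph/g:node", GRAPHML_NS)]

# Collect (source, target, weight) triples from the edge elements.
edges = []
for edge in root.findall("g:graph/g:edge", GRAPHML_NS):
    weight = edge.find("g:data[@key='w']", GRAPHML_NS)
    edges.append((edge.get("source"), edge.get("target"), int(weight.text)))

print(nodes)
print(edges)
```

Writing `SAMPLE` to disk would give a file of the kind that `input_file` refers to above.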
Once the data is loaded, you can run a graph analytics algorithm on it. For example, the betweenness centrality of the input graph can be computed as follows:
import katana.local
from katana.example_utils import get_input
from katana.property_graph import PropertyGraph
from katana.analytics import (
    betweenness_centrality,
    BetweennessCentralityPlan,
    BetweennessCentralityStatistics,
)

katana.local.initialize()

# Name of the property that will hold the computed scores.
property_name = "betweenness_centrality"
betweenness_centrality(katana_graph, property_name, 16,
                       BetweennessCentralityPlan.outer())

# Summarize the scores stored on katana_graph, the graph loaded above.
stats = BetweennessCentralityStatistics(katana_graph, property_name)
print("Minimum Centrality:", stats.min_centrality)
print("Maximum Centrality:", stats.max_centrality)
print("Average Centrality:", stats.average_centrality)
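To clarify what the metric measures, here is a small, sequential sketch of Brandes' algorithm for unweighted, directed graphs in plain Python. It is an illustration under stated assumptions, not Katana's parallel implementation: the function name and the dictionary-based graph are hypothetical, scores are unnormalized, every node is used as a source, and the `16` and plan arguments of the Katana call are not modeled.

```python
from collections import deque


def betweenness_centrality_sketch(adjacency):
    """Unnormalized betweenness centrality for an unweighted directed graph.

    adjacency maps every node to a list of its out-neighbors; every node
    must appear as a key, even if it has no outgoing edges.
    """
    centrality = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Phase 1: BFS from source, counting shortest paths (sigma).
        stack = []
        predecessors = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        sigma[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v is on a shortest path to w
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Phase 2: accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adjacency}
        while stack:
            w = stack.pop()
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    return centrality


# Directed 3-cycle: 0 -> 1 -> 2 -> 0.
scores = betweenness_centrality_sketch({0: [1], 1: [2], 2: [0]})
print("Minimum Centrality:", min(scores.values()))
print("Maximum Centrality:", max(scores.values()))
print("Average Centrality:", sum(scores.values()) / len(scores))
```

In the directed 3-cycle, each node lies on exactly one shortest path between the other two, so every score is 1.0 and the minimum, maximum, and average coincide.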
In addition to the prepackaged routines, data scientists can also write their own graph algorithms using a Python interface that exposes Katana Graph's concurrent data structures and parallel loop constructs via its optimized C++ engine. Future posts will further discuss the Katana Graph API and Metagraph support.
There are two common ways to use the Katana library from the GitHub repository mentioned above in a CMake project. One is to copy the repository into your own CMake project, for example as a git submodule. The other is to install the library outside your project and import it as a CMake package. The Katana CMake package is available through the katana-cpp Conda package, which is a dependency of the katana-python Conda package.
The Katana Graph Intelligence Platform aims to make parallel graph analytics and machine learning algorithms convenient and efficient. Take a look at the complete Katana Graph Python analytics library.