Measuring Set Similarity with Jaccard Coefficients

By: Katana Graph

May 18, 2022

A Jaccard Coefficient is a measure between 0 and 1 (representing 0 to 100%) that quantifies the similarity between two sample sets. It compares the members of the two sets to see which are shared and which are unique to one set. The higher the coefficient, the more similar the two sets are deemed to be.

The metric was introduced in 1901 by Paul Jaccard, a professor of botany and plant physiology, who was interested in determining the genus and species of plants by categorizing plant specimens and their associated attributes.

There are a number of methods to measure similarity, all of which have their strengths and weaknesses. In the graph computing and analytics space, Jaccard is a common metric in recommender systems, so it is a good place to start.

The Jaccard Coefficient (aka Jaccard index) identifies the similarity between finite sample sets by comparing the size of their intersection with the size of their union. Two sets with identical members have a coefficient of 1, and two sets with no members in common have a coefficient of 0.

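Formally, the Jaccard Coefficient of two sets A and B is defined as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$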

As an example, consider the following two sets of data:

a = [Ann, Bob, Carol, Dean, Elle, Fred]

b = [Ann, Carol, Dean, Elle, Tom, Uma, Will, Zoey]

The Jaccard Coefficient works for datasets with different numbers of members, but using Jaccard to compare sets of significantly different sizes is a topic for a later discussion. For now, we’ll use this simple example. To calculate the Jaccard Coefficient of the two sets, we first count the members shared by both datasets, then divide by the total number of distinct members in the two datasets combined. In set-theory language, the metric is the size of the intersection divided by the size of the union of the two sets.

Number of shared members (intersection): {Ann, Carol, Dean, Elle} = 4

Total distinct members (union): {Ann, Bob, Carol, Dean, Elle, Fred, Tom, Uma, Will, Zoey} = 10

Jaccard Coefficient (similarity) = 4 / 10 = 0.4

It is sometimes more useful to think of the dissimilarity between two data sets, often called the Jaccard Distance. The Jaccard Distance is simply one minus the Jaccard Coefficient, so in this case the Jaccard Distance is 1 - 0.4 = 0.6.
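As a rough illustration of the distance calculation, the sketch below uses only Python's built-in set type. The function names `jaccard_similarity` and `jaccard_distance` are chosen here for illustration, and returning 0.0 similarity for two empty sets is an assumption rather than a fixed convention.

```python
def jaccard_similarity(set1, set2):
    """Return |A intersect B| / |A union B| for two iterables of hashable items."""
    s1, s2 = set(set1), set(set2)
    union = s1 | s2
    if not union:
        # Assumption: two empty sets are given a similarity of 0.0.
        return 0.0
    return len(s1 & s2) / len(union)


def jaccard_distance(set1, set2):
    """Return the dissimilarity: one minus the Jaccard Coefficient."""
    return 1.0 - jaccard_similarity(set1, set2)


a = ["Ann", "Bob", "Carol", "Dean", "Elle", "Fred"]
b = ["Ann", "Carol", "Dean", "Elle", "Tom", "Uma", "Will", "Zoey"]

distance = jaccard_distance(a, b)
print(round(distance, 2))  # 0.6
```

Identical sets give a distance of 0, and sets with nothing in common give a distance of 1.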

To calculate the Jaccard Coefficient using Python, the built-in set operations are all we need. NumPy is imported here as well because the boolean variant further down relies on it.

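import numpy as np  # used by the boolean variant below

# the two example sets from above, as Python lists
a = ["Ann", "Bob", "Carol", "Dean", "Elle", "Fred"]
b = ["Ann", "Carol", "Dean", "Elle", "Tom", "Uma", "Will", "Zoey"]

# define Jaccard similarity function
def jaccard(set1, set2):
    # convert to sets so duplicate entries are counted only once
    set1, set2 = set(set1), set(set2)
    intersection = len(set1 & set2)
    union = len(set1) + len(set2) - intersection
    if union == 0:
        return 0.0  # both sets empty; avoid dividing by zero
    return intersection / union

# calculate Jaccard Coefficient of the two sets
y = jaccard(a, b)
# returns 0.4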

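NumPy’s boolean logic functions offer a compact alternative when each set is encoded as a binary vector of equal length, with one position per possible member and a 1 (or True) wherever that member is present: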

def jaccard_binary(set1, set2):
    # set1 and set2 are equal-length boolean (or 0/1) arrays
    intersection = np.logical_and(set1, set2)
    union = np.logical_or(set1, set2)
    return intersection.sum() / float(union.sum())
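The boolean approach cannot be applied to lists of names directly, so the sets first have to be encoded as membership vectors. The following sketch shows one possible encoding; the helper `to_indicator` and the sorted `universe` list are illustrative names introduced here, not part of NumPy.

```python
import numpy as np

a = ["Ann", "Bob", "Carol", "Dean", "Elle", "Fred"]
b = ["Ann", "Carol", "Dean", "Elle", "Tom", "Uma", "Will", "Zoey"]

# Shared, ordered universe of every name seen in either set.
universe = sorted(set(a) | set(b))


def to_indicator(members, universe):
    """Hypothetical helper: boolean vector marking which universe items are members."""
    members = set(members)
    return np.array([item in members for item in universe], dtype=bool)


vec_a = to_indicator(a, universe)
vec_b = to_indicator(b, universe)

# Same calculation as the boolean Jaccard function, written out step by step.
intersection = np.logical_and(vec_a, vec_b).sum()  # positions set in both vectors
union = np.logical_or(vec_a, vec_b).sum()          # positions set in either vector
similarity = intersection / union
print(similarity)  # 0.4
```

Both vectors must be built over the same universe in the same order; otherwise the element-wise comparison pairs up unrelated members.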

At its very simplest, we might use the Jaccard Coefficient as a basis for cross-recommending products to two sets of buyers, treating a similarity of 0.8 as the minimum that justifies a recommendation. Below that, buyers might feel they are being spammed. In the above example, if set a represents those who bought hammers and set b represents those who bought shovels, the coefficient of 0.4 falls well below that threshold, so we would conclude that there is not enough similarity between hammer and shovel users to warrant recommending shovels to hammer buyers.

There are potential flaws in such thinking, though. To start, tool users and tool buyers might be different sets of people (for example, Father’s Day advertisements are not directed at fathers), and inferring usage from purchases can be an overreach. A safer inference would be that shovel buyers (as opposed to users) are not sufficiently similar to hammer buyers to warrant cross recommendations.
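The threshold rule for hammer and shovel buyers could be sketched as follows. The function `should_cross_recommend` and the constant `SIMILARITY_THRESHOLD` are hypothetical names used only for this example, and 0.8 is simply the threshold suggested above, not a recommended value.

```python
SIMILARITY_THRESHOLD = 0.8  # illustrative cut-off taken from the discussion above


def jaccard(set1, set2):
    """Jaccard Coefficient of two iterables of hashable items."""
    s1, s2 = set(set1), set(set2)
    union = s1 | s2
    # Assumption: two empty buyer sets count as having no similarity.
    return len(s1 & s2) / len(union) if union else 0.0


def should_cross_recommend(buyers_x, buyers_y, threshold=SIMILARITY_THRESHOLD):
    """Return True when the two buyer sets are similar enough to cross-recommend."""
    return jaccard(buyers_x, buyers_y) >= threshold


hammer_buyers = {"Ann", "Bob", "Carol", "Dean", "Elle", "Fred"}
shovel_buyers = {"Ann", "Carol", "Dean", "Elle", "Tom", "Uma", "Will", "Zoey"}

print(should_cross_recommend(hammer_buyers, shovel_buyers))  # False
```

Note that the rule only compares who bought each product; it says nothing about who actually uses it.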

Another common use of similarity metrics is in object detection in images. Object-detection models, which are often built on convolutional neural networks, are commonly evaluated with the Jaccard Coefficient, known in that field as Intersection over Union (IoU). For example, if a computer vision algorithm is tasked with detecting letters and numerals in an image, the Jaccard Coefficient quantifies how closely the region the algorithm predicts for each detected character overlaps the labeled ground-truth region for that character.
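In object detection the Jaccard Coefficient is typically computed over areas rather than discrete members, where it is usually called Intersection over Union (IoU). The sketch below assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples; the function name `box_iou` and the sample coordinates are made up for illustration.

```python
def box_iou(box_a, box_b):
    """Jaccard Coefficient (IoU) of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Assumption: degenerate boxes with zero total area get an IoU of 0.0.
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 4, 4)     # hypothetical detector output
ground_truth = (1, 0, 5, 4)  # hypothetical labeled box

print(box_iou(predicted, ground_truth))  # 0.6
```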

In text mining, similarity metrics are used to quantify the similarity between the subject matter of two documents by comparing the similarity of the non-trivial terms (i.e. not “function words”) used in the documents. In product, film, and music recommendation systems, more complex algorithms derived from the Jaccard Coefficient are used to identify similar customers based on purchases as well as similar products based on their buyers. In a future blog post we’ll look at how and when the Jaccard Coefficient fails to give good results as well as similarity calculations in graphs.
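For the text-mining case, a minimal sketch might compare the sets of non-trivial terms in two short documents. The `FUNCTION_WORDS` list here is a deliberately tiny, hypothetical stand-in for a real stop-word list, and the tokenizer is a simple regular expression rather than a production-grade one.

```python
import re

# Hypothetical, deliberately small list of function words to ignore.
FUNCTION_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"}


def content_terms(text):
    """Return the set of lowercase terms in text, excluding function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {token for token in tokens if token not in FUNCTION_WORDS}


def document_similarity(doc1, doc2):
    """Jaccard Coefficient of the content terms of two documents."""
    terms1, terms2 = content_terms(doc1), content_terms(doc2)
    union = terms1 | terms2
    # Assumption: two documents with no content terms get a similarity of 0.0.
    return len(terms1 & terms2) / len(union) if union else 0.0


doc1 = "The hammer is a tool for driving nails"
doc2 = "A shovel is a tool for digging in the garden"

print(round(document_similarity(doc1, doc2), 3))  # 0.143
```

The two sample sentences share only the term "tool" out of seven distinct content terms, which gives 1 / 7.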

Jaccard similarity is just one of the growing list of analytic algorithms supported by Katana Graph. With a few lines of Python code, you too can find Jaccard Coefficients to compare sets in your data and derive insights based on the similarity values reported by this algorithm.

About Katana Graph

Katana Graph provides solutions to help your business make timely decisions. Interested in what we can do for your business? Contact us to schedule a time to talk.
