# Measuring Set Similarity with Jaccard Coefficients

May 18, 2022

A Jaccard Coefficient is a measure between 0 and 1 (representing 0 to 100%) that quantifies the similarity between two sample sets. It compares the members of each set to see which are shared and which are unique to each set; the higher the coefficient, the more similar the two sets are deemed to be.

The metric was introduced in 1901 by Paul Jaccard, a professor of botany and plant physiology, who was interested in determining the genus and species of plants by categorizing plant specimens and their associated attributes.

There are a number of methods to measure similarity, all of which have their strengths and weaknesses. In the graph computing and analytics space, Jaccard is a common metric in recommender systems, so it is a good place to start.

The Jaccard Coefficient (aka Jaccard index) identifies the similarity between finite sample sets by measuring the size of intersections and unions of two discrete populations or sample sets. Datasets sharing identical members would have a coefficient of 1 and datasets with no shared common members would have a coefficient of 0.

Formally, the Jaccard Coefficient of two sets A and B is defined as the size of their intersection divided by the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

As an example, consider the following two sets of data:

a = [Ann, Bob, Carol, Dean, Elle, Fred]

b = [Ann, Carol, Dean, Elle, Tom, Uma, Will, Zoey]

The Jaccard Coefficient works for datasets with different numbers of members, but using Jaccard to compare sets with significantly differing volumes of elements is a topic for a later discussion. For now, we’ll use this simple example. To calculate the Jaccard Coefficient of the two sets, we first count the number of members shared by both datasets, then divide by the total number of distinct members in both datasets combined. In set-theory language, the metric is the size of the intersection divided by the size of the union of the two sets.

Number of shared members (intersection): {Ann, Carol, Dean, Elle} = 4

Total distinct members (union): {Ann, Bob, Carol, Dean, Elle, Fred, Tom, Uma, Will, Zoey} = 10

Jaccard Coefficient (similarity) = 4 / 10 = 0.4

It is sometimes more useful to think of the dissimilarity between two data sets, often called the Jaccard Distance. The Jaccard Distance is simply one minus the similarity—that is, one minus the Jaccard Coefficient, so in this case, the Jaccard Distance is 0.6.
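As a quick sketch, both quantities can be computed directly with Python’s built-in set operators:

```python
a = {"Ann", "Bob", "Carol", "Dean", "Elle", "Fred"}
b = {"Ann", "Carol", "Dean", "Elle", "Tom", "Uma", "Will", "Zoey"}

similarity = len(a & b) / len(a | b)  # intersection size / union size
distance = 1 - similarity

print(similarity)  # 0.4
print(distance)    # 0.6
```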

To calculate the Jaccard Coefficient in Python, we can write a short function using the language’s built-in set operations.

```python
# define the two example sets
a = ["Ann", "Bob", "Carol", "Dean", "Elle", "Fred"]
b = ["Ann", "Carol", "Dean", "Elle", "Tom", "Uma", "Will", "Zoey"]

# define Jaccard Similarity function
def jaccard(set1, set2):
    intersection = len(set(set1).intersection(set2))
    union = len(set(set1)) + len(set(set2)) - intersection
    return intersection / union

# calculate Jaccard Coefficient of the two sets
y = jaccard(a, b)
# returns 0.4
```

This function operates on explicit sets. If the data are instead encoded as binary membership vectors (a 1 or 0 for each possible member), we can use NumPy’s boolean logic functions:

```python
import numpy as np

def jaccard_binary(vec1, vec2):
    # element-wise AND marks shared members; OR marks members present in either vector
    intersection = np.logical_and(vec1, vec2)
    union = np.logical_or(vec1, vec2)
    return intersection.sum() / union.sum()
```
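As a self-contained sketch, here is how the example sets can be encoded as binary membership vectors over their combined vocabulary and compared this way (the encoding step is our illustration, not part of the original example):

```python
import numpy as np

a = {"Ann", "Bob", "Carol", "Dean", "Elle", "Fred"}
b = {"Ann", "Carol", "Dean", "Elle", "Tom", "Uma", "Will", "Zoey"}

def jaccard_binary(vec1, vec2):
    intersection = np.logical_and(vec1, vec2)
    union = np.logical_or(vec1, vec2)
    return intersection.sum() / union.sum()

# build a shared vocabulary and encode each set as 1/0 membership flags
members = sorted(a | b)
vec_a = np.array([name in a for name in members])
vec_b = np.array([name in b for name in members])

print(jaccard_binary(vec_a, vec_b))  # 0.4
```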

At its very simplest, we might use the Jaccard Coefficient as a basis for cross-recommending products between two sets of buyers, setting a similarity threshold of 0.8 as the minimum meaningful similarity for a valid recommendation. Below that, buyers might feel they are being spammed. In the above example, if set a represents those who bought hammers and set b represents those who bought shovels, we would conclude that there is not enough similarity between hammer and shovel buyers to warrant recommending shovels to hammer buyers.

There are potential flaws in such thinking, though. To start, tool users and tool buyers might be different sets of people (for example, Father’s Day advertisements are not directed at fathers), and inferring usage from purchases can be an overreach. A safer inference would be that shovel buyers (as opposed to users) are not sufficiently similar to hammer buyers to warrant cross recommendations.

Another common use of similarity metrics is in object detection in images. Convolutional neural networks, commonly used in image identification applications, use the Jaccard Coefficient as a means of quantifying the accuracy of object detection. For example, if a computer vision algorithm is tasked with detecting letters and numerals in an image, the Jaccard Coefficient quantifies the overlap between the region the algorithm identifies for each detected character and the corresponding ground-truth region from a labeled dataset.
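In that setting the coefficient is often called Intersection over Union (IoU), computed over pixel regions rather than sets of members. A minimal sketch for axis-aligned bounding boxes (the coordinates below are made-up illustrations):

```python
def iou(box1, box2):
    # boxes given as (x1, y1, x2, y2) corner coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # overlap area is zero when the boxes do not intersect
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

# a predicted box vs. a ground-truth box, overlapping on half their area
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.333
```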

In text mining, similarity metrics are used to quantify the similarity between the subject matter of two documents by comparing the similarity of the non-trivial terms (i.e. not “function words”) used in the documents. In product, film, and music recommendation systems, more complex algorithms derived from the Jaccard Coefficient are used to identify similar customers based on purchases as well as similar products based on their buyers. In a future blog post we’ll look at how and when the Jaccard Coefficient fails to give good results as well as similarity calculations in graphs.
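A rough sketch of such term-based comparison, assuming a tiny invented stop-word list and two made-up sentences:

```python
# a trivial stand-in for a real function-word list
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def term_jaccard(doc1, doc2):
    # compare the sets of non-trivial lowercase terms in each document
    terms1 = {w for w in doc1.lower().split() if w not in STOP_WORDS}
    terms2 = {w for w in doc2.lower().split() if w not in STOP_WORDS}
    return len(terms1 & terms2) / len(terms1 | terms2)

score = term_jaccard("the cat sat in the garden",
                     "a cat slept in a garden")
print(score)  # 0.5 -- {cat, garden} shared out of {cat, sat, slept, garden}
```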

Jaccard similarity is just one of the growing list of analytic algorithms supported by Katana Graph. With a few lines of Python code, you too can find Jaccard Coefficients to compare sets in your data and derive insights based on the similarity values reported by this algorithm.
