Similarity Metrics for Machine Learning

By: Katana Graph

May 25, 2022

Similarity Metrics for Machine Learning

All recommendation systems require some sort of similarity metric, and the vast majority of recommendation systems employ machine learning techniques that involve similarity calculations. Items or people could be judged as similar by comparing attributes of entities that have been assigned by curators.

A recent post discussed the Jaccard Coefficient and its use in recommender systems. The Jaccard algorithm bases similarity evaluations on the number of elements common to two sets, usually relying on the idea of whether two things being compared are identical or not. For example, the John Doe playing bass in the band is the same John Doe who plays shortstop for the softball team. Likewise, the term “machine learning” appearing in one article is the same term that appears in another. Both examples show identical sameness, despite the logical differences between the two cases.

Jaccard similarity involves a binary comparison (two things are either identical or not) between elements of a set, resulting in a quantitative similarity determination of sets containing them. Other similarity metrics involve degrees of similarity between elements of sets, allowing objects to be more or less similar to each other. Bob and Jill might be judged to be similar because they watched all the same movies, while watching or not watching is a binary comparison.

But Jill loved "Citizen Kane" and Bob thought it was just OK, rating it a 5 out of 10. We could convert all their reviews to either thumbs up or thumbs down — and again calculate their Jaccard similarity — but that would lose valuable distinctions about their tastes. A better approach would be to use a similarity metric that made use of their degree of preference such as Bob and Jill’s numerical ratings of movies.

So, similarity metrics can be based on degrees and magnitudes or on yes/no (equal/unequal) comparisons depending on the need and availability. Further, within these categories, we can choose from a variety of metrics depending on their relative strengths and weaknesses in specific scenarios.

Among metrics that determine similarity by simple membership in sets, like Jaccard, there are situations that confound each. Consider the Jaccard Coefficient for Tim and Ann in the social graph below. Tim is connected to both Bob and Jill; Ann is connected to them also, but not to Tim. The Jaccard Coefficient for Tim and Ann is JC(Tim, Ann) = 2 / 2 = 1.0. This makes sense to us; we think Tim and Ann are similar since they share a social preference for Bob and Jill.

TimAnnBobJill

But now let’s compare Tim and Jill. Their lists of connections include no common members. Tim: {Jill, Bob}. Jill: {Tim, Ann}. JC = 0 / 4 = 0.0.

Logically, something seems wrong in judging that there is no similarity between Tim and Jill. Many would consider Tim and Jill similar on the basis of their social network. In this instance, Jaccard is limited by solely relying on directly connected vertex pairs and a similarity metric based on distances and degrees of separation seems more reasonable.

Distance is the count of edges in the shortest path between two vertices. In the above case, the distance between Tim and Ann is 2, even though there are two different shortest paths, one through Bob and one through Jill. Viewing similarity as a matter of distance returns us to similarity metrics that include scalar values assigned to relationships (Bob rates "Citizen Kane" 5 on a scale of 10) as opposed to mere membership in a set (Tim is connected to Bob). Common metrics of this type include Euclidean Distance, Cosine, and the Pearson Correlation Coefficient, which will be discussed in a later post.

In recommender systems, we typically compare each component with every other component to assign a similarity score to every pair of people. This entails collecting at least one common piece of data for everyone being compared. Popular movie ratings can be converted to scalar values; both IMDb and Rotten Tomatoes use a scale from 1 to 10. The San Francisco Chronicle uses 1 to 5, and Siskel and Ebert used a nine-point scale of 0 to 4 stars including half stars.

A simple way to calculate similarity for ratings in a given range is Euclidean Distance. Consider the movie ratings for the group of Bob, Jill, Tim, and Ann. Imagine that they have all seen both "Citizen Kane" and "The Dark Knight". Using Rotten Tomatoes scoring, we could plot their ratings in Euclidean space as follows:

MovieRatingsChart

In this grid of preference space (note that we don’t call it a graph) it is immediately apparent that Tim and Jill are more similar in their taste for "Dark Knight" and "Citizen Kane" than Ann and Bob are. We can quantify this similarity by computing the Euclidean distance between each person on this plot.

The Euclidean distance between two points in the plot is the length of a diagonal line connecting them. It is the square root of the sum of the squares of the difference between their x values and their y values. In python:

from math import sqrt
distance = sqrt(pow(x2 - x1) + pow(y2 - y1))

Unlike the case with Jaccard similarity and Jaccard difference (dissimilarity), we cannot simply subtract the distance from 1 to get the similarity figure because Euclidean distances are not bounded between 0 and 1. A common means of addressing this problem is to define Euclidean similarity as 1 over the distance. Since a distance can be 0, however, it’s common to add 1 — thereby preventing division by zero — to the distance first, so that the value is always between 0 and 1:

euclidean_similarity = 1 / (1 + sqrt(pow(x2 - x1) + pow(y2 - y1)))

Once similarity is calculated for each pair of movie viewers, we can rank them in order of preference similarity. For a pair of people having similar preferences (similar Euclidean similarity values) for a large number of movies, we can confidently recommend to each of them that they might like any movie liked only by the other.

Euclidean distance is central to many classification methods, such as the k-nearest neighbor (k-nn or KNN) algorithm. But there are cases, text mining being a noted one, where Euclidean distance is a poor indicator of similarity. What if the recommendation system is growing rapidly in popularity and has a large number of young members who have reviewed very few movies? On the other end of the spectrum, a mature system may end up making recommendations that a user is already very familiar with: “hey Ariana Grande fan, you might also like Taylor Swift.” Or, what if a recommender system has millions of users and a large catalog, such as a B2C sales site, each of whom has very few purchases or product ratings? This is the sparse data problem. We’ll discuss alternatives to Euclidean Similarity and ways around some of these common data problems in a future post.

share

Newsletter Sign Up

Rethinking Buyer Behavior Algorithms

To standard traffic analyzers, one click is as good as another. Our impulse purchases and our most.

Read More
Katana Graph’s Analytics Python Library

As businesses grow and face increasing data challenges, they must find ways to tackle more.

Read More
K-Core and K-Truss Algorithms

K-core and k-truss algorithms assist with community search in large graphs and are used to identify.

Read More

View All Resources

Let’s Talk

Turn Your Unmanageable
Data Into Answers

Find out how Katana Graph can help provide the foundation for your future of data-driven innovation.

Contact Sales