Similarity metrics, which quantify how similar two entities are, lie at the core of many important machine learning systems, including recommendation systems, spam filtering, text mining and natural language processing (NLP), computer vision and facial recognition, clustering (for example, customer segment analysis), and pattern recognition in general. Because similarity is a subjective human concept, it has several interpretations, and different ones are put to work for different kinds of machine learning tasks.

In earlier posts on Jaccard and Euclidean similarities, we introduced similarity metrics featured in machine learning techniques used for recommendation systems and text mining. In those posts, we described Euclidean similarity as one means of quantifying the extent to which two entities are similar. We also listed some scenarios where specific similarity metrics give bad or counterintuitive results.

Euclidean similarity worked well for comparing the preferences of movie viewers based on their film ratings. Film ratings are bounded; using Rotten Tomatoes scoring, all viewers normalize their range of affinity for films by constraining their ratings between 1 and 10. On the other hand, the count of occurrences of nouns in various documents is essentially unbounded; it can vary greatly depending on document length. So judging two documents' similarity on the basis of nouns appearing in both — even a great many shared nouns — can be misleading.
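To make the document-length problem concrete, here is a minimal sketch using made-up noun counts. The vocabulary, the counts, and the helper name euclidean_distance are illustrative assumptions, not data from the earlier posts. A short review is compared with a review ten times as long on the same topic, and with a short note on a different topic.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical counts of the nouns ["film", "bank", "actor", "scene"]
short_review = [3, 0, 2, 1]
long_review = [30, 0, 20, 10]   # same noun mix, ten times the length
finance_note = [0, 3, 1, 0]     # different topic, similar length

d_same_topic = euclidean_distance(short_review, long_review)
d_other_topic = euclidean_distance(short_review, finance_note)

print(f"short vs. long review:  {d_same_topic:.2f}")
print(f"short vs. finance note: {d_other_topic:.2f}")
```

Under these assumed counts, the two reviews end up much farther apart than the review and the finance note, purely because one review is longer.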

In cases like text analysis (for example, creating “see also” recommendations), Cosine Similarity is usually a better metric. In the movie rating example used previously, Cosine Similarity measures the angle between the vectors drawn from the graph origin to the points representing each viewer’s pair of ratings for "Citizen Kane" and "Dark Knight".

By this metric, Jill and Ann would be judged more similar than any other pairing of viewers, despite their widely different ratings. Both like "Citizen Kane" about as much as they like "Dark Knight", so their relative preferences are roughly equal. When comparing documents of arbitrary length, Cosine Similarity is preferred because it depends on the relative frequencies of the words common to a pair of documents, not on the raw counts, which grow with document length.
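The contrast between the two metrics can be sketched in a few lines. The ratings below are invented for illustration; they are not the values from the earlier post's chart, and the third viewer is hypothetical. The helper names cosine_similarity and euclidean_distance are likewise assumptions of this sketch.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

# Hypothetical ratings as [Citizen Kane, Dark Knight]
jill = [9.0, 8.5]
ann = [4.0, 3.8]           # rates everything lower, but in the same proportion
third_viewer = [8.5, 3.0]  # rates Citizen Kane like Jill, Dark Knight much lower

for name, other in [("Ann", ann), ("third viewer", third_viewer)]:
    print(f"Jill vs. {name}: "
          f"cosine = {cosine_similarity(jill, other):.3f}, "
          f"Euclidean distance = {euclidean_distance(jill, other):.3f}")
```

With these assumed numbers, Euclidean distance places Jill closer to the third viewer, while Cosine Similarity places her closer to Ann, because only the direction of the rating vector matters to it.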

Cosine similarity, a measure of the cosine of the angle between two vectors, can be calculated as the dot product of the vectors divided by the product of the norms of the vectors:

cosSim(*a*, *b*) = (*a* · *b*) / (Norm(*a*) × Norm(*b*))

In Python, we could calculate Cosine Similarity with a function built from the linear algebra functions in the NumPy module:

    from numpy import dot
    from numpy.linalg import norm

    # Define the Cosine Similarity function
    def cosSim(set1, set2):
        # Dot product divided by the product of the vector norms
        return dot(set1, set2) / (norm(set1) * norm(set2))

    # Calculate the Cosine Similarity of the two vectors
    a = [3, 2, 1, 4]
    b = [1, 3, 8, 2]
    y = cosSim(a, b)

    # y is approximately 0.517

Manually checking our logic with linear algebra:

Dot product of *a* and *b* = 3×1 + 2×3 + 1×8 + 4×2 = 25

Norm(*a*) = sqrt(3² + 2² + 1² + 4²) = sqrt(30) ≈ 5.477

Norm(*b*) = sqrt(1² + 3² + 8² + 2²) = sqrt(78) ≈ 8.832

cosSim(*a*, *b*) = 25 / (5.477 × 8.832) ≈ 0.517

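Python users should also note that SciPy and scikit-learn provide related functions. SciPy's function (scipy.spatial.distance.cosine) returns the cosine distance, which is 1 minus the cosine similarity, so the result must be subtracted from 1 to recover the similarity. The scikit-learn function (sklearn.metrics.pairwise.cosine_similarity) returns the similarity itself, computed pairwise between the rows of 2-D arrays.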
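One detail worth checking when using the SciPy function: scipy.spatial.distance.cosine computes a distance, defined as 1 minus the cosine similarity. The following sketch, which assumes SciPy is installed and reuses the example vectors from above, converts that distance back into a similarity.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([3, 2, 1, 4])
b = np.array([1, 3, 8, 2])

# SciPy returns the cosine *distance*, i.e. 1 - cosine similarity
cos_distance = distance.cosine(a, b)
cos_similarity = 1.0 - cos_distance

print(f"cosine distance:   {cos_distance:.3f}")
print(f"cosine similarity: {cos_similarity:.3f}")
```

The similarity printed here should agree with the 0.517 obtained from the hand-built function.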

Other metrics, such as Minkowski distance, Manhattan distance, and Pearson correlation, exist for certain scenarios, such as when the number of dimensions in the problem space is very large. These metrics are conceptually similar to Euclidean and Cosine similarities and are available in libraries like SciPy.
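To show how closely these metrics are related, here is a small sketch that writes them out by hand with NumPy rather than calling a library. The function names minkowski_distance and pearson_correlation are this sketch's own. Manhattan and Euclidean distance are the Minkowski distance with order 1 and 2, and Pearson correlation is the cosine similarity of the two vectors after each has its mean subtracted.

```python
import numpy as np

a = np.array([3, 2, 1, 4], dtype=float)
b = np.array([1, 3, 8, 2], dtype=float)

def minkowski_distance(u, v, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

def pearson_correlation(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    uc, vc = u - u.mean(), v - v.mean()
    return float(np.dot(uc, vc) / (np.linalg.norm(uc) * np.linalg.norm(vc)))

manhattan = minkowski_distance(a, b, p=1)
euclidean = minkowski_distance(a, b, p=2)
pearson = pearson_correlation(a, b)

print(f"Manhattan: {manhattan:.3f}")
print(f"Euclidean: {euclidean:.3f}")
print(f"Pearson:   {pearson:.3f}")
```

Note that the two distances grow as vectors become less alike, whereas Pearson correlation, like Cosine Similarity, is largest for the most similar vectors.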

Many algorithms perform poorly with sparse data; that is, they do not give results consistent with our intuitions about similarity. In recommendation systems, a vexing problem arises when the system cannot make any inferences for a user who has no prior product purchases, or when a new product with no prior ratings is introduced to the system. This is the *cold start problem*. We’ll look at mitigation strategies in a future post.
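The cold start problem shows up directly in the cosine formula: a user with no ratings is an all-zero vector, its norm is zero, and the division is undefined. The sketch below only illustrates that failure mode; it is not one of the mitigation strategies. The function name safe_cosine_similarity, the choice to return None, and the ratings are all assumptions made for this example.

```python
import math

def safe_cosine_similarity(u, v):
    """Return cosine similarity, or None when either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Undefined: there is no angle to measure against a zero vector
        return None
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

existing_user = [5, 0, 3, 4]  # hypothetical ratings, 0 means "not rated"
new_user = [0, 0, 0, 0]       # no ratings yet

print(safe_cosine_similarity(existing_user, new_user))  # None
print(safe_cosine_similarity(existing_user, [4, 1, 3, 5]))
```

Returning None makes the caller decide what to do with an undefined similarity, rather than silently treating the new user as dissimilar to everyone.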