Article 6D9M3 Jaccard index and jazz albums

Jaccard index and jazz albums

by
John
from John D. Cook on (#6D9M3)

kindofblue.jpg

Jaccard index is a way of measuring the similarity of sets. The Jaccard index, or Jaccard similarity coefficient, of two sets A and B is the number of elements in their intersection, A B, divided by the number of elements in their union, A B.

jaccard.svg

Jaccard similarity is a robust way to compare things in machine learning, say in clustering algorithms, less sensitive to outliers than other similarity measures such as cosine similarity.

Miles Davis Albums

Here we'll illustrate Jaccard similarity by looking at the personnel on albums by Miles Davis. Specifically, which pair of albums had more similar personnel: Kind of Blue and Round About Midnight, or Bitches Brew and In a Silent Way?

There were four musicians who played on both Kind of Blue and Round About Midnight: Miles Davis, Cannonball Adderly, John Coltrane, and Paul Chambers.

There were six musicians who played on both Bitches Brew and In a Silent Way: Miles Davis, Wayne Shorter, Chick Corea, Dave Holland, and John McLaughlin, Joe Zawinul.

The latter pair of albums had more personnel in common, but they also had more personnel in total.

There were 9 musicians who performed on either Kind of Blue or Round About Midnight. Since 4 played on both albums, the Jaccard index comparing the personnel on the two albums is 4/9.

In a Silent Way and especially Bitches Brew used more musicians. A total of 17 musicians performed on one of these albums, including 6 who were on both. So the Jaccard index is 6/17.

Jaccard distance

Jaccard distance is the complement of Jaccard similarity, i.e.

jaccard2.svg

In our example, the Jaccard distance between Kind of Blue andRound About Midnight is 1 - 4/9 = 0.555. The Jaccard distance between Bitches Brew and In a Silent Way is 1 - 6/17 = 0.647.

Jaccard distance really is a distance. It is clearly a symmetric function of its arguments, unlike Kulback-Liebler divergence, which is not.

The difficulty in establishing that Jaccard distance is a distance function, i.e. a metric, is the triangle inequality. The triangle inequality does hold, though this is not simple to prove.

The post Jaccard index and jazz albums first appeared on John D. Cook.
External Content
Source RSS or Atom Feed
Feed Location http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title John D. Cook
Feed Link https://www.johndcook.com/blog
Reply 0 comments