Article 6G6JT Earth mover’s distance

Earth mover’s distance

by
John
from John D. Cook on (#6G6JT)

There are many ways to describe the distance between two probability distributions. The previous two posts looked at using the p-norm to measure the difference between the PDFs and using Kullbach-Leibler divergence. Earth mover's distance (EMD) is yet another approach.

Imagine a probability distribution on ^2 as a pile of dirt. Earth mover's distance measures how different two distributions are by how much work it would take to reshape the pile of dirt representing one distribution into a pile of dirt representing the other distribution. Unlike KL divergence, earth mover's distance is symmetric, and so it really is a distance. (EMD is a colorful name for what is more formally known as the Wasserstein metric.)

The concept of t-closeness in data privacy is based on EMD. Deidentification procedures such as k-anonymity that protect individual privacy may not protect group privacy. t-closeness measures the distribution of values of some attribute in a group and compares this distribution to that of the overall distribution using EMD.

Earth mover's distance is difficult to compute, or even to rigorously define, when working in several dimensions, but in one dimension it is particularly simple. The 1-Wasserman distance between two probability distributions is simply the 1-norm distance between the corresponding CDFs.

emd1.svg
There are p-Wasserstein metrics just as their are p-norms, but the case p = 1 is particularly simple and so we will focus on it for this post.

We can illustrate the univariate Wasserstein metric by returning to a problem in a recent post, namely now to optimally approximate a standard normal by a logisitic distribution.

Logistic distribution example

One of the nice things about the logistic distribution is that its CDF is an elementary function. If X is a logistic distribution with mean 0 and scale s then the CDF is

logistic_cdf.svg

The CDF of a normal distribution has no elementary form but can be written in terms of the complementary error function. If Z is a standard normal random variable, then

normal_cdf_erfc.svg

We get a distance of 0.05926 if we use the value of s = 0.5513 obtained from moment matching here. The optimal value is s = 0.5867, a little smaller than the optimal values of s when minimizing the 1, 2, and norms which were around 0.61.

Related postsThe post Earth mover's distance first appeared on John D. Cook.
External Content
Source RSS or Atom Feed
Feed Location http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title John D. Cook
Feed Link https://www.johndcook.com/blog
Reply 0 comments