The empty middle: why no one is average


from John D. Cook on 2016-02-20 18:04 (#14KDZ)

In 1945, a Cleveland newspaper held a contest to find the woman whose measurements were closest to average. This average was based on a study of 15,000 women by Dr. Robert Dickinson and embodied in a statue called Norma by Abram Belskie. Out of 3,864 contestants, no one was average on all nine factors, and fewer than 40 were close to average on five factors. The story of Norma and the Cleveland contest is told in Todd Rose's book The End of Average.

People are not completely described by a handful of numbers. We're much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I'll explain why that's true in general, then look back at the Norma example.

General theory

Suppose you have N points, each described by n independent, standard normal random variables. That is, each point has the form (x₁, x₂, x₃, …, x_n) where each x_i is independent with a normal distribution with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect that most points are piled up near the origin (0, 0, 0, …, 0). In fact most points are in a spherical shell around the origin. Specifically, as n becomes larger, most of the points will be in a thin shell at distance √n from the origin. (More details here.)
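The shell is easy to check numerically. The following is a minimal sketch, not part of the original analysis: the helper name norm_summary, the sample size, and the seed are all arbitrary choices made for illustration. It draws standard normal points in several dimensions and reports how far they fall from the origin.

```python
import numpy as np

def norm_summary(n, N=2000, seed=0):
    """Sample N standard normal points in n dimensions and summarize
    their Euclidean distances from the origin."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal(size=(N, n))
    r = np.linalg.norm(points, axis=1)  # one distance per point
    return r.mean(), r.std(), r.min()

for n in (1, 9, 100, 1000):
    mean_r, std_r, min_r = norm_summary(n)
    print(f"n={n:5d}  sqrt(n)={np.sqrt(n):7.3f}  "
          f"mean={mean_r:7.3f}  std={std_r:.3f}  min={min_r:7.3f}")
```

As n grows, the mean distance should track √n while the spread stays roughly constant, so the shell becomes thin relative to its radius and even the closest sampled point is far from the origin.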

Simulated contest

In the contest above, n = 9, and so we expect most contestants to be at a distance of about √9 = 3 from average once we normalize each of the factors being measured. That is, we subtract the mean from each factor so that it has mean 0, and we divide by its standard deviation so that it has standard deviation 1.
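The normalization step might look like the sketch below. The raw numbers here are synthetic, generated with arbitrary means and spreads purely for illustration. They are not the contest data, and the standardize helper is a hypothetical name.

```python
import numpy as np

def standardize(raw):
    """Subtract each column's mean and divide by its standard deviation."""
    raw = np.asarray(raw, dtype=float)
    return (raw - raw.mean(axis=0)) / raw.std(axis=0)

rng = np.random.default_rng(1)

# Synthetic stand-in for raw measurements: 500 people, 3 factors,
# with made-up means and standard deviations (not the contest data).
raw = rng.normal(loc=[160.0, 60.0, 90.0], scale=[6.0, 8.0, 5.0], size=(500, 3))

z = standardize(raw)
print(z.mean(axis=0))  # each column mean is 0 up to rounding error
print(z.std(axis=0))   # each column standard deviation is 1
```

After this step every factor is on the same scale, so distances from the origin treat all factors equally.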

We've made several simplifying assumptions. For example, we've assumed independence, though presumably some of the factors measured in the contest were correlated. There's also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we'll run with our simplified model just to see how it behaves in a simulation.

    import numpy as np

    # Winning criterion: minimum Euclidean distance
    def euclidean_norm(x):
        return np.linalg.norm(x)

    # Winning criterion: min-max
    def max_norm(x):
        return max(abs(x))

    n = 9
    N = 3864

    # Simulated normalized measurements of contestants
    M = np.random.normal(size=(N, n))

    euclid = np.empty(N)
    maxdev = np.empty(N)
    for i in range(N):
        euclid[i] = euclidean_norm(M[i,:])
        maxdev[i] = max_norm(M[i,:])

    w1 = euclid.argmin()
    w2 = maxdev.argmin()

    print( M[w1,:] )
    print( euclidean_norm(M[w1,:]) )
    print( M[w2,:] )
    print( max_norm(M[w2,:]) )

There are two different winners, depending on how we decide the winner. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were

[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]

corresponding to a Euclidean distance of 0.7808.

If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were

[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]

with the largest deviation being the last, 0.4593.

By either measure, the contestant closest to the average deviated significantly from the average in at least one dimension.
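A back-of-the-envelope calculation under the same simplified model (independent standard normal factors) suggests why. The sketch below computes the expected number of contestants whose nine normalized measurements all fall within a tolerance t of average. The tolerance values and the function name expected_all_within are assumptions chosen for illustration. They do not come from the contest rules.

```python
import math

def expected_all_within(t, n=9, N=3864):
    """Expected number of N contestants with all n independent standard
    normal factors inside [-t, t]."""
    p_one = math.erf(t / math.sqrt(2))  # P(|Z| < t) for one standard normal factor
    return N * p_one ** n               # independence: multiply the n probabilities

for t in (0.25, 0.5, 1.0):
    print(f"t={t:4.2f}  expected contestants: {expected_all_within(t):.4f}")
```

With t = 0.5 the expected count is below one, so in a field of this size it would not be surprising to find nobody within half a standard deviation of average on every factor.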

