Bits of information in age, birthday, and birthdate

by John D. Cook

The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone's age, birthday, and birth date. Combining zip code with birthdate will demonstrate the plausibility of Latanya Sweeney's famous result [1] that 87% of the US population can be identified based on zip code, sex, and birth date.

Birthday

Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn't matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common, or use actual demographic data. It only makes a difference in the 3rd decimal place.
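As a rough check on that figure, here is a minimal sketch that computes the Shannon entropy of a uniform birthday distribution with and without February 29. The helper name `entropy_bits` is mine, and the leap-day weighting assumes a simple four-year cycle, ignoring century rules and actual demographic variation.

```python
import math

def entropy_bits(weights):
    """Shannon entropy in bits of a distribution given by nonnegative weights."""
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

# Uniform over 365 days, ignoring leap days.
uniform_365 = entropy_bits([1] * 365)

# Assumption: over a four-year cycle each ordinary day occurs 4 times
# and February 29 occurs once.
with_leap_day = entropy_bits([4] * 365 + [1])

print(f"{uniform_365:.3f} {with_leap_day:.3f}")
```

Both values round to 8.51 and differ only in the third decimal place.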

Age

I'll be using the following age distribution data found on Wikipedia.

|-----------+------------|
| Age range | Population |
|-----------+------------|
|  0- 4     |   20201362 |
|  5- 9     |   20348657 |
| 10-14     |   20677194 |
| 15-19     |   22040343 |
| 20-24     |   21585999 |
| 25-29     |   21101849 |
| 30-34     |   19962099 |
| 35-39     |   20179642 |
| 40-44     |   20890964 |
| 45-49     |   22708591 |
| 50-54     |   22298125 |
| 55-59     |   19664805 |
| 60-64     |   16817924 |
| 65-69     |   12435263 |
| 70-74     |    9278166 |
| 75-79     |    7317795 |
| 80-84     |    5743327 |
| 85+       |    5493433 |
|-----------+------------|

To get data for each particular age, I'll assume ages are evenly distributed in each group, and I'll assume the 85+ group consists of people from ages 85 to 92. [2]

With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 2^6 = 64.
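The calculation can be sketched as follows, under the assumptions stated above: each group's population is spread evenly over its ages, and the 85+ group covers ages 85 through 92. The names `AGE_GROUPS` and `age_entropy_bits` are only for illustration.

```python
import math

# (first age, last age, population), taken from the table above.
# Assumption: the 85+ group is treated as ages 85 through 92.
AGE_GROUPS = [
    (0, 4, 20201362), (5, 9, 20348657), (10, 14, 20677194),
    (15, 19, 22040343), (20, 24, 21585999), (25, 29, 21101849),
    (30, 34, 19962099), (35, 39, 20179642), (40, 44, 20890964),
    (45, 49, 22708591), (50, 54, 22298125), (55, 59, 19664805),
    (60, 64, 16817924), (65, 69, 12435263), (70, 74, 9278166),
    (75, 79, 7317795), (80, 84, 5743327), (85, 92, 5493433),
]

def age_entropy_bits(groups):
    """Entropy in bits of age in years, spreading each group evenly over its ages."""
    total = sum(pop for _, _, pop in groups)
    bits = 0.0
    for first, last, pop in groups:
        n_ages = last - first + 1
        p = pop / total / n_ages  # probability of each single age in this group
        bits -= n_ages * p * math.log2(p)
    return bits

age_bits = age_entropy_bits(AGE_GROUPS)
print(f"{age_bits:.1f} bits")
```

Changing the upper limit of the last group moves the result only slightly, since that group holds under 2% of the population.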

Birth date

If we assume birthdays are uniformly distributed within each age, then age and birthday are independent. The information contained in the birth date would be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.

Zip code, sex, and age

The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birth date would give a total of 29.7 bits. Since the US population is between 2^28 and 2^29, it's plausible that we'd have enough information to identify everyone.
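The arithmetic here can be checked with a short sketch. The bit counts are the rounded figures from this post, and the population figure is an assumption on my part: it is simply the sum of the age table above. Having more bits than log2 of the population makes unique identification plausible, not guaranteed, since it assumes the fields are independent.

```python
import math

# Rounded bit counts quoted in the text.
zip_bits = 13.8
sex_bits = 1.0
birth_date_bits = 8.5 + 6.4  # birthday + age, assuming independence

total_bits = zip_bits + sex_bits + birth_date_bits

# Assumption: population taken as the sum of the age table above.
population = 308_745_538
needed_bits = math.log2(population)

print(f"available: {total_bits:.1f} bits, needed: {needed_bits:.1f} bits")
```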

We've made a number of simplifying assumptions. We were a little fast and loose with age data, and we've assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date [1]. Her study was based on 1990 census data, and at that time the US population was a little less than 2^28.

***

[1] Latanya Sweeney. "Simple Demographics Often Identify People Uniquely". Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh 2000.

[2] Bob Wells and Mel Tormé. "The Christmas Song." Commonly known as "Chestnuts Roasting on an Open Fire."
