Comparing Truncation to Differential Privacy
Traditional methods of data de-identification obscure data values. For example, you might truncate a date to just the year.
Differential privacy obscures query results by injecting enough noise to keep them from revealing information about any individual.
Let's compare two approaches for de-identifying a person's age: truncation and differential privacy.
Truncation
First consider truncating birth date to year. For example, anyone born between January 1, 1955 and December 31, 1955 would be recorded as being born in 1955. This effectively produces a 100% confidence interval that is one year wide.
Next we'll compare this to a 95% confidence interval using ε-differential privacy.
Differential privacy
Differential privacy adds noise in proportion to the sensitivity Δ of a query. Here sensitivity means the maximum impact that one record could have on the result. For example, a query that counts records has sensitivity 1.
Suppose people live to a maximum of 120 years. Then in a database with n records [1], one person's presence in or absence from the database would change the average age by no more than 120/n years, the worst case corresponding to the extremely unlikely event of a database of n-1 newborns and one person 120 years old.
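As a quick sanity check on the 120/n bound, here is a minimal Python sketch of the worst case just described. The function name `worst_case_shift` and the `max_age` parameter are made up for illustration, and the sketch assumes n is at least 2 so the database is non-empty once the oldest person is removed.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def worst_case_shift(n, max_age=120.0):
    """Change in average age when one max_age person joins n-1 newborns.

    Hypothetical helper for illustration; assumes n >= 2.
    """
    if n < 2:
        raise ValueError("need at least two records for this comparison")
    newborns = [0.0] * (n - 1)
    with_person = newborns + [max_age]
    # All newborns have mean age 0, so the shift equals max_age / n.
    return mean(with_person) - mean(newborns)


if __name__ == "__main__":
    for n in (2, 10, 1000):
        print(n, worst_case_shift(n), 120.0 / n)
```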
Laplace mechanism and CIs
The Laplace mechanism implements ε-differential privacy by adding noise with a Laplace(Δ/ε) distribution, which in our example means Laplace(120/(nε)).
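The mechanism can be sketched in a few lines of Python. This is only an illustration under the assumptions of this post (ages bounded by 120, number of records known), not a production implementation. The names `laplace_noise` and `private_mean_age` are hypothetical, and `epsilon` is the privacy parameter. The noise is drawn as the difference of two independent exponential variates, which has a Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent Exp(1) variates is Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_mean_age(ages, epsilon, max_age=120.0, rng=None):
    """Average age plus Laplace noise with scale (max_age / n) / epsilon.

    Hypothetical helper: assumes every age lies in [0, max_age] and that
    the number of records n may be treated as public.
    """
    rng = rng or random.Random()
    n = len(ages)
    sensitivity = max_age / n          # worst-case effect of one record
    true_mean = sum(ages) / n
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    ages = [rng.uniform(0, 90) for _ in range(1000)]
    print("true mean:   ", sum(ages) / len(ages))
    print("noisy answer:", private_mean_age(ages, epsilon=1.0, rng=rng))
```

Sampling the noise from two exponentials is just one convenient standard-library route; a numerical library's Laplace sampler would serve equally well.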
A 95% confidence interval for a Laplace distribution with scale b centered at 0 is
[b log 0.05, -b log 0.05]
which is very nearly
[-3b, 3b].
In our case b = 120/(nε), and so a 95% confidence interval for the noise we add would be [-360/(nε), 360/(nε)].
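To see how close the 3b approximation is, the following Python sketch compares the exact half-width, -b log(1 - coverage) with a natural logarithm, against 3b. The helper names `laplace_half_width` and `noise_scale` are assumptions made for this example, and `epsilon` is the privacy parameter.

```python
import math


def laplace_half_width(b, coverage=0.95):
    """Half-width t such that a Laplace(0, b) draw lies in [-t, t]
    with the given probability: P(|X| > t) = exp(-t / b)."""
    return -b * math.log(1.0 - coverage)


def noise_scale(n, epsilon, max_age=120.0):
    """Laplace scale for the average-age query: sensitivity / epsilon."""
    return max_age / (n * epsilon)


if __name__ == "__main__":
    b = noise_scale(n=1000, epsilon=1.0)
    print("exact half-width: ", laplace_half_width(b))
    print("3b approximation: ", 3 * b)
```

The approximation slightly overstates the interval, since -log 0.05 is a little less than 3.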
When n = 1000 and ε = 1, this means we're adding noise that's usually between -0.36 and 0.36 years, i.e. we know the average age to within about 4 months. But if n = 1, our confidence interval is the true age ± 360 years. Since this is wider than the a priori bounds of [0, 120], we'd truncate our answer to be between 0 and 120. So we could query for the age of an individual, but we'd learn nothing.
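The two cases above can be reproduced with a short Python sketch. It uses the rounded half-width 360/(n × epsilon) from the text and clamps the interval to the a priori range [0, 120]. The function names and the example true value of 40 are hypothetical.

```python
def half_width_years(n, epsilon, max_age=120.0):
    """Approximate 95% half-width of the added noise, using the 3b rule."""
    return 3.0 * max_age / (n * epsilon)


def clamped_interval(true_value, n, epsilon, max_age=120.0):
    """95% interval around true_value, truncated to [0, max_age]."""
    h = half_width_years(n, epsilon, max_age)
    return (max(0.0, true_value - h), min(max_age, true_value + h))


if __name__ == "__main__":
    # Large database: noise of roughly a third of a year.
    h = half_width_years(n=1000, epsilon=1.0)
    print("half-width in years: ", h)
    print("half-width in months:", 12 * h)

    # Single individual: the interval covers the whole a priori range.
    print(clamped_interval(true_value=40.0, n=1, epsilon=1.0))
```

For a single record the clamped interval is the full range of possible ages regardless of the true value, which is the sense in which the query reveals nothing.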
Comparison with truncation
For a single individual (n = 1), the width of our confidence interval is 720/ε, and so to get a confidence interval one year wide, as we get with truncation, we would set ε = 720. Ordinarily ε is much smaller than 720 in applications, say between 1 and 10, which means differential privacy reveals far less information than truncation does.
Even if you truncate age to decade rather than year, this still reveals more information than differential privacy provided ε < 72.
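The break-even values 720 and 72 come from setting the confidence interval width, 2 × 360/(n × epsilon), equal to the width of the truncation bin. A minimal Python sketch, with the hypothetical helper `epsilon_for_width`, makes the arithmetic explicit for a single record (n = 1):

```python
def epsilon_for_width(width_years, n=1, max_age=120.0):
    """Privacy parameter at which the approximate 95% noise interval,
    of total width 2 * 3 * max_age / (n * epsilon), equals width_years."""
    return 2.0 * 3.0 * max_age / (n * width_years)


if __name__ == "__main__":
    print("match truncation to year:  ", epsilon_for_width(1.0))
    print("match truncation to decade:", epsilon_for_width(10.0))
```

Any epsilon below the returned value gives an interval wider than the truncation bin.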
[1] Ordinarily even the number of records in the database is kept private, but we'll assume here that for some reason we know the number of rows a priori.