Big aggregate queries can still violate privacy
Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies.
- Data scientists can only query aggregate statistics, such as counts and averages.
- These aggregate statistics must be based on results that return at least 1,000 database rows.
This sounds good, but it's naive. It's not enough to protect customer privacy.
Someone wants to know how much money customer 123456789 makes. If he asks for this person's income, the query would be blocked by the first rule. If he asks for the average income of customers with ID 123456789 then he gets past the first rule but not the second.
He decides to test whether the customer in question makes a six-figure income. So first he queries for the number of customers with income over $100,000. This is a count so it gets past the firs rule. The result turns out to be 14,254, so it gets past the second rule as well. Now he asks how many customers with ID not equal to 123456789 have income over $100,000. This is a valid query as well, and returns 14,253. So by executing only queries that return aggregate statistics on thousands of rows, he found out that customer 123456789 has at least a six-figure income.
Now he goes back and asks for the average income of customers with income over $100,000. Then he asks for the average income of customers with income over $100,000 and with ID not equal to 123456789. With a little algebra, he's able to find customer 123456789's exact income.
You might object that it's cheating to have a clause such as "ID not equal 123456789" in a query. Of course it's cheating. It clearly violates the spirit of the law, but not the letter. You might try to patch the rules by saying you cannot ask questions about a small set, nor about the complement of a small set. [1]
That doesn't work either. Someone could run queries on customers with ID less than or equal to 123456789 and on customers with ID greater than or equal to 123456789. Both these sets and their complements may be large but they let you find out information on an individual.
You may be asking why let a data scientist have access to customer IDs at all. Obviously you wouldn't do that if you wanted to protect customer privacy. The point of this example is that limiting queries to aggregates of large sets is not enough. You can find out information on small sets from information on large sets. This could still be a problem with obvious identifiers removed.
Related:
***
[1] Readers familiar with measure theory might sense a If-algebra lurking in the background.