Hashing names does not protect privacy
Secure hash functions are practically impossible to reverse, but only if the input is unrestricted.
If you generate 256 random bits and apply a secure 256-bit hash algorithm, an attacker wanting to recover your input can't do much better than brute force, hashing 256-bit strings hoping to find one that matches your hash value. Even then, the attacker may have found a collision, another string of bits that happens to have the same hash value.
But if you know the input comes from a restricted set, a set small enough to search by brute force, then hash functions are effectively reversible. If I know that you've hashed either "yes" or "no," then I can apply the hash function to both and see which one it was.
Hashing PII

Suppose someone has attempted to anonymize a data set by hashing personally identifying information (PII) such as names, phone numbers, etc. These inputs come from a small enough space that a brute force search is easy.
For instance, suppose someone has applied a cryptographic hash to first names. Then all an attacker needs to do is find a list of common names, hash them all, and see which hash values match. I searched for a list of names and found one: the 1,000 most popular baby girl and boy names in California in 2017.
The data set was compiled based on 473,441 births. Of those births, 366,039 had one of the 2,000 names. That is, 77% of the babies had one of the 1,000 most common names for their sex.
I wrote a little script to read in all the names and compute a SHA256 hash. The program took a fraction of a second to run. With the output of this program, I can't re-identify every first name in a data set, but I could re-identify 77% of them, assuming my list of names is representative [1].
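A minimal sketch of such a lookup table, with a handful of placeholder names standing in for the full list of 2,000:

```python
import hashlib

# A few names standing in for the full list of 2,000 (hypothetical sample).
names = ["EMMA", "NOAH", "OLIVIA", "LIAM", "ACHILLES"]

# Map each SHA-256 hash value back to the name that produced it.
lookup = {hashlib.sha256(name.encode()).hexdigest(): name for name in names}

def deanonymize(hash_value):
    """Return the name behind a hash value, or None if it isn't in our list."""
    return lookup.get(hash_value)
```

Building the table is a one-time cost; after that, each "anonymized" value is recovered with a single dictionary lookup.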
Now, for example, if I see the hash value
96001f228724d6a56db13d147a9080848103cf7d67bf08f53bda5d038777b2e6
in the data set, I can look this value up in my list of 2000 hash values and see that it is the hashed value of ACHILLES [2].
If you saw the hash value above and had no idea where it came from (it could be the hash of a JPEG image file for all you know), it would be hopeless to try to figure out what produced it. But if you suspect it's the hash of a first name, it's trivial to reverse.
Hashing SSNs

Hashing numbers is simpler but more computationally intense. I had to do a little research to find a list of names, but I know that social security numbers are simply 9-digit numbers. There are only a billion possible nine-digit numbers, so it's feasible to hash them all to make a look-up table.
I tried this using a little Python script on an old laptop. In 20 seconds I was able to create a file with the hash values of a million SSNs, so presumably I could have finished the job in about five and a half hours. Of course it would take even less time with more efficient code or a more powerful computer.
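A sketch of the idea, hashing only a small slice of the billion possible SSNs rather than the full range, to keep the run short:

```python
import hashlib

def ssn_table(start, stop):
    """Hash every 9-digit number in [start, stop) and index by hash value."""
    table = {}
    for n in range(start, stop):
        ssn = f"{n:09d}"  # zero-padded, e.g. 000123456
        table[hashlib.sha256(ssn.encode()).hexdigest()] = ssn
    return table

# The full job would be ssn_table(0, 10**9); here a 10,000-entry slice.
table = ssn_table(123_450_000, 123_460_000)
```

The loop is embarrassingly parallel, so the full table could be split across cores or machines.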
Using a key

One way to improve the situation would be to use a key, a random string of bits that you combine with values before hashing. An attacker not knowing your secret key value could not do something as simple as what was described above.
However, in a large data set, such as one from a data breach, an attacker could apply frequency analysis to get some idea how hash values correspond to names. Hash values that show up most frequently in the data probably correspond to popular names etc. This isn't definitive, but it is useful information. You might be able to tell, for example, that someone has a common first name and a rare last name. This could help narrow down the possibilities for identifying someone.
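One standard way to mix a secret key into a hash is HMAC. A sketch, assuming the key stays with the data owner; note that the same input still maps to the same output, which is exactly what frequency analysis exploits:

```python
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)  # secret key, held by the data owner

def keyed_hash(name):
    """HMAC-SHA256 of a name under the secret key."""
    return hmac.new(key, name.encode(), hashlib.sha256).hexdigest()

# Deterministic: repeated names in the data set produce repeated outputs,
# so the distribution of hash values mirrors the distribution of names.
```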
Adding salt

Several people have commented that the problem goes away if you use a unique salt value for each record. In some ways this is true, but in others it is not.
If you do use a unique salt per record, and save the salt with the data, then you can still do a brute force attack. The execution time is now quadratic rather than linear, but still feasible.
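With a stored per-record salt, the attack becomes a loop over records inside a loop over candidate names. A sketch, using a toy candidate list and a hypothetical salting scheme (salt prepended to the name before hashing):

```python
import hashlib
import secrets

names = ["EMMA", "NOAH", "OLIVIA"]  # candidate list (hypothetical sample)

def make_record(name):
    """Simulate a 'protected' record: a stored salt plus a salted hash."""
    salt = secrets.token_hex(8)
    return salt, hashlib.sha256((salt + name).encode()).hexdigest()

records = [make_record(n) for n in ["NOAH", "EMMA"]]

def crack(records, names):
    """For each record, try every candidate name with that record's salt."""
    found = []
    for salt, h in records:
        for name in names:
            if hashlib.sha256((salt + name).encode()).hexdigest() == h:
                found.append(name)
                break
    return found
```

The salt defeats a single precomputed table, but the work per record is still just one pass over the candidate list.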
If you throw the salt away, then you've effectively replaced each identifier with a random string. In that case you've effectively removed these columns, since they're filled with useless random noise. Why not just delete them?
You could use a unique salt per user rather than per record. Then you couldn't identify a given record, but you could tell when two records belong to the same person. But in this case, why not just assign random IDs to each user? Use the salt itself as the ID. No need to use a hash function.
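That last idea needs no hash function at all; a sketch:

```python
import secrets

def assign_ids(users):
    """Map each distinct user to a random 128-bit identifier."""
    return {user: secrets.token_hex(16) for user in set(users)}

# Hypothetical user column from a data set.
column = ["alice", "bob", "alice"]
ids = assign_ids(column)
pseudonymized = [ids[u] for u in column]
# Records for the same user share an ID; nothing links the ID back to the name.
```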
Custom hash functions

Note that the argument above for keys also applies to using a custom hashing algorithm, whether something you write from scratch or a combination of rounds of established methods.
Kerckhoffs's principle advises against relying on security by obscurity. It says that you should assume your algorithm will become known, which often turns out to be the case. But no matter: any hashing scheme that always maps the same input to the same output is vulnerable to frequency analysis.
Related posts

- California's new CCPA and deidentified data
- Comparing truncation to differential privacy
- Data privacy consulting
[1] Name frequencies change over time and space, but I imagine the data from California in 2017 would be adequate to identify most Americans of any age and in any state. Certainly this would not meet the HIPAA standard that "the risk is very small that the information could be used … to identify an individual."
[2] The California baby name data set standardized all names to capital letters and removed diacritical marks. With a little extra effort, one could hash variations such as JOSE, JOSÉ, Jose, and José.