Notes on computing hash functions

by John, from John D. Cook

A secure hash function maps a file to a string of bits in a way that is hard to reverse. Ideally such a function has three properties:

  1. pre-image resistance
  2. collision resistance
  3. second pre-image resistance

Pre-image resistance means that starting from the hash value, it is very difficult to infer what led to that output; it essentially requires a brute force attack, trying many inputs until something hashes to the given value.

Collision resistance means it's extremely unlikely that two files would map to the same hash value, either by accident or by deliberate attack.

Second pre-image resistance is like collision resistance except one file is fixed. A second pre-image attack is harder than a collision attack because the attacker can only vary one file.
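To get a feel for the gap between the two attacks, here is a minimal sketch that attacks a deliberately weakened hash: the first 16 bits of SHA-256. The names toy_hash and BITS are made up for this illustration. With an n-bit hash, a collision search is expected to take on the order of 2^(n/2) tries (the birthday bound), while a second pre-image search is expected to take on the order of 2^n tries.

```python
import hashlib
from itertools import count

BITS = 16  # hypothetical toy digest size, small enough to brute force


def toy_hash(data: bytes) -> int:
    """First BITS bits of SHA-256 as an integer. Deliberately weak."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - BITS)


def find_collision():
    """Hash the inputs "0", "1", "2", ... until any two share a toy hash."""
    seen = {}
    for i in count():
        msg = str(i).encode("ascii")
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1
        seen[h] = msg


def find_second_preimage(fixed: bytes):
    """Hash inputs until one matches the toy hash of one fixed message."""
    target = toy_hash(fixed)
    for i in count():
        msg = str(i).encode("ascii")
        if msg != fixed and toy_hash(msg) == target:
            return msg, i + 1


a, b, collision_tries = find_collision()
other, second_tries = find_second_preimage(b"hello world")
print("collision after", collision_tries, "tries")
print("second pre-image after", second_tries, "tries")
```

The collision search is free to match any pair of inputs it has seen, while the second pre-image search has to hit one specific target, which is why it typically needs far more tries.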

This post explains how to compute hash functions from the Linux command line, from Windows, from Python, and from Mathematica.

Files vs strings

Hash functions are often applied to files. If a web site makes a file available for download, and publishes a hash value, you can compute the hash value yourself after downloading the file to make sure they match. A checksum could let you know if a bit was accidentally flipped in transit, but it's easy to deliberately tamper with files without changing the checksum. A secure hash function makes such tampering infeasible.
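As a rough sketch of that check in Python, the following reads a file in chunks and compares its SHA-256 digest with a published hex string. The function names sha256_of_file and matches_published are my own, not part of any library.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large downloads need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex):
    """Compare the computed digest with a published one, ignoring case."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

You would call matches_published with the path of the downloaded file and the hash value copied from the web site.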

You can think of a file as a string or a string as a file, but the distinction between files and strings may matter in practice. When you save a string to a file, you might implicitly add a newline character to the end, causing the string and its corresponding file to have different hash values. The problem is easy to resolve if you're aware of it.
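Here is a small sketch of the newline problem, assuming the string is saved with Python's print, which appends a newline. The file name f and the temporary directory are only for illustration.

```python
import hashlib
import os
import tempfile

text = "hello world"

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "f")
    # newline="\n" keeps the line ending the same on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        print(text, file=f)  # print adds a trailing newline
    with open(path, "rb") as f:
        file_hash = hashlib.sha256(f.read()).hexdigest()

string_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()

print(file_hash)    # hash of b"hello world\n"
print(string_hash)  # hash of b"hello world"
```

The two values differ because the file contains one more byte than the string.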

Another gotcha is that text encoding matters. You cannot hash text per se; you hash the binary representation of that text. Different representations will lead to different hash values. In the examples below, only Python makes this explicit.
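To illustrate, this sketch hashes the same text under three encodings. The choice of encodings is arbitrary; the point is only that different bytes go in, so different hashes come out.

```python
import hashlib

text = "hello world"

digests = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)  # the bytes that actually get hashed
    digests[encoding] = hashlib.sha256(data).hexdigest()
    print(encoding, len(data), digests[encoding])
```

The text is the same in every case, but the byte strings have different lengths and contents, and so do not share a hash value.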

openssl digest

One way to compute hash values is using openssl. You can give it a file as an argument, or pipe a string to it.

Here's an example creating a file f and computing its SHA256 hash.

 $ echo "hello world" > f
 $ openssl dgst -sha256 f
 SHA256(f)= a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447

We get the same hash value if we pipe the string "hello world" to openssl.

 $ echo "hello world" | openssl dgst -sha256
 a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447

However, echo silently added a newline at the end of our string. To get the hash of "hello world" without this newline, use the -n option.

 $ echo -n "hello world" | openssl dgst -sha256
 b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

To see the list of hash functions openssl supports, use list --digest-commands. Here's what I got, though the output could vary with version.

 $ openssl list --digest-commands
 blake2b512 blake2s256 gost md4 md5 mdc2 rmd160
 sha1 sha224 sha256 sha3-224 sha3-256 sha3-384 sha3-512
 sha384 sha512 sha512-224 sha512-256 shake128 shake256 sm3

A la carte commands

If you're interested in multiple hash functions, openssl has the advantage of handling various hashing algorithms uniformly. But if you're interested in a particular hash function, it may have its own command line utility, such as sha256sum or md5sum. These utilities are not named consistently, however. For example, the utility to compute BLAKE2 hashes is b2sum.

hashalot

The hashalot utility is designed for hashing passphrases. As you type in a string, the characters are not displayed, and the input is hashed without a trailing newline character.

Here's what I get when I type "hello world" at the passphrase prompt below.

 $ hashalot -x sha256
 Enter passphrase:
 b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

The -x option tells hashalot to output hexadecimal rather than binary.

Note that this produces the same output as

 echo -n "hello world" | openssl dgst -sha256
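For comparison, here is a rough Python analog of that behavior, not hashalot's actual implementation. It uses getpass to read a passphrase without echoing it and hashes it without a trailing newline. The function names are hypothetical, and encoding the passphrase as UTF-8 is an assumption on my part.

```python
import getpass
import hashlib


def hash_passphrase(passphrase: str, algorithm: str = "sha256") -> str:
    """Hex digest of a passphrase, assuming UTF-8 and no trailing newline."""
    return hashlib.new(algorithm, passphrase.encode("utf-8")).hexdigest()


def prompt_and_hash() -> str:
    """Prompt without echoing the typed characters, then hash the input."""
    # getpass returns the typed text without the newline from pressing Enter.
    return hash_passphrase(getpass.getpass("Enter passphrase: "))
```

Calling prompt_and_hash() and typing "hello world" should give the same b94d27b9... value as the commands above, since the same bytes are hashed.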

According to the hashalot documentation,

 Supported values for HASHTYPE:
 ripemd160 rmd160 rmd160compat sha256 sha384 sha512

Python hashlib

Python's hashlib library supports several hashing algorithms. And unlike the examples above, it makes the encoding of the input and output explicit.

 import hashlib
 print(hashlib.sha256("hello world".encode('utf-8')).hexdigest())

This produces b94d…cde9 as in the examples above.
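If you want to choose the algorithm by name at run time, much as openssl takes the algorithm as an option, hashlib.new accepts the name as a string. The sketch below hashes the same bytes with a few algorithms; the particular list of names is just my choice for illustration.

```python
import hashlib

data = "hello world".encode("utf-8")

# hashlib.new looks up the algorithm by name instead of by attribute.
digests = {
    name: hashlib.new(name, data).hexdigest()
    for name in ("sha1", "sha256", "sha512", "blake2b")
}

for name, hexdigest in digests.items():
    print(name, hexdigest)
```

The sha256 entry matches the one-line example above, since the input bytes are the same.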

hashlib has two attributes that let you know which algorithms are available. The algorithms_available attribute is the set of hashing algorithms available in your particular instance, and the algorithms_guaranteed attribute is the set of algorithms guaranteed to be available anywhere the library is installed.

Here's what I got on my computer.

 >>> a = hashlib.algorithms_available
 >>> g = hashlib.algorithms_guaranteed
 >>> assert(g.issubset(a))
 >>> g
 {'sha1', 'sha512', 'sha3_224', 'shake_256', 'sha3_256', 'sha256', 'shake_128', 'sha224', 'md5', 'sha384', 'blake2s', 'sha3_512', 'blake2b', 'sha3_384'}
 >>> a.difference(g)
 {'md5-sha1', 'mdc2', 'sha3-384', 'ripemd160', 'blake2s256', 'md4', 'sha3-224', 'whirlpool', 'sha512-256', 'blake2b512', 'sha512-224', 'sm3', 'shake128', 'shake256', 'sha3-512', 'sha3-256'}

Hashing on Windows

Windows has a utility fciv whose name stands for "file checksum integrity verifier". It only supports the broken hashes MD5 and SHA1 [1].

PowerShell has a function Get-FileHash that uses SHA256 by default, but also supports SHA1, SHA384, SHA512, and MD5.

Hashing with Mathematica

Here's our running example, this time in Mathematica.

 Hash["hello world", "SHA256", "HexString"]

This returns b94d…cde9 as above. Other hash algorithms supported by Mathematica: Adler32, CRC32, MD2, MD3, MD4, MD5, RIPEMD160, RIPEMD160SHA256, SHA, SHA256, SHA256SHA256, SHA384, SHA512, SHA3-224, SHA3-256, SHA3-384, SHA3-512, Keccak224, Keccak256, Keccak384, Keccak512, Expression.

Names above that concatenate two names are the composition of the two functions. RIPEMD160SHA256 is included because of its use in Bitcoin. Here "SHA" is SHA-1. "Expression" is a non-secure 64-bit hash used internally by Mathematica.
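To make the idea of composition concrete, here is a sketch in Python. It assumes the outer function is applied to the raw digest bytes of the inner one, which is how Bitcoin does it; I have not verified that Mathematica's composed hashes agree byte for byte. The function names are my own, and ripemd160 may not be available in every Python build, so that case returns None.

```python
import hashlib


def sha256_sha256(data: bytes) -> bytes:
    """SHA-256 applied to the raw SHA-256 digest of the data."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def ripemd160_sha256(data: bytes):
    """RIPEMD-160 of the raw SHA-256 digest, or None if unavailable."""
    inner = hashlib.sha256(data).digest()
    try:
        return hashlib.new("ripemd160", inner).digest()
    except ValueError:
        # Raised when this build of hashlib cannot provide ripemd160.
        return None


print(sha256_sha256(b"hello world").hex())
result = ripemd160_sha256(b"hello world")
print(result.hex() if result is not None else "ripemd160 not available")
```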

Mathematica also supports several output formats besides hexadecimal: Integer, DecimalString, HexStringLittleEndian, Base36String, Base64Encoding, and ByteArray.
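The same digest can be rendered several ways in Python too. This sketch shows rough counterparts of a few of the formats above, starting from the raw bytes; the variable names are mine, and I'm not claiming the strings are formatted exactly as Mathematica would format them.

```python
import base64
import hashlib

digest = hashlib.sha256(b"hello world").digest()  # 32 raw bytes

as_hex = digest.hex()                    # hexadecimal string
as_int = int.from_bytes(digest, "big")   # the digest as one big integer
as_decimal = str(as_int)                 # decimal digits of that integer
as_base64 = base64.b64encode(digest).decode("ascii")

print(as_hex)
print(as_decimal)
print(as_base64)
```

All of these encode the same 256 bits, so you can convert among them without rehashing.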

[1] It's possible to produce MD5 collisions quickly. MD5 remains commonly used, and is fine as a checksum, though it cannot be considered a secure hash function any more.

Google researchers were able to produce SHA1 collisions, but it took over 6,000 CPU years distributed across many machines.
