Compressing a set of hash values

John

from John D. Cook on 2026-01-09 13:01 (#72Q1H)

Suppose you have a set of k hash values, eachn bits long. Can you compress the set into less thankn bits?

It's not possible to compress alist of hashes into less thankn bits, but you can hash aset into fewer bits.

Suppose you have a set of 2³⁰, roughly a billion, 64-bit hashes. Sort the set and look at the size of gaps between elements. You might expect that consecutive items on the list are roughly 2³⁴ apart, and so you could reconstruct the list by reporting the first item and the gaps between the rest, which are 34-bit numbers, not 64-bit numbers, a savings of 30 bits per hash.

This doesn't exactly work, but it's the kernel of a good idea. We don't know that the distance between hashes can be represented by a 34 bit number. The gap could be more or less than 2³⁴, but we don't expect it to often be much more than 2³⁴. So we use a variable-length encoding such that when the distance between values is on the order of 2³⁴, or less, we save bits, but we allow for the distance to be any size.

Applications

What we've described is the essence of Golomb-Rice coding. Its implementation in the Bitcoin protocol is referred to asGolomb-Coded Sets (GCS), described in BIP 158. Golomb-Rice coding is also used other applications where the values to be compressed are not hashes, such as in lossless auto compression.

Encoding

Let's go into some detail as to how the distances between sorted values are represented. Suppose you expect the differences to be on the order of M where M is a power of 2. For each differenced, let q andr be the quotient and remainder byM, i.e.

d =qM +r.

Encode q as a unary number, i.e. string of q 1s, and encode r as an ordinary binary number. Then Golomb-Rice coding of d is the concatenation of the representations ofq andr. with a 0 in the middle as a separator. Using || to denote string concatenation we have

unary(q) || 0 || binary(r).

In general, unary encoding is extremely inefficient, but we're betting that q will typically be quite small.

The reason we require M to be a power of 2 is so the representation of r will have a fixed length[1].

Let's work out an example. Suppose M = 2²⁰ and

d = 2²² + 123456 = 4 * M + 123456.

Then we writeq as 1111 andr as 0011110001001000000 and encoded as the string

111100011110001001000000

Decoding

You can concatenate the encodings of consecutive d values and still be able to unambiguously decode the result. Because the r representations have a constant length, you know when anr ends and the nextq begins.

For example, suppose we have the following string of bits.

1110010011001011001011111001000000110010001110011101111000101111011

We decode the string from left to right. We count how many ones we see, skip over a 0, then regarding the next 20 bits as a binary number.

111 0 01001100101100101111 1 0 01000000110010001110 0 11101111000101111011

We see three 1s before the first 0 and so we conclude q₁ = 3. We skip over the 0 and read the value ofr.

r₁ = 01001100101100101111_two = 314159.

Next we see a single 1 before the next 0, so q₂ = 1. We read the next value ofr as

r₂ = 01000000110010001110_two = 265358.

Next we see a 0, i.e. there are no 1s, and so the final q₃ = 0. And we have

r₃ = 11101111000101111011_two = 979323.

So our reconstructed values ofd are

d₁ = q₁ M+ r₁ = 3 * 2²⁰ + 314159 = 3459887

d₂ = q₂ M+ r₂ = 1 * 2²⁰ + 265358 = 1313934

d₃ = q₃ M+ r₃ = 0 * 2²⁰ + 979323 = 979323.

Now these are difference values. We need to know the smallest valuem in order to construct the original set of values from the differences. Then the full values arem, m + d₁, m + d₁+ d₂, and m + d₁+ d₂+ d₃,.

[1] This is the Rice part. Robert Rice simplified Samuel Golomb's encoding scheme in the special case that M is a power of 2.

The post Compressing a set of hash values first appeared on John D. Cook.

Source	RSS or Atom Feed
Feed Location	http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title	John D. Cook
Feed Link	https://www.johndcook.com/blog