Naive compression of genetic data
There are special compression algorithms for genetic sequence data, but I was curious how well simply zipping a text file would work.
I downloaded a 14 MB text file containing DNA sequence data from a fruit fly and compressed it as a zip file and as a 7z file. The result was about 3.5 MB, which is basically no compression beyond the obvious.
The file contains millions of A's, C's, G's, and T's and nothing more [0]. (I prepared the file by removing comments and line breaks.)
AAAATCAATATGTTGCCATT...
There are only four possible characters, so each carries two bits of information [1], but is encoded in an 8-bit ASCII character. The most obvious encoding would pack four letters into a byte. That would compress a 14 MB text file down to a 3.5 MB binary file.
Here are the exact numbers.
|----------+----------+-------|| file | size | ratio||----------+----------+-------|| original | 14401122 | 1.000 || zip | 3875361 | 3.716 || 7z | 3419294 | 4.212 ||----------+----------+-------|
So the 7z format does a little better than simply packing groups of four letters into a byte, but the zip format does not.
There is repetition in genome sequences, but apparently generic compression software is unable to exploit this repetition. You can do much better with compression algorithms designed for genetic data.
UpdateThe plot thickens. Imran Haque kindly let me know that I overlooked the N's in the data. These are placeholders for unresolved bases. You can't simply encode each base using two bits because you need five states.
The number of N's is small-at least in this example, though I imagine they would be more common in lower quality data-and so the calculation in footnote [1] is still approximately correct [2]. There are about two bits of information in each base, but this is on average. Shannon would say you can expect to compress your text file by at least 75%. But you can't simply represent each base with two bits as I suggested because you need to make room for the possibility of a null base.
Note that the total information of a file is not the number of symbols times the information per symbol, unless all the symbols are independent. In the case of genetic data, the symbols are not independent, and so more compression is possible.
***
[0] Or so I thought. See the Update section.
[1] The letter frequencies are not exactly equal, but close: 26.75% A, 23.67% C, 23.94% G, and 25.58% T. The Shannon entropy is 1.998, so essentially two bits.
[2] N's account for 0.04% of the data. Accounting for the N's increases the Shannon entropy to 2.002.
The post Naive compression of genetic data first appeared on John D. Cook.