Squeezing the genome: how to shrink your whole-genome sequence to 4 MB

A new paper in Bioinformatics describes an efficient compression algorithm that allows an individual's complete genome sequence to be compressed down to a vanishingly small amount of data - just 4 megabytes (MB).

The paper takes a similar approach to the process I described in a post back in June last year (sheesh, if only I'd thought to write that up as a paper instead!). Using that approach, I estimated that the genome could be shrunk down to just 20 MB - compared to about 1.5 GB if you stored the entire sequence as a flat text file - with even further compression possible if you took advantage of databases of genetic variation like dbSNP. The basis of this compression is the use of a universal reference sequence. Each individual differs from this reference at only a small minority of sites (about 0.1%), so you can save huge amounts of space simply by not storing the vast majority of bases where the sequence matches the reference, and instead just keeping a compressed list of the differences.
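To make the idea concrete, here's a minimal sketch in Python - very much not the paper's actual implementation, just the general principle - of reference-based compression for simple single-base differences: record only the positions and bases where the individual departs from the reference, then compress that list.

```python
# Minimal sketch of reference-based compression (not the paper's scheme):
# store only the sites where an individual differs from a shared reference,
# then squeeze the resulting difference list with a generic compressor.
import zlib

def diff_against_reference(reference: str, individual: str):
    """Return (position, alt_base) pairs for single-base mismatches.
    Assumes both sequences are the same length (SNVs only, no indels)."""
    return [(i, b) for i, (r, b) in enumerate(zip(reference, individual)) if r != b]

def encode_diffs(diffs):
    """Serialise the difference list and compress it with zlib."""
    raw = "".join(f"{pos}:{base};" for pos, base in diffs).encode()
    return zlib.compress(raw)

# Toy example: a 40-base "genome" differing from the reference at two sites.
ref = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
ind = "ACGTACGTACGTACTTACGTACGTACGTACGTACGAACGT"
print(encode_diffs(diff_against_reference(ref, ind)))
```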

The authors of this paper add some further refinements to this approach that I hadn't considered, such as taking advantage of the repetitive nature of the human genome to further compress the sequences of insertions (i.e. areas of the genome that are present in the individual but not in the reference sequence). It's worth noting, however, that the benefit of this tactic will erode over time as the reference sequence becomes steadily more complete, and eventually becomes a montage containing all of the unique sequence found in common insertion variants in the population as a whole. (At that point most variants will be deletions relative to the reference rather than insertions, and deletions take up far less space to encode.)
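As a rough illustration of that refinement - again, not the authors' actual scheme - an inserted sequence that also occurs somewhere in the reference can be stored as a cheap pointer to that occurrence rather than as literal bases:

```python
# Rough illustration (not the paper's algorithm): if an inserted sequence
# also occurs in the reference, store a (position, length) back-reference
# to that occurrence instead of the literal bases.
def encode_insertion(inserted: str, reference: str):
    hit = reference.find(inserted)
    if hit >= 0:
        return ("ref_copy", hit, len(inserted))   # cheap pointer into the reference
    return ("literal", inserted)                  # fall back to storing raw bases

ref = "ACGTACGTGGGCCCTTTAAACGT"
print(encode_insertion("GGGCCCTTT", ref))  # ('ref_copy', 8, 9)
print(encode_insertion("TTTTTTTT", ref))   # ('literal', 'TTTTTTTT')
```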

While all this is very impressive, making such a heroic effort to compress the genome is probably a little excessive given how rapidly digital storage space is growing. From a personal genome point of view, most of us already carry gigabytes of digital storage on our person most of the time, so shrinking sequences down to 4 MB (which comes at the cost of adding to the time required to access the data in that sequence) is probably unnecessary - less stringent compression would probably be fine in most cases. However, I suppose that extreme compression may be useful for organisations that intend to archive extremely large numbers of complete genome sequences (assuming that sequencing costs continue to drop faster than digital storage costs).

And of course there's the whole issue of the need to store sequence quality data. The system in the article works fine for a complete, perfectly accurate genome sequence, but right now no sequencing platform is capable of generating such a sequence - far from it, in fact. It's likely that for the foreseeable future personal genome sequences will contain a mixture of both high- and low-quality sequence, and it will thus be useful to keep them attached to information on the confidence of each called base. That will add at least somewhat to the size of the storage space required.
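For a sense of scale - these are my back-of-the-envelope numbers, not the paper's - storing a single Phred-style quality value per base, FASTQ-style, adds gigabytes of raw data before any compression is applied:

```python
# Back-of-the-envelope sketch (my assumptions, not the paper's): one
# Phred-style quality value per base adds far more raw data than the
# compressed variant list itself, though quality strings do compress somewhat.
GENOME_BASES = 3_000_000_000       # roughly one haploid human genome
BYTES_PER_QUALITY = 1              # one ASCII Phred score per base, as in FASTQ

raw_quality_bytes = GENOME_BASES * BYTES_PER_QUALITY
print(f"Uncompressed quality track: ~{raw_quality_bytes / 1e9:.1f} GB")
```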

Still, I imagine this paper was designed as more of an intellectual than a practical exercise. I look forward to the inevitable Netflix Prize-style arms race as competing genome enthusiasts struggle to squeeze out even more extraneous kilobytes over the next few years.


S. Christley, Y. Lu, C. Li, X. Xie (2008). Human genomes as email attachments. Bioinformatics, 25(2), 274-275. DOI: 10.1093/bioinformatics/btn582


I'm not as up on my DNA as I should be, but aren't there only four bases? And of those four, each two have an affinity for each other.

This reduces the overall number of possible configurations of a base pair in DNA.

That said, compression is pretty much about pattern recognition.

Let's say you have this string:

ABDC CDBA ACBD, repeated several times. In flat form you're occupying 12 bytes each time it appears, but you can tokenize the repeating segment with, say, a 1-byte code.

What I've described is overly simplistic, but I recall that the Level II BASIC interpreter on the first TRS-80 Model I tokenized all of the BASIC commands.
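In the spirit of that comment, here's a small illustration (not from the paper) of how having only four bases helps even before any pattern-based compression: each base needs just 2 bits, so four bases pack into one byte.

```python
# Illustration of the commenter's point (not the paper's method): with only
# four bases, 2 bits per base suffice, so 4 bases fit in a single byte.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_2bit(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte (length a multiple of 4)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

print(pack_2bit("ACGTACGT"))   # 8 bases -> 2 bytes
```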

Hmm... let's see... (ignoring that whole pesky 10^3 isn't quite 2^10 thing...)

4 MB per person... that would be
4 GB per 1,000 people
4 TB per 1 million people
4 PB per 1 billion people...
so with this technique we could fit the genetic information of 10 billion people into 40 PB.

Wikipedia has some pretty cool examples of Petabyte scale data:
http://en.wikipedia.org/wiki/Petabyte

Typical new home drives seem to be in the 1 TB range right now... so if hard drives continue to improve at their historical rate, then we'll probably see desktop 40 PB drives in about 15 years... boggles the mind.
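A quick sanity check of that estimate, assuming - and it is only an assumption - that drive capacity keeps doubling roughly once a year:

```python
# Rough check of the commenter's estimate: going from 1 TB to 40 PB needs
# about 15 doublings, so ~15 years if capacity doubles roughly once a year.
import math

growth_factor = 40_000            # 40 PB is 40,000 times 1 TB
print(math.log2(growth_factor))   # ~15.3 doublings
```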