Squeezing the genome: how to shrink your whole-genome sequence to 4 MB

By dgmacarthur on January 16, 2009.

A new paper in Bioinformatics describes an efficient compression algorithm that allows an individual's complete genome sequence to be compressed down to a vanishingly small amount of data - just 4 megabytes (MB).

The paper takes a similar approach to the process I described in a post back in June last year (sheesh, if only I'd thought to write that up as a paper instead!). I estimated using that approach that the genome could be shrunk down to just 20 MB - compared to about 1.5 GB if you stored the entire sequence as a flat text file - with even further compression if you took advantage of databases of genetic variation like dbSNP. The basis of this compression is the use of a universal reference sequence. Each individual will differ at only a minority of sites (about 0.1%) from this reference, so you can save huge amounts of space simply by not storing the vast majority of the bases where their sequence is the same, and instead just creating a compressed list of the differences.
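To make the idea concrete, here is a minimal sketch of reference-based storage. It is not the paper's algorithm: it assumes the two sequences have the same length (substitutions only, no insertions or deletions), and the function names `diff_against_reference` and `reconstruct` are made up for illustration.

```python
def diff_against_reference(reference: str, individual: str) -> list[tuple[int, str]]:
    """Record (position, base) only where the individual differs from the reference.

    Assumption: equal-length sequences, i.e. substitutions only.
    """
    if len(reference) != len(individual):
        raise ValueError("this sketch handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, individual)) if a != b]


def reconstruct(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Rebuild the individual's sequence from the reference plus the stored differences."""
    bases = list(reference)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)


# Toy 20-base example; a real genome has billions of bases.
reference = "ACGTACGTACGTACGTACGT"
individual = "ACGTACGTACCTACGTACGA"

diffs = diff_against_reference(reference, individual)
print(diffs)  # only the differing sites are stored
assert reconstruct(reference, diffs) == individual
```

Only the list of differences needs to be kept per person; the reference is stored once and shared by everyone.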

The authors of this paper add some further refinements to this approach that I hadn't considered, such as taking advantage of the repetitive nature of the human genome to further compress the sequence of insertions (i.e. areas of the genome that are present in the individual but not in the reference sequence). It's worth noting, however, that the benefit of this tactic will erode over time as the reference sequence becomes steadily more complete, and eventually becomes a montage containing all of the unique sequence found in common insertion variants in the population as a whole. (Then most variations will be deletions relative to the reference rather than insertions, and deletions take up a lot less data.)

While all this is very impressive, such a heroic effort to compress the genome is probably a little excessive given how rapidly digital storage space is growing. From a personal genome point of view, most of us already carry gigabytes of digital storage on our person most of the time, so shrinking sequences down to 4 MB (which comes at the cost of extra time to access the data in that sequence) seems unnecessary - less stringent compression would be fine in most cases. However, I suppose that extreme compression may be useful for organisations that intend to archive extremely large numbers of complete genome sequences (assuming that sequencing costs continue to drop faster than digital storage costs).

And of course there's the whole issue of the need to store sequence quality data. The system in the article works fine for a complete, perfectly accurate genome sequence, but right now no sequencing platform is capable of generating such a sequence - far from it, in fact. It's likely that for the foreseeable future personal genome sequences will contain a mixture of both high- and low-quality sequence, and it will thus be useful to keep them attached to information on the confidence of each called base. That will add at least somewhat to the size of the storage space required.

Still, I imagine this paper was designed as more of an intellectual than a practical exercise. I look forward to the inevitable Netflix Prize-style arms race as competing genome enthusiasts struggle to squeeze out even more extraneous kilobytes over the next few years.


S. Christley, Y. Lu, C. Li, X. Xie (2008). Human genomes as email attachments. Bioinformatics, 25 (2), 274-275. DOI: 10.1093/bioinformatics/btn582


I'm not as up on my DNA as I should be but aren't there only four base pairs? And of those four, each 2 have an affinity for each other.

This reduces the overall number of possible configurations of a base pair in DNA.

That said, compression is pretty much about pattern recognition.

Let's say you have this string:

ABDC CDBA ACBD that repeats several times. In flat form you're occupying 12 bytes but you can tokenize the repeating segment say with a 1 byte code.
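A rough sketch of that kind of tokenizing, assuming the 12-byte pattern with its spaces dropped and a made-up 1-byte code (`TOKEN`) that never appears in the raw data:

```python
SEGMENT = b"ABDCCDBAACBD"  # the 12-byte pattern from the comment, spaces dropped
TOKEN = b"\x01"            # hypothetical 1-byte code; must not occur in the raw data


def tokenize(data: bytes) -> bytes:
    """Replace every occurrence of the repeating segment with the 1-byte token."""
    return data.replace(SEGMENT, TOKEN)


def detokenize(data: bytes) -> bytes:
    """Expand each token back into the full segment."""
    return data.replace(TOKEN, SEGMENT)


raw = SEGMENT * 5 + b"ACGT"  # five repeats plus a short non-repeating tail
packed = tokenize(raw)

print(len(raw), len(packed))  # sizes before and after tokenizing
assert detokenize(packed) == raw
```

The decoder also needs the 12-byte segment itself, so the saving only pays off when the pattern repeats often.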

What I've described is overly simplistic. But I recall that the BASIC interpreter on the first TRS-80 Model I Level II tokenized all the BASIC commands.

Hmm... let's see... (ignoring that whole pesky 10^3 isn't quite 2^10 thing...)

4MB per person... that would be
4GB per 1000 people
4TB per 1 million people
4 Petabytes per Billion people....
so with this technique we could fit the genetic information of 10 BILLION people into 40 Petabytes.
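The same arithmetic as a quick script, using decimal units (1 MB = 10^6 bytes) as the comment does; `archive_bytes` is just an illustrative helper name:

```python
MB = 10**6   # decimal units, ignoring the 10^3 vs 2^10 difference
PB = 10**15
BYTES_PER_GENOME = 4 * MB  # the 4 MB figure from the post


def archive_bytes(people: int) -> int:
    """Total storage needed for `people` compressed genomes."""
    return people * BYTES_PER_GENOME


for people in (10**3, 10**6, 10**9, 10**10):
    print(f"{people:>14,} people -> {archive_bytes(people) / PB:g} PB")
```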

Wikipedia has some pretty cool examples of Petabyte scale data:
http://en.wikipedia.org/wiki/Petabyte

Typical new home drives seem to be in the 1TB range right now... so if hard drives continue to improve commensurate with past improvement, then we'll probably see desktop 40 PB drives in about 15 years... boggles the mind.
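As a sanity check on that estimate: going from 1 TB to 40 PB is a 40,000-fold increase, or a little over 15 doublings, so 15 years works only if capacity doubles roughly once a year. A small sketch (the helper name `doubling_time_years` is made up, and the steady-doubling growth model is an assumption):

```python
import math

TB = 10**12
PB = 10**15


def doubling_time_years(start_bytes: float, target_bytes: float, years: float) -> float:
    """Years per capacity doubling needed to grow from start to target within `years`.

    Assumption: capacity grows by steady doubling over the whole period.
    """
    doublings = math.log2(target_bytes / start_bytes)
    return years / doublings


print(round(math.log2(40 * PB / TB), 1))                   # doublings needed
print(round(doubling_time_years(1 * TB, 40 * PB, 15), 2))  # years per doubling
```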
