PGP sequence data disappointing

By dgmacarthur on October 23, 2008.

The promise of release of raw sequence data files from the first 10 Personal Genome Project volunteers certainly caused a media stir (see the round-up by the PGP's own Jason Bobe), but the actual released data are pretty underwhelming.

So far raw sequence data files have been posted on the PGP profile sites of only four of the ten volunteers: George Church, John Halamka, Esther Dyson and James Sherley. The files are the result of targeted resequencing of a proportion (perhaps 20%) of the protein-coding regions of the genome (called exons, collectively the exome). Although a relatively small proportion of the genome as a whole, the exome is highly enriched for functionally important variation, so this small slice of sequence could actually be quite informative about the genetic variants associated with diseases and physical variation.

However, when I downloaded and examined the files my hopes weren't too high - participant Misha Angrist had already warned on his blog that the data release was not the world-changing event that the media might lead you to believe:

I have to say, this whole extravaganza felt more like a walk-through or a dress rehearsal. Several of us did not get our sequence data yesterday and those who did got very rough, low-coverage data.

After checking over the files I agree with Misha about the "rough, low-coverage" bit - the data released so far makes for pretty desultory reading. For instance, here's a snippet from Esther Dyson's file:

@227
nnnnnnnnnnnnnntcttacaggtgtgtttatctatcgatcatcCTCAGAAggtcttaAT
TATGGGTGAAGCTCTTGACCtgggaacctgtaaannnnnnnnnnnnaatggagagCCGTG
CACGCAGACTGTGAattKggtTGGTTTCAgccnnnnnnnnn
@228
nnnnngtgtkgACCTGGCACAGGAATACCCCAGAAGAGCCTGTCTTGCTCTGAGGAGTTC
AAGGAACTGATGGACCTGCCGACGTGTGGAGCCAGGAACTTAAAACAACATTTAGCCAAA
GCCACAGCTTCAGGTACCATCAGctgsttnnnnnn
@229
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnaaaaccagccatcaagtccatc
tcggcctcagcactnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
@230
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnn

(For those who know and care about these things: the sequences are in FASTQ format, and I've stripped out the '+' and quality score lines for clarity.)

You'll note that the file is divided into short snippets of DNA sequence separated by headers (the lines beginning with '@'). There are about 55,000 of these fragments in the files, but a fairly hefty proportion are low-quality sequences with no real genetic information (the 'nnnnn' fragments above).

The files are difficult to interpret at first glance, with no identifiers indicating what each sequence entry represents. However, on closer inspection each entry in the data files appears to represent a single sequenced exon covered by multiple reads from a next-generation sequencing platform. It's difficult to estimate the true coverage of these exons from the data provided, but the overall sequence quality certainly doesn't look great - here's a breakdown for the data from John Halamka as an example:

55,054 exons sequenced in total, with an average length of ~163 bases;
overall, more than half of the targeted bases (56%) have no available sequence data (i.e. are marked 'n');
16,644 (30.2%) of the exons contain absolutely no interpretable sequence (i.e. are entirely 'n');
of the remainder, 15,549 (28.2% of the total) contain 30 or more contiguous 'n's.
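
For anyone who wants to run this kind of breakdown themselves, here's a minimal Python sketch. It is not necessarily how the numbers above were produced. It assumes the input has already been reduced to the form shown in the snippet earlier ('@' header lines followed by one or more sequence lines, with the '+' and quality lines stripped out), and the function names `parse_entries` and `summarise` are my own inventions for illustration.

```python
import re


def parse_entries(text):
    """Split '@'-headed text into {header: sequence}.

    Assumes quality lines are already removed, so any line starting
    with '@' is a header and every other non-blank line is sequence.
    """
    entries = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            name = line[1:]
            entries[name] = ""
        elif name is not None:
            entries[name] += line
    return entries


def summarise(entries, run_length=30):
    """Count entries, 'n' bases, all-'n' entries and long 'n' runs."""
    seqs = [s.lower() for s in entries.values()]
    total_bases = sum(len(s) for s in seqs)
    n_bases = sum(s.count("n") for s in seqs)
    long_run = re.compile("n{%d,}" % run_length)

    # Entries with no interpretable sequence at all.
    all_n = sum(1 for s in seqs if s and set(s) == {"n"})
    # Of the remainder, entries with a long contiguous run of 'n'.
    partial_with_run = sum(
        1 for s in seqs if set(s) != {"n"} and long_run.search(s)
    )
    return {
        "entries": len(seqs),
        "mean_length": total_bases / len(seqs) if seqs else 0.0,
        "fraction_n": n_bases / total_bases if total_bases else 0.0,
        "all_n": all_n,
        "partial_with_long_n_run": partial_with_run,
    }


if __name__ == "__main__":
    sample = "@1\nacgtnn\nACGT\n@2\nnnnn\nnn\n@3\nac" + "n" * 30 + "gt\n"
    print(summarise(parse_entries(sample)))
```

Upper-case and lower-case bases are counted alike here; the files don't explain what the case difference means, so the sketch makes no attempt to interpret it.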

In total, these data provide actual sequence information for just 0.13% or so of the total genome - but even that fraction is likely to be a serious over-estimate, since many of the bases that have been called will have low coverage and thus be unreliable. Determining the reliability of each base call using the provided quality scores is not straightforward, but I'll be doing my best to sort this out over the next few days.
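
The 0.13% figure can be checked with some back-of-the-envelope arithmetic from the Halamka numbers above. The sketch below is only a rough check: the genome size of about 3.1 billion bases is an assumption that isn't stated in the post, the ~163-base average length is itself approximate, and the function name is hypothetical.

```python
def called_percent_of_genome(
    exons=55_054,        # entries in the Halamka file
    mean_length=163,     # approximate average entry length in bases
    fraction_n=0.56,     # share of targeted bases marked 'n'
    genome_size=3.1e9,   # ASSUMPTION: rough haploid human genome size
):
    """Percentage of the genome with an actual (non-'n') base call."""
    targeted_bases = exons * mean_length
    called_bases = targeted_bases * (1 - fraction_n)
    return 100 * called_bases / genome_size


if __name__ == "__main__":
    print(f"{called_percent_of_genome():.2f}% of the genome has a base call")
```

With these inputs the result rounds to 0.13%, and it stays at about 0.13% if you use 3.0 billion bases for the genome instead.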

To be fair, the files are clearly marked "Preliminary exon data", and it's still very early days for the PGP - I would expect to see a dramatic increase in the volume and quality of the sequence data released over the coming months. However, given the hype surrounding this data release I'm a little disappointed by the data themselves. Can genome sequence data really be said to be publicly available when they're dumped on the web in a flat text file without any gene annotation or explanation of their format, making them useless for anyone other than bioinformaticians?

We're in a crucial window of time here: the PGP (along with high-profile companies like 23andMe) is blazing the trail for the whole field of personal genomics, and the world is watching. I hope that future data releases from the PGP can give the world something to genuinely get excited about.


Well, c'mon Daniel. Sequence data isn't worth anything without the bioinformatics. ;-)

I've also mentioned this on
http://seqanswers.com/forums/showthread.php?p=1934

I've written a tool to recognize rs# sequences in the fasta, but it needs assembled fasta, not these short reads. I may try to do my own assembly, but at the moment time is too tight, and I expect it won't be all that cheap even with the amazon cloud.

What I'm disappointed by is the lack of the microarray data release. That has been completed since January 2008 as evidenced by
http://geekdoctor.blogspot.com/2008/01/personalized-medicine.html

I'd like to prepare promethease reports for all of the PGP10, but none of that data has been released publicly except for 2 of them which were given to me directly.

Hi Daniel -

Thanks for the critical eye about the data. We're in the process of posting more information about the PGP data quality goals, how the preliminary data stack up, why we felt it was necessary and important to release preliminary data, and QC methods (e.g. concordance across data sets, samples, etc).

Cariaso: We'll be posting genotyping data soon too.

Thanks,
Jason Bobe

Hi Jason,

Sounds good - but it would have been nice to emphasise how preliminary the data were before they were released publicly. I know it's hard to control what the media decide to report or ignore, but posting a definitive statement on the PGP website about exactly what was being sequenced, using what method and at what coverage would have been extremely helpful.

According to the UCSC genome browser, read number 228 in the post contains a 100% identical match to a coding exon for gene MCM10 on chromosome 10. But I agree that overall this preliminary level of data quality is not sufficient for any scientific inference.
