The first 10 participants of the ground-breaking Personal Genome Project (PGP) will be receiving a hefty chunk of data today: the sequence of the protein-coding regions from many of their genes (collectively known as the "exome"). And if all goes according to plan, they'll soon be dumping all of that data on the web for anyone to access.
The PGP is an audacious endeavour led by Harvard's George Church (recently profiled in Wired). The ultimate goal of the Project is to sequence the entire genomes of 100,000 volunteers, and release both genetic and medical data from those volunteers to the research community - and indeed to anyone else who wants to view them. The Project has drawn both acclaim and criticism from the genetics community, with much of the criticism being directed at its unusual concept of genetic privacy - essentially, the Project's leaders argue that the reality of modern genomics means that the concept of patient anonymity no longer applies.
As a first step the PGP has been releasing information from its first 10 volunteers. When it comes to the understanding of genetic information they're an impressively well-informed group, including Church himself, entrepreneur Esther Dyson, linguist Steven Pinker, and academic and blogger Misha Angrist. Public profiles for the PGP 10 - including some fairly sensitive medical information - have been up on the PGP website for some time, but thus far there have been no genetic data attached to the profiles. That's set to change soon, so long as the participants don't suffer a last-minute case of cold feet and decide to keep their information out of the public domain.
Apparently the information being released to the PGP 10 today consists of around 20% of each volunteer's exome, a total of less than 1% of a complete genome sequence - but with the promise of much more to come. Ultimately, the PGP aims to provide complete genome sequences for all of its volunteers, which will become more and more feasible as the cost of DNA sequencing continues to plummet.
For genetic voyeurs, the identities associated with each of the public profiles (which are currently indicated by number alone) have been worked out via some internet sleuthing by Blaine Bettinger. Presumably the genetic data - when it's finally released - will be accessible via the same profiles.
There won't be any major medical breakthroughs from analysis of the PGP 10 data, but this is a tremendous first step in the direction of personalised medicine. It's also an important experiment to see whether the noble open-access model of the PGP can survive contact with reality. As Church notes in the NY Times article: "We don't yet know the consequences of having one's genome out in the open. But it's worth exploring."
Anyone who's interested in getting their genome sequenced by the PGP - and sharing the resulting information with the world - should consider registering for inclusion in the next phase of the Project.
Presumably many more people who work in genetics and other biomedical fields, and who are therefore able to give highly informed consent, can register for this next phase.
Perhaps a strong effort to register by faculty, postdocs, grad students, and even undergrads - those of us who study genetics and related fields - would help raise the project's visibility.
It looks like a lot of post-processing will be necessary to make sense of the raw data.
I took a quick peek at the data file for participant PGP1. It has 55,000 records in FASTQ format, without any annotations at all, not even the name of the gene. The raw sequences range in length from a few dozen to a few hundred bases, and many records have no usable data at all.
I put the sequence data from the first record through BLAST. It is a perfect match for a zinc finger protein, also found in the chimpanzee, so I guess that particular record doesn't reveal much about its owner :)
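For anyone who wants to repeat this kind of quick look at a FASTQ file, here is a minimal Python sketch using only the standard library. It assumes the simple four-line FASTQ layout (header, sequence, `+` separator, quality string) and treats a record as having "no usable data" when its sequence contains no A, C, G or T calls; that definition, the function names and the toy reads are all assumptions made for illustration, not anything taken from the PGP files themselves.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes no wrapped sequences)."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        header = next(stripped, None)
        if header is None:
            return  # end of input
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        sequence = next(stripped, "")
        separator = next(stripped, "")
        quality = next(stripped, "")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator after {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def summarize(records: Iterable[FastqRecord]) -> dict:
    """Count records and report the length range of the usable ones.

    Assumption: a record is "unusable" if it has no A/C/G/T calls at all
    (for example, an empty sequence or one made up entirely of N).
    """
    total = 0
    unusable = 0
    lengths = []
    for record in records:
        total += 1
        called = sum(1 for base in record.sequence.upper() if base in "ACGT")
        if called == 0:
            unusable += 1
        else:
            lengths.append(len(record.sequence))
    return {
        "records": total,
        "unusable": unusable,
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }


# Toy input standing in for a real data file (these reads are made up).
example = """@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
NNNN
+
!!!!
@read3
ACG
+
III
""".splitlines()

summary = summarize(read_fastq(example))
print(summary)
```

To run it on a real file, pass an open file handle to `read_fastq` in place of the `example` list. The sequences themselves still need to be aligned (with BLAST or similar) before they say anything about which genes they come from.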
Great timing - I just posted about the PGP sequence data here.
Assuming that the PGP don't release more processed data shortly, I'll run some alignments and see what I can find - but as I say in my post, the coverage is so low that these data are unlikely to be particularly informative by themselves.