A recent PLoS Genetics paper triggered a sea change in the way genetic data is handled by research institutions like the NIH, the Broad Institute, and the Wellcome Trust. The paper, which came out last month, demonstrated that it's possible to identify a single individual's DNA in a pool of DNA from thousands of different people - something previously assumed to be about as feasible as finding a needle in a haystack.
Using the cumulative effect of tens of thousands of tiny differences in each individual's DNA (called SNPs), a team led by David Craig was able to determine whether a specific person's DNA was represented in an aggregate of thousands of samples. This means that you could, in principle, comb through the collected data from a genome-wide association study (or GWAS) on a disease such as cancer, and determine whether the individual you are interested in participated in the study - and whether they were in the control group or the affected group. All you would need to do this is a sample of DNA from the person of interest (or the GWAS data based on that person's individual DNA). Close relatives could also be identified, if they donated DNA to the study, because they share many of the same SNPs with the person of interest.
Until now, summed data resulting from pooled samples was considered safe to release to the public in a database, on a website, or as supplemental data to a publication, because it represented SNP frequencies across a very large group of people. It seemed that any one person's contribution to the group would be swamped by the contributions of others. But as the number of SNPs assessed in each genome has increased, giving high-density coverage of tens of thousands of locations across the genome, the number of comparisons the researchers could make between an individual's DNA and the group's pooled DNA also increased - to the point where they could determine whether the person was in the sample or not. The researchers note, "while in hindsight this conclusion seems obvious, it represents a fundamental paradigm shift."
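To make the idea concrete, here is a minimal Python sketch on simulated data. It is loosely modeled on the distance-based statistic described in the Homer et al. paper: for each SNP, ask whether the person's genotype sits closer to the pool's allele frequency or to a reference population's, then combine those tiny per-SNP differences into one t-like score. Everything specific here is an assumption for illustration: the function names, the pool of 50 people, the 20,000 independent SNPs, and the simulated allele frequencies. Real data have genotyping noise, correlated SNPs, and ancestry differences that this toy ignores.

```python
import random
import statistics


def draw_genotype(rng, p):
    """One person's genotype at one SNP, as an allele frequency: 0.0, 0.5 or 1.0."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def simulate(n_snps=20000, pool_size=50, seed=1):
    """Simulate a study pool, a separate reference group, and two test people.

    All sizes and frequencies are made-up values for illustration.
    """
    rng = random.Random(seed)
    # Hypothetical population allele frequency for each SNP.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    def person():
        return [draw_genotype(rng, p) for p in freqs]

    def pooled(people):
        # Per-SNP allele frequency averaged over a group of people.
        return [sum(column) / len(people) for column in zip(*people)]

    pool_people = [person() for _ in range(pool_size)]
    reference_people = [person() for _ in range(pool_size)]
    member = pool_people[0]   # someone who really is in the pool
    outsider = person()       # someone who is in neither group
    return member, outsider, pooled(pool_people), pooled(reference_people)


def membership_score(individual, pool, reference):
    """t-like score: large and positive when the individual is closer to the pool."""
    # Per SNP: distance to the reference minus distance to the pool.
    d = [abs(y - r) - abs(y - m) for y, m, r in zip(individual, pool, reference)]
    return statistics.fmean(d) / (statistics.stdev(d) / len(d) ** 0.5)


if __name__ == "__main__":
    member, outsider, pool, reference = simulate()
    print("pool member score:", round(membership_score(member, pool, reference), 2))
    print("outsider score:   ", round(membership_score(outsider, pool, reference), 2))
```

In a run like this, the pool member's score should land well above zero while the outsider's hovers near it. Shrinking n_snps narrows that gap, which is why the approach only became practical with high-density SNP coverage.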
Although the process is too technically cumbersome for abuse to be easy, geneticists must now acknowledge that an identification could be made, and that technical advances will, over time, render the identification of individuals in these datasets progressively easier. Data collection is accelerating: just a few years ago, only a handful of genetic studies had the power necessary for this approach. But now, GWAS are underway on everything from cancer to smoking to mental illness, looking for genetic linkages that could help generate new treatments. Ideally, all of that data would be available to researchers as quickly as possible - if there weren't privacy concerns to contend with.
Shortly after the paper was published, NIH Director Elias Zerhouni and NHLBI Director Elizabeth Nabel responded in a letter to Science: "This scientific advance may have important implications for forensics and for genome-wide association studies (GWAS). It has also changed our understanding of the risks of making aggregate genetic data publicly available." In light of this new information, Zerhouni and Nabel announced that the NIH would remove aggregate GWAS results from dbGaP, the open-access database used by researchers, and continue to evaluate its GWAS policies.
Now, scientists who want to look for genetic correlations in GWAS datasets held by NIH (or Broad or Wellcome) will have to apply for access to specific datasets. This process will take additional time. David Craig commented in Nature News, "I understand their concerns; they are just being safe. But it could hamper data sharing, which has facilitated so many discoveries." Wellcome's Alan Schafer hopes it won't take more than a week for researchers to gain permission to obtain data from Wellcome, but he acknowledges that if the access policy "remains restrictive we will have to see. It is part of our learning how to protect information."
Craig and his colleagues didn't set out to alarm people about GWAS. In their paper, they frame their strategy in terms of forensic analysis: "Our results show a remarkable ability to identify trace amounts of an individual's DNA within highly complex mixtures. These results further suggest novel forensic applications where the existence of DNA from numerous other individuals currently hampers the ability to identify the presence of any single individual." In short, you could use this technique to determine if a suspect's DNA is represented in a dirty, mixed sample from a scene - or conclusively rule it out. If so, a stumbling block for genetics researchers could turn out to be a boon to cops.
So - should people be concerned about donating DNA to medical research? I don't think so. First, investigators would have to be searching for you specifically, and possess a high-density dataset representing your DNA, in order to use this method. Second, the information they would get is limited: basically, just the fact that you were or were not included in a study on a certain medical condition, and whether you were affected or unaffected by the condition. They wouldn't gain additional genetic information, because remember, they already have a sample of your DNA in the first place! And since the data in these genetic databases is de-identified, the only way to obtain your name, address or other personal information would be to go back to the researchers who ran the study and collected the sample, and obtain it from them (which the researchers would likely resist, since they promise to maintain the anonymity of their subjects). There are much more direct ways to steal personal information - like digging through the garbage, stealing mail, or hacking a Yahoo account.
In short, the risks here seem low. I think most people would agree that the potential benefit of a study advancing treatment of a disease that affects you or a family member is well worth the unlikely chance that someone would later find out you were in the study.
Still, we have to acknowledge that any risk to personal privacy, no matter how low, is real. Recent events like the MySpace patient photo controversy discussed by Dr. Signout last week show how sensitive people are to the idea that their medical information could be shared without their consent or linked to them in any way. Revealing a serious medical condition could have ramifications, such as problems obtaining health insurance or workplace discrimination. That's why the institutions that handle genetic data are treating this news very cautiously - and why it may take geneticists a little longer to get access to genetic data. The oceans of genetic information accumulating from GWAS are, quite simply, uncharted waters. Both scientists and policy-makers are still learning to navigate, and while they may be treading slowly, it's better than falling off the edge.
Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. 2008. PLoS Genetics 4(8): e1000167 doi:10.1371/journal.pgen.1000167
Jennifer Couzin. Whole-Genome Data not Anonymous, Challenging Assumptions. Science, 5 September 2008.
Elias A. Zerhouni and Elizabeth G. Nabel. Protecting Aggregate Genomic Data. Science, 4 September 2008.
Natasha Gilbert. Researchers criticize genetic data restrictions. Nature News, 4 September 2008.
Great post, thanks!
Meow! Another cat out of the bag.
Do you know if GWAS in general are looking to correlate disease (or some other condition) with specific SNPs? Is that their purpose? Or were the researchers maybe looking at, say gene function, and had all this SNP data because they had the genomic sequence?
BioE, that's a fucking great and informative post!
Linda, you've got the right idea - they're looking for correlations between certain SNPs and the disease in question. Basically, if you find a SNP sequence that occurs much more often in someone with a certain disease, as opposed to people of a similar ethnic background who don't have the disease, it's good evidence that that SNP is located in a region of DNA that somehow contributes to the disease.
SNPs vary from person to person with no ill effects, and vary a lot by ethnicity - which is why the ethnicity of the individual is important to consider in these studies. Even if a SNP is found that correlates with disease, that SNP itself could be totally neutral - it may not even be in a gene at all! But the SNP serves as a red flag marking a region of DNA that's worth investigating further. That's when the researchers would look more closely at the genes in that area in the affected individuals. (Sequencing the entire genome of all participants is possible, but simply not necessary - SNPs are faster and more efficient.)
The more SNPs the researchers test, the finer a map of the genome they can make, and the more closely they can narrow down those genetic regions of interest that correlate with the disease. That's why studies have been testing more and more SNPs, until the statistical power got so high that these authors were able to identify individuals from the pooled data.
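For anyone who wants to see what "occurs much more often" looks like in numbers, here is a rough Python sketch of the kind of comparison made at a single SNP: a chi-square test on allele counts in affected versus unaffected people. The function name and the counts are hypothetical, and this is only one simple way to test for association. A real GWAS repeats a test like this at every SNP and has to correct for the huge number of comparisons, which this sketch does not do.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Rows are cases and controls; columns are the two alleles at one SNP.
    Returns the chi-square statistic and its p-value.
    """
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: the allele makes up 30% of case chromosomes
    # but only 20% of control chromosomes.
    chi2, p = allele_chi_square(300, 700, 200, 800)
    print("chi-square:", round(chi2, 2), "p-value:", p)
```

A small p-value at one SNP only flags the surrounding region as worth a closer look. As noted above, the SNP itself may be completely neutral.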
Thanks! This post has me thinking too (I know only a smidge about SNPs and nothing about them in humans) about the models of nucleotide substitution I learned about in a phylogenetics class.
I wonder if certain transitions/transversions (I could never keep them straight) mean different things. I guess, like you said, it's more about the correlations they observe.