$1,000 Genomes and Metadata

Matthew Herper rounds up some of the discussion about the decreasing cost of genomics. But one thing that hasn't been discussed much at all is the cost of all of the other things needed to make sense of genomes, like metadata. I briefly touched on this issue previously:

A related issue is metadata--the clinical and other non-genomic data attached to a sequence. Just telling me that a genome came from a human isn't very useful: I want to know something about that human. Was she sick or healthy? And so on. These metadata, too, will have to be standardized: I can't say one genome came from someone who was "sick", while you provide another genome from someone who had "inflammatory bowel disease." Worse, I can't say my patient had IBD, while yours had Crohn's disease. The data fields have to be standardized, so we're not comparing apples and oranges.

But I think this has to be considered much more in depth when thinking about the cost of genomics (as opposed to sequencing).

It isn't that expensive to genotype, and very soon it won't be that expensive to anonymously sequence, the first 30,000 people that left North Station on a given morning. But that wouldn't be very useful. Why? Because we need to know all sorts of characteristics about those people. That's why the various genotyping cohorts--groups of well-characterized subjects--are so critical. Importantly, when someone thinks of a new question (i.e., one that needs metadata that haven't been collected), it's often possible to return to those subjects, so we don't have to sequence or genotype new people.

Collecting metadata hasn't really been considered too costly, since the cost of sequencing/genotyping (and the other needed molecular biology steps) has been far greater than the cost of collecting the relevant clinical information. But as sequencing gets cheaper, collecting that information is going to become relatively more expensive. Before people start ranting about 'regulatory issues', I'm simply talking about the cost of hiring people to collect and manage the metadata. That's not cheap, and would be a substantial cost.

As the discussion over 'missing heritability' indicates, we're not quite ready to 'go full diagnostic' yet (not that there aren't useful things being done; there are). We still need good clinical metadata.

If you're in the advanced metadata class, you might ask, "Why don't we just pull the data from the patients' records?" Well, that's a huge challenge. Medical systems organize their data in different ways, so avoiding the apples-and-oranges problem isn't trivial. Likewise, there would probably be metadata that you wouldn't want to collect--those have to be stripped out. This has been proposed regarding antibiotic resistance, which has much simpler data and metadata, and has come to naught, so I'm not entirely optimistic.
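To make the apples-and-oranges problem concrete, here is a minimal sketch in Python of what harmonizing records from two medical systems might involve. Everything in it is an assumption made for illustration: the field names, the list of fields to strip, and the tiny diagnosis vocabulary are hypothetical and not taken from any real records system or ontology.

```python
# Hypothetical shared vocabulary: free-text diagnosis -> (standard term, parent category).
# These entries are illustrative only, not drawn from a real ontology.
DIAGNOSIS_VOCAB = {
    "ibd": ("inflammatory bowel disease", None),
    "inflammatory bowel disease": ("inflammatory bowel disease", None),
    "crohn's disease": ("Crohn's disease", "inflammatory bowel disease"),
    "crohns": ("Crohn's disease", "inflammatory bowel disease"),
}

# Hypothetical field names used by different record systems, mapped to one standard name.
FIELD_ALIASES = {
    "dx": "diagnosis",
    "diagnosis": "diagnosis",
    "sex": "sex",
    "gender": "sex",
}

# Hypothetical fields we do not want attached to a genome.
STRIP_FIELDS = {"name", "address", "ssn"}


def harmonize(record):
    """Return a copy of `record` with standard field names and diagnosis terms."""
    clean = {}
    for field, value in record.items():
        key = field.strip().lower()
        if key in STRIP_FIELDS:
            continue  # metadata we do not want to collect
        standard_field = FIELD_ALIASES.get(key)
        if standard_field is None:
            continue  # unknown field: drop it rather than guess its meaning
        clean[standard_field] = value

    if "diagnosis" in clean:
        raw = str(clean["diagnosis"]).strip().lower()
        if raw in DIAGNOSIS_VOCAB:
            term, parent = DIAGNOSIS_VOCAB[raw]
            clean["diagnosis"] = term
            # Records can be compared at the level of the broader category.
            clean["diagnosis_category"] = parent or term
        else:
            # Flag vague or unknown terms for a human to review.
            clean["diagnosis"] = "UNMAPPED: " + str(clean["diagnosis"])
    return clean


mine = harmonize({"Dx": "IBD", "name": "Jane Doe", "Gender": "F"})
yours = harmonize({"diagnosis": "Crohn's disease", "SSN": "000", "sex": "F"})
print(mine)
print(yours)
```

In this sketch, my "IBD" patient and your "Crohn's disease" patient end up with the same `diagnosis_category`, so they can at least be compared at that level, while a label like "sick" is flagged rather than silently accepted. The hard and expensive part is not this code but building and maintaining the vocabulary and field mappings across many systems, which is exactly the labor cost discussed above.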

If you want to sequence your genome for your own needs, it could be really cheap. But placing it into context will cost more, maybe a lot more.

This topic (the phenotype) is covered a bit in the book Here Is a Human Being: At the Dawn of Personal Genomics by Misha Angrist.

By NewEnglandBob (not verified) on 18 Jan 2011 #permalink

Interesting post - and I think the issue of metadata will be a problem in the future. It does have to be good, especially in an environment where data sharing, openness and collaboration are being pushed heavily. It does no one any good if my annotations for a genome are in an essentially private language.

And as someone who has done data cleaning, management and maintenance for a living, yes, it's probably going to cost more than the sequence itself.

I am optimistic about information mining on metadata. The Google database is a mess, but there is no problem in retrieving meaningful data. I believe that even if the health care records are unorganized, keyword-based data mining approaches will resolve the issue. My guess is that it will be faster and more accurate than any expert's prediction.

Numerous investigators have lamented 'the missing information.' But they seem not to be taking this seriously.

Information is a real thing and obeys certain laws, including laws of quantitation. It's worth tracing the flow of information between genome and phenotype. In this case, the phenotypic information is emergent, not simply transmitted.

Until people are willing to at least give some thoughtful consideration to what is going on in terms of fundamentals, the rest of this -- massive sequencing -- is just repeating an experiment that hasn't told us what we were anticipating.

It feels as though we were in the early 19th century and madly building steam engines without knowing the rules of thermodynamics, and frustrated that our perpetual motion designs keep not working.

In this case, we are pouring billions of dollars into doing something that hasn't yielded what we hoped -- normally, one amps up an experimental protocol when it WORKS ... one doesn't double down on something that SHOULD have worked, but didn't.

There needs to be more diversity in the lines of research being supported until someone happens on the key. Right now, there are hundreds of GWA analyses generating 'almost statistically significant' linkages between genes and disease, usually with 30 authors or more. It's like witchcraft ... and in the meantime, no one who doesn't take part in this non-thinking process can get support.

Putting the above comments in another perspective ...

Why are we spending hundreds of millions of dollars to try to discern the differences between a melanoma and a melanocyte, using a technology that can barely tell the difference between a chimpanzee and a human?

Time to do a rethink.

Matthew, I absolutely agree. The Genomic Standards Consortium (GSC) was established in 2005 with the aim of creating metadata standards. So far we have covered eukaryotes, prokaryotes, viruses, organelles, plasmids (MIGS), metagenomes (MIMS) and marker sequences (MIMARKS). Patient data has privacy issues and has not (yet) been a focus of the GSC. Much of our effort has gone into creating standards, because having metadata is one thing, but if it is not reported in a standard fashion, its use is limited. Standards compliance is another focus, and we're starting to see more and more buy-in. I'll use this space for a shameless plug for our upcoming meeting at the Wellcome Trust Conference Centre, April 4-6, 2011, in Hinxton (near Cambridge), UK. More information is available at http://tinyurl.com/GSC11meeting. The agenda should be out next week.