Now that I have a good chunk of time where I’m not scheduled to run off to some distant land for vacation or to give some talk, I have decided to work extra hard. Right now I’m incubating my samples. This post is the result of me killing that time.
I want to bring up an article that appeared in WIRED over a month ago. I know, that’s ancient history in the world of blogs, but it’s an idea that pops up once in a while, and it is common among certain young, naive scientists. Let me just quote a passage from the article:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
The big target here isn’t advertising, though. It’s science.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.
And here is the part that I want to focus on:
In short, the more we learn about biology, the further we find ourselves from a model that can explain it.
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
OK, this is just plain naive. Collect tons of random data and out pops … what? Correlations? A Google search? Is this the advent of Deep Thought? If so, I’m afraid that the answer will be as meaningful as 42.
I think that the underlying problem with the whole concept of replacing the scientific establishment with a Google-like data cruncher is that it misunderstands how scientific insight is achieved. I would like to point out two trends in the biological sciences that have produced this Google-induced data-euphoria.
1) Big Biology. If you haven’t noticed, one of the biggest trends in Biology has been what many have called “Big Biology”. This idea has been with us for about a decade and I have to say that the results have been substantially below par.
But first let’s back up and ask the question: what is Big Biology? It’s not biology performed in a big lab, or biology done with a ton of resources (although Big Biology does necessitate a large investment of funds). Big Biology is the act of collecting as much biological data at the molecular level as is possible with our current technology. Examples are the interaction map between every protein encoded by the yeast genome, and the analysis of every possible mating between non-lethal yeast mutants. Often the authors will toss in an “-ome” suffix somewhere to give the study some gravitas, as if the work were a long-lost cousin of the grand poobah of all Big Biology projects, the sequencing of the human genome.
2) Systems Biology. This new concept is the study of how a biological system works as a whole. I once questioned whether this avenue was going to lead us toward useful insights, but I now believe that there is some potential. Unfortunately the field is populated with scientists who think almost like Chris Anderson. They are the Systems Biologists who perform Big Biology. Their motto is that you can just mindlessly data-mine large data sets using microarrays and other large-scale tools and discover correlations (i.e. biological truths). Now most of these scientists are a step ahead of Mr. Anderson: they dive into the data with a theory of what might be there. If a correlation exists, that is a good start. However, I must sadly report that few of these data-mining expeditions have left a strong mark, simply because not much insight has come out of them.
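To see why mindless correlation-hunting is a trap, consider a toy sketch (made-up random numbers standing in for a microarray; nothing here is real data): with thousands of genes and only a handful of samples, strikingly strong correlations appear even in pure noise.

```python
# Toy illustration: hunt for correlations in pure noise.
# With many genes and few samples, "significant" correlations always appear.
import numpy as np

rng = np.random.default_rng(42)
n_genes, n_samples = 1000, 10
expr = rng.normal(size=(n_genes, n_samples))   # fake "microarray" data

corr = np.corrcoef(expr)                        # gene-by-gene correlation matrix
upper = corr[np.triu_indices(n_genes, k=1)]     # each gene pair counted once
hits = np.sum(np.abs(upper) > 0.9)              # pairs that look "co-regulated"
print(f"{hits} gene pairs with |r| > 0.9 -- in data that is pure noise")
```

Run it and you get a healthy crop of gene pairs that look tightly “co-regulated”, despite there being no biology in the numbers at all. Without a model, you can’t tell these apart from the real thing.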
So what have we gained from these data-euphoric approaches to biology?
The big success story is the sequencing of various genomes. These projects gave the rest of the community of biological scientists the necessary information to make our experiments that much easier. We’ll never have to sequence an unknown gene or wonder whether unknown homologs of our favorite protein are lurking in the shadowy depths of chromosome 15. We now know how conserved any one protein or gene is. In addition, we’ve been able to propose different theories about how genomes are organized, how they are used, what parts are conserved, and how different aspects of our genetic information relate to cellular and organismal function. In my own studies I’ve used genomic information to test and support my theories and to generate new models that could then be subjected to further testing.
The rest of data-euphoric biology? It’s a mixed bag. First of all, many of these other Big Biology projects were overhyped. For about five years (roughly 2001-2005), they were all the rage. If you browsed through an issue of Nature or Science from that period you would come across a number of these Big Biology papers (interestingly, Cell did not jump onto the Big Biology bandwagon). We small biologists read, or at least glanced at, these papers and shrugged our shoulders. Did we learn much? First of all, we didn’t know what the data meant. And second, we weren’t sure about the quality. I distinctly remember scanning the results of one paper and finding that all the proteins I was interested in apparently associated with hexokinase, a metabolic enzyme. Gazing at the data further, we noticed that hexokinase bound to about a third of the genome … and we mentally flushed the paper down our mind’s toilet bowl. Does anyone ever reference these works? Presently the data collection seems to be more rigorous, but these papers never generate much insight. They have yet to be useful in my own studies, and they have yet to substantially change any of our ideas about how biological systems work. The impact of these studies will only be felt once the data can be incorporated into models and theories that can be tested further.
I’ll present two examples:
1) ENCODE. The results of this data-mining project were interesting. We found out quite a bit about what might be happening at the genomic level – but only in that the ENCODE data suggested certain models of how the genome was utilized. One prediction from the ENCODE study is that transcription is generally sloppy and that post-transcriptional processes may play a greater role in determining expression patterns. But this theory needs to be tested before the results of ENCODE can impact biology in general. As always, theory generation and testing are key in furthering our knowledge.
My postdoctoral work consisted of analyzing how newly synthesized mRNAs were exported from the nucleus to the cytoplasm. From this work, I predicted that there must be something special about the nucleotide sequences that encode signal sequences (what we call signal sequence coding regions, or SSCRs), as these RNA elements could promote nuclear export all on their own. This was our prediction (i.e. theory, model, etc.).
What to do next? Test it!
In collaboration with Mike Springer, we analyzed the open reading frames from various species and found that vertebrate SSCRs are indeed different: they have a very low A content. Next we constructed an updated form of the theory: the low A content of SSCRs is critical to their ability to promote export. We then tested the theory by introducing silent A mutations into the SSCR and determining whether these altered RNA elements still promoted export. My experiments demonstrated that the model was correct (although some export activity remained, indicating that there may be residual activity).
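If you want the flavor of the two computational steps, here is a minimal sketch (illustrative only, not our actual pipeline; the SSCR below is a made-up sequence, and the real analysis ran over ORF collections from many species): measure the A content of an SSCR, then build a “silent A” mutant by swapping each codon for the synonym carrying the most adenines, which changes the RNA without touching the protein.

```python
# Minimal sketch: A content of a signal sequence coding region (SSCR),
# plus a "silent A" mutant built from synonymous codons.
from collections import defaultdict

# Standard genetic code, packed in the classic TCAG table order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)

def a_content(seq):
    """Fraction of adenines in a nucleotide sequence."""
    return seq.count("A") / len(seq)

def silent_a_mutant(orf):
    """Maximize A content without changing the encoded protein."""
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    return "".join(
        max(SYNONYMS[CODON_TABLE[c]], key=lambda s: s.count("A"))
        for c in codons
    )

# Hypothetical SSCR: the first 13 codons of a secreted protein's ORF.
sscr = "ATGGGCTGGAGCTGCATCATCCTGTTCCTGGTGGCCACC"
print(f"wild-type A content:      {a_content(sscr):.2f}")
print(f"silent-A mutant A content: {a_content(silent_a_mutant(sscr)):.2f}")
```

The genome-wide comparison across species is, in essence, the Big Biology step; mutating a single SSCR and assaying its export is the Small Biology step.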
But you can see how the chain of events worked.
Cell biological assay (Small Biology) => prediction => test the theory using data from Big Biology => prediction => test the theory generated from Small and Big Biology using cell biological assays.
The overall problem with throwing out models and theories is … THAT’S HOW WE UNDERSTAND HOW STUFF WORKS. At the end of the day, I had a tested theory about how mRNA export works. Without a model you have no insight, no deeper understanding. Think about it this way: how do you expect to get insight from Google? Google is only a tool, no more. It can be a useful tool for someone who wants to collect data or learn about new subjects. If it’s a really good tool, and you are a bright individual who can harness its powers, you can even use it to TEST YOUR HYPOTHESIS ABOUT HOW THE WORLD WORKS. If you ask Deep Thought for “some profound answer”, it will just spurt out “42”, but as we all know, without the concepts that form the basis of a theory and an associated question, the answer is meaningless.
But fear not, Big Biologists! Using data gathered from Big Biology, we might be able to construct some theories and models, and then we may even test these ideas. And best of all, you can test the ideas using Small Biology experiments or Big Biology experiments. The notion that the tool will circumvent the process of theory and testing is nonsense.
For other comments on the WIRED article see: evolgen, Stranger Fruit, bbgm, Statistical Modeling, Causal Inference, and Social Science, Zero Intelligence Agents, Good Math, Bad Math, and Maxine Clarke over at Nature Networks.