Will data-euphoria kill scientific theory?

Now that I have a good chunk of time where I'm not scheduled to run off to some distant land for vacation or to give some talk, I have decided to work extra hard. Right now I'm incubating my samples. This post is the result of me killing that time.

I want to bring up an article that appeared in WIRED over a month ago. I know, that's ancient history in the world of blogs, but it's an idea that pops up once in a while and is common among certain young, naive scientists. Let me just quote a passage from the article:

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn't advertising, though. It's science.

....

But faced with massive data, this approach to science -- hypothesize, model, test -- is becoming obsolete.

And here is the part that I want to focus on:

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

OK, this is just plain naive. Collect tons of random data and out pops ... what? Correlations? A Google search? Is this the advent of Deep Thought? If so, I'm afraid that the answer will be as meaningful as 42.

I think that the underlying problem with the whole concept of replacing the scientific establishment with a Google-like data cruncher is that it misunderstands how scientific insight is achieved. I would like to point out two trends in the biological sciences that have produced this Google-induced data-euphoria.

1) Big Biology. If you haven't noticed, one of the biggest trends in Biology has been what many have called "Big Biology". This idea has been with us for about a decade and I have to say that the results have been substantially below par.

But first let's back up and ask the question: what is Big Biology? It's not biology performed in a big lab, or biology done with a ton of resources (although Big Biology does necessitate a large investment of funds). Big Biology is the act of collecting as much biological data at the molecular level as is possible with our current technology. Examples are the interaction map between every protein encoded by the yeast genome, and the analysis of every possible mating between non-lethal yeast mutants. Often the authors will toss in an "-ome" suffix somewhere to give the study some gravitas, as if the work were a long-lost cousin of the grand poobah of all Big Biology projects, the sequencing of the human genome.

2) Systems Biology. This new concept is the study of how a biological system works as a whole. I once questioned whether this avenue was going to lead us toward useful insights, but I now believe that there is some potential. Unfortunately the field is populated with scientists who think almost like Chris Anderson. They are the Systems Biologists who perform Big Biology. Their motto is that you can just mindlessly data-mine large data sets, using microarrays and other large-scale tools, and discover correlations (i.e. biological truths). Now most of these scientists are a step ahead of Mr. Anderson: they dive into the data with a theory of what might be there, and if a correlation exists, that is a good start. However I must sadly report that few of these data-mining expeditions have left strong marks, simply because not much insight has come out of them.

So what have we gained from these data-euphoric approaches to biology?

The big success story is the sequencing of various genomes. These projects gave the rest of the community of biological scientists the necessary information to make our experiments that much easier. We'll never have to sequence an unknown gene or wonder whether there may be unknown homologs of our favorite protein lurking in the shadowy depths of chromosome 15. We now know how conserved any given protein or gene is. In addition we've been able to propose different theories with respect to how genomes are organized, how they are used, what parts are conserved and how different aspects of our genetic information relate to cellular and organismal function. In my own studies I've used genomic information to test and support my theories and to generate new models that could then be subjected to further testing.

The rest of data-euphoric biology? It's a mixed bag. First of all, many of these other Big Biology projects were overhyped. For about five years (roughly 2001-2005), they were all the rage. If you browsed through an issue of Nature or Science from that period you would come across a number of these Big Biology papers (interestingly, Cell did not jump onto the Big Biology bandwagon). We small biologists read, or at least glanced at, these papers and shrugged our shoulders. Did we learn much? First of all, we didn't know what the data meant. And second, we weren't sure about the quality. I distinctly remember scanning the results of one paper and finding that all the proteins I was interested in apparently associated with hexokinase, a metabolic enzyme. Gazing at the data further, we noticed that hexokinase bound to about a third of the proteome ... and we mentally flushed the paper down our mind's toilet bowl. Does anyone ever reference these works? These days the data collection seems to be more rigorous, but these papers never generate much insight. They have yet to be useful in my own studies, and they have yet to substantially change any of our ideas of how biological systems work. The impact of these studies will only be felt once the data can be incorporated into models and theories that can be tested further.

I'll present two examples:

1) ENCODE. The results of this data-mining project were interesting. We found out quite a bit about what might be happening at the genomic level - but only in the sense that the ENCODE data suggest certain models of how the genome is utilized. One prediction from the ENCODE study is that transcription is generally sloppy and that post-transcriptional processes may play a greater role in determining expression patterns. But this theory needs to be tested before the results of ENCODE can impact biology in general. As always, theory generation and testing are key to furthering our knowledge.

2) My postdoctoral work consisted of analyzing how newly synthesized mRNAs are exported from the nucleus to the cytoplasm. From this work, I predicted that there must be something special about the nucleotide sequences that encode signal sequences (what we call signal sequence coding regions, or SSCRs), as these RNA elements could promote nuclear export all on their own. This was our prediction (i.e. theory, model, etc.).

What to do next? Test it!

In collaboration with Mike Springer, we analyzed the open reading frames from various species and found that vertebrate SSCRs are indeed different: they have a very low A content. Next we constructed an updated form of the theory: the low A content of SSCRs is critical to their ability to promote export. We then tested this by introducing silent A mutations into the SSCR and determining whether these altered RNA elements still promoted export. My experiments demonstrated that the model was correct (although some export activity remained, indicating that there may be residual activity).
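
(For the curious, here is a minimal sketch of the flavor of that first analysis, in Python. The 25-codon window and the two toy ORFs are placeholders I've invented for illustration; the real analysis with Mike Springer ran over annotated ORFs from many species.)

    # Estimate the A content of a putative signal sequence coding region
    # (SSCR) at the start of an open reading frame. The 25-codon window
    # and the example ORFs below are illustrative assumptions only.

    def sscr_a_content(orf, n_codons=25):
        """Fraction of adenines in the first n_codons of an ORF."""
        sscr = orf[:3 * n_codons]
        return sscr.count("A") / len(sscr)

    # Hypothetical ORFs: a leucine-rich signal sequence versus a
    # lysine/glutamate-rich cytosolic N-terminus.
    orfs = {
        "secreted": "ATGCTGCTGCTGCTGTGGCTGCTGTCGGGC" * 3,
        "cytosolic": "ATGAAAGCAAAAGATAACGCAAAAGAAGCA" * 3,
    }

    for name, orf in orfs.items():
        print(name, round(sscr_a_content(orf), 2))  # secreted ~0.04, cytosolic ~0.6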

But you can see how the chain of events worked.

Cell biological assay (Small Biology) => prediction => test the theory using data from Big Biology => prediction => test the theory generated from Small and Big Biology using cell biological assays.

The overall problem with throwing out models and theories is ... THAT'S HOW WE UNDERSTAND HOW STUFF WORKS. At the end of the day, I had a tested theory about how mRNA export works. Without a model you have no insight, no deeper understanding. Think about it this way: how do you expect to get insight from Google? Google is only a tool, no more. It can be a useful tool for someone who wants to collect data or learn about new subjects. If it's a really good tool, and you are a bright individual who can harness its powers, you can even use it to TEST YOUR HYPOTHESIS ABOUT HOW THE WORLD WORKS. If you ask Deep Thought for "some profound answer", it will just spit out "42", but as we all know, without the concepts that form the basis of a theory and an associated question, the answer is meaningless.

But fear not, big biologists! Using data gathered from Big Biology, we might be able to construct some theories and models, and then we may even test these ideas. And best of all, you can test the idea using Small Biology experiments or Big Biology experiments. The notion that the tool will circumvent the process of theory and testing is nonsense.

For other comments on the WIRED article see: evolgen, Stranger Fruit, bbgm, Statistical Modeling, Causal Inference, and Social Science, Zero Intelligence Agents, Good Math, Bad Math, and Maxine Clarke over at Nature Networks.

Yeah, it's a total load of shit. It's the rhetorical refuge of people with a metric fuckload of data, but no way to make sense of it.

However, I would like to distinguish between science that is not "hypothesis-driven"--which I think is absolutely fine--and science that is bereft of models--which I think is not.

I think there's utility--and there's good stuff in there. Kinda like having the Census. You gotta know what's out there, and having a count is nice. But data for data's sake isn't enough.

The key step is getting the data into the hands of the people who can do the closer look at it, based on their areas of interest, and use the data effectively. That is a piece that is often missing. Many large-scale data producers/projects (and software developers) assume that if they put out the data and the software the bench biologists will just show up and go wild. You know, the "miracle happens here" step.

We are trying to bridge that. And sometimes it works. It is actually very cool when it does--you can see in the training room that someone realizes we just offered them a treasure chest. They still have to dig some, but at least they see what's in the pile and are starting to think about ways to go after it.

You have a point. Certain scientific endeavours, for example much of structural biology, are not "hypothesis-driven". Theory might be used to construct a scientific program, but all good science generates new theories or advances our pre-existing theories. Of course these new/updated theories spawn new experiments and so on.

I agree overall, being a small biologist :-) who tried to hop onto that bandwagon to some extent. The transcriptomes we examined are practically unpublishable today (we dawdled too long). But (a) luckily I am tenured and have a very small group, and (b) they are great, and nearly limitless, hypothesis generators. So test, test, test - and bring it full circle. Right on.

The great--and terrible--thing about almost any form of data analysis is that you WILL get an answer. The problem for us in our own research is knowing which tool to use to answer the question we're working on, and the problem in evaluating others' research is to decide whether their results mean what the authors claim, or indeed mean anything at all.

There is also the tendency for Big Biology people to think that if you're not doing Big Biology, you're just a stamp collector who couldn't hack it with the big boys. Those of us with an attention span longer than an NSF funding cycle can remember several other It Fields that were going to put us out of business. We're still here. So are people in the former It Fields, but they're our colleagues, not our masters. It makes me wonder if academics are destined to keep swooning for the newest/latest/greatest forever, and why it happens in the first place. Media hype? Fads in funding? Simple human love of novelty?

Anyway, nice post.

By Victorian Scientist (not verified) on 27 Jul 2008 #permalink

LOL - 'brute force' science. Who'd have thought?

Those of us with an attention span longer than an NSF funding cycle can remember several other It Fields that were going to put us out of business. We're still here.

True dat!

So are people in the former It Fields, but they're our colleagues, not our masters.

A lot of them are either technicians or salespeople. I laugh my ass off when I see short-sighted institutions filling entire departments with "systems biologists" who are just glorified code jockeys and wouldn't know a pipetman from an electrode.

These people are gonna be about as relevant in five years as the people who were experts at purifying enzymes became in the late 1980s.

There is also the tendency for Big Biology people to think that if you're not doing Big Biology, you're just a stamp collector who couldn't hack it with the big boys.

Please distinguish what we think from what the nitwits at Wired think.

That article was very provocative, so it was successful, but it was so ignorant about how theories are used in science that it made me cringe. Even the big-science biology data storms aren't the result of "not having a model to explain it" but rather the result of finally having a good model to target data gathering in areas that might be useful (using the model). So it is just backwards: experimental science is MORE dependent on scientific models in "big science" than ever before. Good theoretical frameworks are the only way to target data gathering so that hypotheses can actually be applied to it. How would one even decide what categories to use to get correlations out of a blob of data? The models, of course.

To give an example from another field, one currently set to generate the most data of any science project ever: the Large Hadron Collider will be gathering petabytes of data and more over time. One ought to get tons of great correlations that could be used to make a new theory. Yet the data retrieval process itself is entirely bound up in very detailed Standard Model projections of what should be there. If the theory is wrong, the LHC could very well miss something. Physicists worry about that - there have been loud arguments about what model the detectors should use, and there will be more.
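
(To caricature the point in a few lines of Python - the energy window below is invented, and real triggers are vastly more sophisticated, but the logic is the same: the trigger is a model-derived filter, and events outside the predicted window never even reach disk.)

    # A model-driven trigger in miniature: only events matching the
    # model's predicted signature are ever recorded.

    def trigger(energy, lo=100.0, hi=200.0):
        """Keep an event only if it falls inside the model-predicted window."""
        return lo <= energy <= hi

    events = [50.0, 120.0, 180.0, 950.0]  # arbitrary made-up "energies"
    recorded = [e for e in events if trigger(e)]
    print(recorded)  # the 950.0 outlier -- perhaps new physics -- is never stored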

The current level of understanding in biology, physics and chemistry, along with storage and processing technology, is, I think, finally allowing the application of theories to really large data sets. It is confidence in the models that underlies the ability to gather and understand monstrous blobs of data - not the other way around. IMHO of course.

This notion of Big Biology or Systems Biology and its apparent utility has gained a serious foothold in universities over the last 15 years, with the human genome project as the poster child. Your juxtaposition of 'Big Biology' against the more traditional modes of scientific inquiry is intriguing. We have a saying in our lab: "if you're out of ideas, do a screen." Thus, I'm not entirely sure that they're in opposition at this point; it's more of a plan B - unless everyone is out of ideas, which may be the case. Along those lines, I definitely hold a certain level of disgust for the lack of insight and critical thinking that goes into these 'non-biased' screens or -omic approaches. Case in point: after I saw Sir Paul Nurse's recent contribution to Nature, his piece "Life, Logic and Information," I really felt that Paul had lost his mind, gone forever the way of the university administrator, seduced by the data-euphoria that you mention. He's either out of ideas or is now a data-euphoria junkie.

Is this our ultimate fate as scientists?

"SYSTEMS" biology is a buzzword, a glib and crass marketing term for a set of approaches that underpin much of modern biology and have for a century. What I find most offensive about it is that this particular piece of bullshit doublespeak is used *overtly* to bury the proud history our field. By pretending that these approaches are new, we imply that our foebears were ignorant of them. That we are smarter than they were. That what they did did not lead to "real" understanding. That what they did does not matter, and that you all need not trouble your pretty little heads with it. (Particularly if you're thinking about donating money to us.) Much of the prattle about "systems" biology is pro-hype, pro-marketing, ahistorical, anti-intellectual, and anti-sholarship.

It's disgusting, and it disappoints me to see some people whom I've known for some time and who really *do* know better spearheading the charge.

ANYONE who tells you that getting physics and engineering approaches into biology is new is either abysmally ignorant of the field's history, trying to sell you a cartload of warmed-over horseshit, or (most likely) both. The aggressive use of mathematical modeling? Go back to the turn of the century. Not the 21st century. The 20th. Morgan, Muller and their intellectual heirs. Look up Kimura. (If you don't know who he is, kindly shut the f-bomb up about "systems biology.") Physics and chemistry? Schroedinger, Pauling, Perutz, Huxley, Hodgkin, Huxley, Crick, Benzer, Boxer, Neher, Sakmann, Hille, Chiu, Ashkin, Berg... The list goes on and on.

What is new is high-throughput biology. Assembly-line biology. Factory biology. This is a function of real advances and it offers real opportunities for framing and testing new hypotheses. But call it what it is, or you're a fool, a charlatan, or worse.

Two final points, concerning the most important advance in basic biology in the last decade: the recognition that RNA-based regulation is central to most eukaryotic biology. The first point is that none of the "systems" approaches led in a meaningful way to that breakthrough (though high-throughput and bioinformatic approaches certainly have been useful subsequently). The second point is that RNA-based regulation had been exhaustively documented in prokaryotic systems for almost two decades before the work of Fire and coworkers. To this day, the eukaryotic folks still generally don't cite that work.

We bury our own history at our peril.

By George Smiley (not verified) on 28 Jul 2008 #permalink

There is also the tendency for Big Biology people to think that if you're not doing Big Biology, you're just a stamp collector who couldn't hack it with the big boys.

Please distinguish what we think from what the nitwits at Wired think.

Hey, I said 'tendency', not Iron Rule. If you're a Big Bio person who doesn't think that way, good for you. But if you think that attitude doesn't exist in Big Bio, talk to some 'small' biologists.

My dean is a Big Bio person and in this college, if you're not doing something that costs half a mil to get started on, you're a nobody. Which is why I'm posting anonymously; I can't afford to express this at work. What George Smiley described in the comment above is the reality that some of us have to live under.

By Victorian Scientist (not verified) on 28 Jul 2008 #permalink

George,

I totally agree: Systems Biology, in the sense of modeling how a biological system is set up, isn't really a new idea. And I also like this:

What is new is high-throughput biology. Assembly-line biology. Factory biology. This is a function of real advances and it offers real opportunities for framing and testing new hypotheses. But call it what it is, or you're a fool, a charlatan, or worse.

Yes, Big Biology = Factory Biology. I think that it can be useful in certain circumstances, but so far it's been overhyped. And what upsets me is when it is done mindlessly for the sole act of acquiring data (aka non-hypothesis-driven research). To think that this brainless factory biology will replace small biology and the scientific method is idiotic.

Data are a necessary but not sufficient component of science. Thus, the pendulum will continue swinging from data to theory and (through predictions of theory) to data.

With the ENCODE release, the orthodoxy of Genomics was blown to pieces ("the concept of the gene is a myth") - resulting in ENCODE architect Francis Collins' call that "the scientific community will have to re-think long-held beliefs".

"Re-thinking fundamentals", however, is not as easy (even if US science establishment would be theory-friendly, which it is not). To get a second grant, you have to be sure that you deliver on the first one. Thus, if you promise to kill n-number of cats (shorthand for producing data) you can relax, since you have "job security" in mediocre science (a contradiction in terms).

The pendulum swings back, nonetheless, for two cardinal reasons.

1) The data-deluge is not only extremely expensive, but the inability to process the data is very embarrassing (though "data will not kill theories").

2) Advanced theories do emerge (not necessarily from US-educated scientists), and they not only kill old theories but reveal that much of the data-deluge is quite unnecessary (and thus wasteful) compared to going after the experimental predictions of theories built on a novel set of axioms, after the dignified removal of dogmas.

For an example, see The Principle of Recursive Genome Function

pellionisz_at_junkdna.com

I'm not sure that ENCODE demonstrated that the "concept of the gene is a myth". At best, what ENCODE demonstrated is that transcription is sloppy and that there are lots of DNA regulatory elements in the genome. If anything it suggests that gene expression is more dependent on RNA metabolism than previously thought.

Let me add something here with regard to big overarching theories. The most important theories being developed right now in biology are on the micro scale, such as "how are proteins pumped into the ER", "how is the dorsal-ventral axis gradient established in development", "how does actin polymerization drive cell locomotion" or "how is gene expression regulated" ... these types of theories require intense scrutiny of data. They are highly mechanistic, almost down to the atomic level. Right now the best science in molecular/cell biology is an intense cooperation between data and theory, one playing off the other. Pick up any issue of Cell, Nature or Science and you'll see where all the progress is being made. The big splashes are not coming from pure theorists or brainless data factories.

And as far as I know, there is no difference between US- and foreign-trained scientists. I'm not sure what you are talking about.

The gross simplifications of decades ago have fallen away as the gross simplifications that they always were. That isn't a surprise to me, probably not to most scientists, perhaps it is a surprise to those who believed the hype. There were those who believed that simply by looking at DNA sequences the scales would fall from our eyes and we would understand everything. They were wrong, breathtakingly wrong. Their belief was based on hubris, not understanding.

There were some who didn't want to sequence the non-gene DNA to save money and because it was "junk" anyway. Good thing those people were not listened to. Because the non-gene DNA was sequenced, we were able to find out that some of it is more conserved than are any genes; conserved over 500 million years. Maybe someday we will understand what those "ultra-conserved sequences" actually do.

http://www.biomedcentral.com/1471-2164/5/99

Biology is many orders of magnitude more complicated than physics. Systems of just five coupled non-linear parameters can be chaotic and inherently unpredictable. What happens with tens, or hundreds, or thousands, or tens of thousands? No model can predict their behavior in detail long term, even when that model is 100% complete and exactly precise.

Some types of complexity scale as n!. The human genome has ~20,000 genes, and 20,000! is about 1.8x10^77,337. Google has mapped a few tens of billions of pages; only a few tens of thousands of orders of magnitude in complexity to go. Good thing the end is in sight ;) But then how much of the 95%+ non-gene DNA is important? For what? In what tissue compartment(s)? Under what circumstances? How much complexity does that add? We know the answer is exponentially more.
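
(If you want to check that factorial without ever forming the number, lgamma does it in a couple of lines of Python:)

    # How many digits does 20,000! have? lgamma(n + 1) = ln(n!),
    # so dividing by ln(10) gives log10 of the factorial directly.
    import math

    log10_fact = math.lgamma(20000 + 1) / math.log(10)
    print(f"20000! ~ 10^{log10_fact:,.0f}")  # -> 20000! ~ 10^77,337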

Functional neuroimaging research in psychology has gotten to be very much like this. "The brain: It does stuff!" Although there are some theorists directly testing their ideas, and it can be valuable for others to throw lots of things at a brain region to find what it processes, some of the magic of science is disappearing. It's as though less thought is going into things. Moreover, the behavioral psychology researchers and the neuroimagers don't always do a very good job of communicating--one camp could very nicely inform the other, but they are all very devoted to their own pet fields, so they fail to acknowledge this.

What we wind up with is fabulous information like, "People with disorder X are different from other people in brain region Y." However, we don't know how/if this is reflected in behavior, and we don't talk to biologists enough to have any idea how the brain develops in the first place (which would help explain how the difference came about).

Ah, the rants of the dissertating student....

I recently read an article somewhere (I can't remember where) about this very thing. The main point the author made was that the PROBLEM with data-based "studies" is that they DO identify correlations. The problem is that many of the correlations they identify don't have any real meaning. In order to separate the meaningful correlations from the garbage, you have to have a model, or a theory, or both. Data alone aren't enough!
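
(You can watch the garbage appear in a few lines of Python - every column below is pure noise, yet "significant" correlations fall out anyway. Nothing here is real data.)

    # 50 columns of pure noise give 1,225 pairs to test; at p < 0.05 you
    # expect roughly 60 "significant" correlations by chance alone.
    import itertools
    import random

    def pearson(x, y):
        """Plain Pearson correlation coefficient."""
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    random.seed(42)
    data = [[random.gauss(0, 1) for _ in range(30)] for _ in range(50)]

    # |r| > 0.36 corresponds to roughly p < 0.05 for n = 30 observations.
    hits = sum(1 for x, y in itertools.combinations(data, 2)
               if abs(pearson(x, y)) > 0.36)
    print(f"{hits} 'significant' correlations out of {50 * 49 // 2} pairs of noise")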

Alex,

Francis Collins initiated ENCODE in 2003, when it was already clear (at least to some) that segmenting DNA into "genes" and "junk" was fundamentally flawed. It took 4 years and too-many-tax-dollars-to-mention of data-gathering for Big Science to (publicly) fess up that the data simply do not support the flawed dogma.

It is noteworthy that upon conclusion of the release of "pilot results" in 2007, the same Francis Collins issued a call that "the scientific community will have to re-think long-held beliefs"; i.e. he thought that some pause was necessary in data-gathering to revise axioms that were likely dogmas. Instead, NIH secured some more money for the next round of data-gathering (to continue ENCODE) - and effective this week Francis Collins resigned from NIH. (Apparently it is not a very pleasing job to try to dish out money for more data collection when the budget is stagnant and there is not much response to "re-thinking long-held beliefs".)

To see how profoundly ENCODE affected the (now obsolete) definition of the "gene", an early attempt is this paper, whose re-definition did not seem to stick.

As of this July, popular journals such as New Scientist blare headlines "Forget Genes" and support "Rewriting Darwin" with scores of science journal articles.

It is respectfully submitted, in answer to your kick-off posting, that data-euphoria will NOT kill scientific theory.

I don't know too much about science education in Asia, Latin America or Africa, but I assert that there is a difference between European and US science curricula. (At least my 40-year academic career, about half spent on each continent, seems to support this impression.) The US tends to go for data generation by Big Science, while Europeans watch for theoretical breakthroughs. Perhaps this is why our International PostGenetics Society ("Genomics beyond Genes") could hold its European Inaugural in 2006 (becoming the first organization to publicly abandon the "junk DNA" misnomer as a scientific term), while the US-led ENCODE fessed up later, only in mid-2007.

pellionisz_at_junkdna.com

How do you expect to get insight from Google?

Go to Google and type "Alex Palazzo" and "systems biology". You'll get 150 hits. Then type "systems biology" and "postdoctoral fellow". You'll get 49,200 hits. Then type separately "Alex Palazzo" and "postdoctoral fellow". You'll get 2,770 and 1,270,000 hits respectively. Normalizing 49,200 to 1,270,000 and 150 to 2,770, we get 3.9% of postdoctoral fellows and 5.4% of Alex Palazzos associated with systems biology. So Alex Palazzo is 1.4 times more interested in systems biology than the average postdoc.
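
(For the reproducibility-minded, the same back-of-envelope normalization in Python, with the hit counts above hardcoded:)

    # The normalization above; hit counts are from the Google queries
    # quoted in this comment.
    sysbio_postdoc, all_postdoc = 49_200, 1_270_000
    sysbio_alex, all_alex = 150, 2_770

    postdoc_rate = sysbio_postdoc / all_postdoc  # ~0.039
    alex_rate = sysbio_alex / all_alex           # ~0.054
    print(f"enrichment: {alex_rate / postdoc_rate:.1f}x")  # -> enrichment: 1.4x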

That was a hypothesis-driven exercise: "does Alex Palazzo secretly want to become a Big Biologist?" :)

By BIG Biologist (not verified) on 29 Jul 2008 #permalink
