I was talking with a scientist last week who is in charge of a
massive dataset. He told me he had heard complaints from many of his
biologist friends that today’s students are trained to be computer
scientists, not biologists. Why, he asked, would we want to do that
when the amount of data we handle is so trivial?
Now, you have to understand, to this person a dataset of 1000 whole genomes is
trivial. He said, don’t these students understand that in a few years
all the software they wrote to handle these data will be obsolete? They
certainly aren’t solving interesting problems in computer science, and
in a short time, they won’t be able to solve interesting problems in biology.
I’d agree that biological data-sets can’t compete with those of particle physics in terms of sheer scale, although the speed with which they are accumulating is alarming. Where biological data-sets really become intimidating is in their diversity, in the complexity of the underlying processes, and in the levels of noise and bias. I suspect a lot of people used to dealing with extremely large data-sets would still balk at the complexity of computational biology once they dug a little deeper, particularly in a few years’ time.
Tomorrow’s high-throughput plain-English bioinformatics tool will do
the work of ten thousand 2009 graduate students. If a freely-available
(or heck, even a paid) service can do the bioinformatics, what should
today’s graduate students be learning?
I am intrigued by the potential of natural language search algorithms,
and certainly I anticipate a future in which the combination of
well-curated, mutually intelligible biological databases and powerful
search tools makes it much easier for non-informaticians to generate
and explore hypotheses, in the same way that sites like NCBI and Ensembl
have made it simple for bench scientists to access and manipulate
sequence data. There’s no question that biologists with little or no
informatics background will be able to query increasingly complex
biological data-sets in increasingly complex ways over the next few years.
That said, such tools and databases, however powerful, will always lag substantially behind the science.
If you are a young biologist who wants to work right at the cutting edge – which
will require dealing directly with rapidly changing technologies,
generating biological data at an increasingly dizzying pace and in
constantly evolving formats – solid informatic skills, including at
least basic programming and sound statistical knowledge, will make you a far more productive scientist.
If you intend to be at the head of your field, you’ll often be in a
place where the right tools for the job simply don’t exist yet. You
need to be able to develop such tools yourself, or at least speak the
right language to communicate your needs to someone who can; and
speaking that language means having a good working knowledge of computation.
Of course programming languages will change and the scripts you write
as a grad student will be forgotten within a year or two – that’s the
nature of science (how many molecular biologists still run Southern blots?). The important thing is learning how to think about large-scale biological data:
how to access, filter and manipulate it. Having basic programming
expertise will make you more effective as a scientist right now, and it
will also prepare you for a career in an increasingly data-driven field.
Of course, informatic skills alone will get you nowhere unless your
ambition is to be the default IT support team for your lab partners. Regardless of
whether you are asking questions using John’s hypothetical universal
query engine or an algorithm of your own invention, you need to be asking the right questions, which means developing an understanding of biology that is both deep and broad. If the quoted concern in John’s post is valid – if young biologists are actually sacrificing scientific understanding for computational skills – then that is certainly something that needs to be corrected.
Still, let’s be sure not to swing too far in the opposite direction. Unless and until Wolfram Alpha triggers the singularity, I’d argue that biology grad students will be extremely well-served by developing serious programming and statistical expertise, at least if they want to be marching at the head of their field. Speaking as a biologist who entered informatics far too late (as a postdoc), I can think of few other skill areas where investing effort and time early in your career can provide such a dramatic return in terms of scientific productivity and career prospects.
Related: xkcd effectively says the same thing in cartoon style, and the comments on that post have some useful tips.