Nature News has a special feature on “big data” – a broad look at the demands of the brave new world of massively high-throughput data generation, and the solutions adopted by research institutes and corporations to deal with those demands.
The image to the left (from an article in the feature by Boing Boing’s Cory Doctorow) is a picture of the office door of Tony Cox, head of sequencing informatics at the Sanger Institute in Cambridge, UK. The 320 terabytes refers to the scale of the raw data being produced by the Sanger’s next-generation sequencing machines as they chew through kilometres of DNA, including their share of the ambitious 1000 Genomes Project. (The article mistakenly attributes the 320 TB figure to a single run of a Solexa next-gen machine, whereas it actually refers to the data generated by several such machines over a period of time; still, the real numbers are pretty damn impressive.)
The article provides some insight into a dramatic shift in the landscape of human genetics: we are no longer seriously limited by our capacity to generate biological information, but rather by our ability to store, transport and analyse the obscene amounts of data generated by high-throughput techniques. Once upon a time, most biologists could safely manage their results with a few lab books and a basic spreadsheet. Today, even small labs are learning how to cope with gigabytes of image, gene expression and sequencing data. Over the next few years those demands will only increase as technology becomes cheaper, and the publishing imperative (or less cynically, sheer scientific curiosity) drives all of us towards larger and more complex data-sets.
That will result in a pretty steep learning curve for many bench biologists. Major sequencing facilities can afford to invest in things like 1,000-square-metre server farms with a quarter left fallow for seamless technology upgrades, and they have the experienced staff to build and manage such resources to support their researchers. Most biologists in small labs, on the other hand, have little or no formal training in data management and analysis. Many of us have been forced to pick up computational skills on the fly, resulting in some innovative approaches (I still see biologists reformatting and analysing large data-sets using Word and Excel – it’s amazing what some judicious cutting, pasting and find/replacing can do in the hands of a clever non-programmer) but often far-from-ideal outcomes, such as data loss and failure to take full advantage of rich experimental data.
Any readers currently in the early stages of a career in biology should take heed: develop the skills required to navigate large, complex data-sets and you’ll be a hell of a lot more valuable to a potential lab head than if you were just another pipette-monkey (no offence intended to pipette-monkeys, of course; yours is an ancient and honourable profession, etc.). Even basic familiarity with a scripting language like Python or Perl and a statistical package like R will give you an edge, allowing you to automate tedious data entry and formatting tasks and build customised analysis tools; and if you end up as the go-to person in your lab for anyone with an informatics problem, you can secure middle authorship on papers with minimal effort on your part – a neat trick for a young researcher.
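To make that concrete, here’s a minimal sketch of the kind of reformatting job that often gets done by hand in Excel – parsing a tab-delimited expression table into per-gene records. The column names and gene IDs are entirely hypothetical, and the format is a made-up example, not any particular facility’s output; it’s just meant to show how little Python it takes to replace an afternoon of cutting and pasting.

```python
def parse_expression_table(text):
    """Parse a tab-delimited table (header row first) into a list of dicts.

    The first column is treated as a gene identifier; every other column
    is assumed to hold a numeric expression value and is converted to float.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split("\t")
    records = []
    for line in lines[1:]:
        record = dict(zip(header, line.split("\t")))
        for key in header[1:]:  # convert every non-ID column to a number
            record[key] = float(record[key])
        records.append(record)
    return records

# Hypothetical two-gene, two-tissue example:
raw = "gene\tliver\tbrain\nBRCA1\t2.5\t0.8\nTP53\t1.1\t3.4"
table = parse_expression_table(raw)
print(table[0]["gene"], table[0]["liver"])  # BRCA1 2.5
```

A dozen lines like these scale identically whether the table has three rows or three million – which is exactly the point where the spreadsheet approach falls over.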
For those of you not pursuing a career in genetics, the era of big data will still affect you: the data now being generated by large-scale sequencing facilities, and the technologies used to generate them, will ultimately help to usher in truly predictive, personalised medicine. I’ll be posting a lot more about this process over the next few months, so stay tuned.