Oh shit, that's 320 terabytes! Dealing with data in a high-throughput age

[Image: the "320 terabytes" sign on Tony Cox's office door]

Nature News has a special feature on "big data" - a broad look at the demands of the brave new world of massively high-throughput data generation, and the solutions adopted by research institutes and corporations to deal with those demands.

The image above (from an article in the feature by Boing Boing's Cory Doctorow) is a picture of the office door of Tony Cox, head of sequencing informatics at the Sanger Institute in Cambridge, UK. The 320 terabytes refers to the scale of the raw data being produced by the Sanger's next-generation sequencing machines as they chew through kilometres of DNA, including their share of the ambitious 1000 Genomes Project. (The article mistakenly attributes the 320 TB figure to a single run of a Solexa next-gen machine, whereas it actually refers to the data generated by several such machines over a period of time; still, the real numbers are pretty damn impressive.)

The article provides some insight into a dramatic shift in the landscape of human genetics: we are no longer seriously limited by our capacity to generate biological information, but rather by our ability to store, transport and analyse the obscene amounts of data generated by high-throughput techniques. Once upon a time, most biologists could safely manage their results with a few lab books and a basic spreadsheet. Today, even small labs are learning how to cope with gigabytes of image, gene expression and sequencing data. Over the next few years those demands will only increase as technology becomes cheaper, and the publishing imperative (or less cynically, sheer scientific curiosity) drives all of us towards larger and more complex data-sets.

That will result in a pretty steep learning curve for many bench biologists. Major sequencing facilities can afford to invest in things like 1,000 square metre server farms with a quarter left fallow for seamless technology upgrades, and they have the experienced staff to build and manage such resources to support their researchers. Most biologists in small labs, on the other hand, have little or no formal training in data management and analysis. Many of us have been forced to pick up computational skills on the fly. That has produced some innovative approaches (I still see biologists reformatting and analysing large data-sets using Word and Excel - it's amazing what some judicious cutting, pasting and find/replacing can do in the hands of a clever non-programmer), but often far-from-ideal outcomes: data loss, and failure to take full advantage of rich experimental data.
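To make that concrete, here's a minimal sketch of the scripted alternative in Python (the file names and the particular substitutions are invented for illustration, not taken from any real pipeline). Because it streams the file line by line rather than loading it all into memory, a multi-gigabyte data-set is no harder to process than a small one - which is exactly where Word and Excel fall over.

```python
# A hypothetical example: the same find/replace job a biologist might
# attempt in Word, done as a stream so file size barely matters.
# File names and substitutions are illustrative stand-ins.

input_path = "expression_raw.txt"     # hypothetical raw data file
output_path = "expression_clean.txt"  # cleaned copy written alongside it

with open(input_path) as raw, open(output_path, "w") as clean:
    for line in raw:
        # e.g. normalise the missing-value marker and the delimiter in one pass
        clean.write(line.replace("N/A", "NA").replace(", ", "\t"))
```

Ten lines, reproducible, and it runs unattended - no cutting and pasting required.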

Any readers currently in the early stages of a career in biology should take heed: develop the skills required to navigate large, complex data-sets and you'll be a hell of a lot more valuable to a potential lab head than if you were just another pipette-monkey (no offence intended to pipette-monkeys, of course; yours is an ancient and honourable profession, etc.). Even basic familiarity with a scripting language like Python or Perl and a statistical package like R will give you an edge, letting you automate tedious data entry and formatting tasks and build customised analysis tools. And if you end up as the go-to person in your lab for anyone with an informatics problem, you can secure middle authorship on papers with minimal effort on your part - a neat trick for a young researcher.
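As a taste of what a "customised analysis tool" can look like, here's a hedged Python sketch that summarises a tab-delimited expression file by gene. The file name and column headers are hypothetical, but the pattern - parse, group, summarise - covers a surprising amount of everyday lab informatics.

```python
# A hypothetical illustration: group expression values by gene and print
# a per-gene mean. Assumes a tab-delimited file with (at least) "gene"
# and "expression" columns; both names are invented for this sketch.

import csv
from collections import defaultdict
from statistics import mean

values_by_gene = defaultdict(list)

with open("expression_clean.txt") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        values_by_gene[row["gene"]].append(float(row["expression"]))

for gene, values in sorted(values_by_gene.items()):
    print(f"{gene}\t{mean(values):.2f}\t(n={len(values)})")
```

Swap the grouping key or the summary statistic and you've got a different tool; that's the whole trick.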

For those of you not pursuing a career in genetics, the era of big data will still have its impact on you: the data now being generated by large-scale sequencing facilities, and the technologies used to generate them, will ultimately help to usher in truly predictive, personalised medicine. I'll be posting a lot more about this process over the next few months, so stay tuned.

Subscribe to Genetic Future.


Don't leave out the language R. It's terribly useful and has a wonderful community (and it's free).

I cannot agree enough. As a former/present pipette-monkey and someone who wishes he had taken the time to learn something simple like Perl, I wholeheartedly agree. It is unfortunate that now that I am a PI, I don't have the time to take an immersion course in these languages. I have to send my grad students instead, and rely on them to write the scripts that I envision, then hope the scripts do what I asked.

sir,

vulgarity is bad form in your titles. after all, your headline is displayed on the front page.

sincerely,
c.v. snicker

peter,

Excellent point - I've added R to the post.

chet,

I'm Australian, remember - "shit" doesn't even register as vulgarity in my uncouth culture. However, I'll try to bear the sensibilities of my more delicate readers in mind in future.