Oh shit, that's 320 terabytes! Dealing with data in a high-throughput age

By dgmacarthur on September 4, 2008.

Nature News has a special feature on "big data" - a broad look at the demands of the brave new world of massively high-throughput data generation, and the solutions adopted by research institutes and corporations to deal with those demands.

The image to the left (from an article in the feature by Boing Boing's Cory Doctorow) is a picture of the office door of Tony Cox, head of sequencing informatics at the Sanger Institute in Cambridge, UK. The 320 terabytes refers to the scale of the raw data being produced by the Sanger's next-generation sequencing machines as they chew through kilometres of DNA, including their share of the ambitious 1000 Genomes Project. (The article mistakenly attributes the 320 TB number to a single run of a Solexa next-gen machine, whereas it actually refers to the data generated by several such machines over a period of time; still, the real numbers are pretty damn impressive.)

The article provides some insight into a dramatic shift in the landscape of human genetics: we are no longer seriously limited by our capacity to generate biological information, but rather by our ability to store, transport and analyse the obscene amounts of data generated by high-throughput techniques. Once upon a time, most biologists could safely manage their results with a few lab books and a basic spreadsheet. Today, even small labs are learning how to cope with gigabytes of image, gene expression and sequencing data. Over the next few years those demands will only increase as technology becomes cheaper, and the publishing imperative (or less cynically, sheer scientific curiosity) drives all of us towards larger and more complex data-sets.

That will result in a pretty steep learning curve for many bench biologists. Major sequencing facilities can afford to invest in things like 1,000 square metre server farms with a quarter left fallow for seamless technology upgrades, and they have the experienced staff to build and manage such resources to support their researchers. Most biologists in small labs, on the other hand, have little or no formal training in data management and analysis. Many of us have been forced to pick up computational skills on the fly, resulting in some innovative approaches (I still see biologists reformatting and analysing large data-sets using Word and Excel - it's amazing what some judicious cutting, pasting and find/replacing can do in the hands of a clever non-programmer) but often far-from-ideal outcomes, such as data loss and failures to take full advantage of rich experimental data.

Any readers currently in the early stages of a career in biology should take heed: develop the skills required to navigate large, complex data-sets and you'll be a hell of a lot more valuable to a potential lab head than if you were just another pipette-monkey (no offence intended to pipette-monkeys, of course; yours is an ancient and honourable profession, etc.). Even basic familiarity with a scripting language like Python or Perl and a statistical package like R will give you an edge by allowing you to automate tedious data entry and formatting tasks and make customised analysis tools; and if you end up as the go-to person in your lab for anyone with an informatic problem you can secure middle authorship on papers with minimal effort on your part - a neat trick for a young researcher.
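
To give a flavour of the kind of tedious formatting task a few lines of scripting can take over, here is a minimal Python sketch. Everything in it is invented for illustration: the function name `reformat_table`, the three-column layout and the sample values are assumptions, not taken from any real data-set. It converts tab-separated text to CSV, trims stray whitespace from every cell and upper-cases the sample IDs in the first column, which is roughly the sort of job that otherwise gets done by hand with find/replace.

```python
import csv
import io


def reformat_table(tsv_text):
    """Convert tab-separated text to CSV.

    Trims whitespace from every cell and upper-cases the first
    column (assumed here to hold sample IDs) on all rows except
    the header. Blank lines are skipped.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(reader):
        if not row:
            continue  # skip blank lines
        cells = [cell.strip() for cell in row]
        if i > 0:
            cells[0] = cells[0].upper()  # normalise sample IDs
        writer.writerow(cells)
    return out.getvalue()


# Hypothetical input with the messy spacing typical of hand-edited files
example = "sample\tgene\texpression\n s1 \tGENE1\t 12.5\ns2\tGENE1\t7.1 \n"
print(reformat_table(example))
```

For a real file you would read the text from disk and write the result back out, but the core of the job is the dozen or so lines in the loop.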

For those of you not pursuing a career in genetics, the era of big data will still have its impact on you: the data now being generated by large-scale sequencing facilities, and the technologies used to generate them, will ultimately help to usher in truly predictive, personalised medicine. I'll be posting a lot more about this process over the next few months, so stay tuned.

Don't leave out the language R. It's terribly useful and has a wonderful community (and it's free)

I cannot agree enough. As a former/present pipette-monkey and someone who wishes he had taken the time to learn something simple like perl, I wholeheartedly agree. It is unfortunate that now that I am a PI, I don't have the time to take an immersion course in these languages. I have to send my grad students and rely on them to write the scripts that I envision, then hope the scripts do what it is that I asked.

sir,

vulgarity is bad form in your titles. after all, your headline is displayed on the front page.

sincerely,
c.v. snicker

peter,

Excellent point - I've added R to the post.

chet,

I'm Australian, remember - "shit" doesn't even register as vulgarity in my uncouth culture. However, I'll try to bear the sensibilities of my more delicate readers in mind in future.

f**king criminal spawn :-)
