A useful guide for the bioinformatics tool builders

I often get questions about bioinformatics, bioinformatics jobs and career paths. Most of the questions reflect a general sense of confusion between creating bioinformatics resources and using them. Bioinformatics is unique in this sense. No one confuses writing a package like Photoshop with being a photographer, yet for some odd reason, people seem to expect this of biologists. In the same respect, even the programmers and database administrators who work in bioinformatics are unfairly assumed to have had graduate-level training in biology.

In many ways, it's easiest to understand what bioinformatics is, and to choose a bioinformatics-related career, by dividing the field's participants into two groups: the tool builders and the tool users. The tool builders are the programmers, architects, computational biologists, and computer scientists who write new algorithms, create databases, and build software systems. The tool users are the biologists.

As far as careers go, job descriptions that include "bioinformaticist" or "bioinformatics programmer" usually apply to the tool builders. The jobs where people use bioinformatics are more biology-related. A wider variety of careers use bioinformatics resources, but that term won't appear in the job title and the people using the tools might not even know they're using bioinformatics - especially in the case of databases like PubMed.

The kind of bioinformatics I teach is directed towards the tool users, whether technicians, wet-bench biologists, or, like me, biologists who've gone digital. I teach instructors and students how to use the tools. In my classes, we use bioinformatics resources to learn about biology.

Nevertheless, I would like to have good answers for the future bioinformaticists and instructors who ask me what kinds of languages and subjects to study or teach. Working for a few years in a software company has taught me something about the activities in both the builder and user camps, but it's nice to have a more detailed and comprehensive reference to cite.

That's why I was really happy to read this article in PLoS Computational Biology: "A Quick Guide for Developing Effective Bioinformatics Programming Skills" by Joel Dudley and Atul Butte (1).

This is the article I will recommend to students on the tool-building path and instructors who wish to help them.

My favorite parts were the sections on UNIX skills (I love UNIX!), structuring data, and valuing your time. My only complaints are minor. I found the comment about SQL statements being peculiar puzzling. It would also have been nice to see some discussion of HDF5 and BioHDF, a topic that would have fit well in both the structuring data and valuing your time sections. BioHDF supports rapid development because it has a hierarchical data model, a binary file format, and a collection of APIs (2). (BioHDF is an open-source collaboration between The HDF Group and Geospiza. You can read more about it at www.biohdf.org and, in March, in Advances in Computational Biology, part of the book series Advances in Experimental Medicine and Biology, AEMB, published by Springer.)

The best part of the article, though, is that the authors get it. They understand what biologists want.

Quoting from the PLoS article:

The success of bioinformatics software is based not on the elegance of the software design, but rather its utility as a tool for driving and answering biological questions. Consequently it is no surprise that many successful bioinformatics apps are written by biologists who lack formal computer science training, as they undoubtedly put scientific utility ahead of architectural elegance and completeness.

This is an important point for aspiring bioinformaticists to remember.

References:
1. Dudley, J., & Butte, A. (2009). A Quick Guide for Developing Effective Bioinformatics Programming Skills. PLoS Computational Biology, 5 (12). DOI: 10.1371/journal.pcbi.1000589

2. Mason, C. et al. Standardizing the Next Generation of Bioinformatics Software Development With BioHDF (HDF5). In Advances in Computational Biology, Springer (in press).

The importance of any software is the utility to the area for which it was written. That is one of the biggest issues with Computer Science graduates. They know (sort of) how to build solutions, but they don't know the problems that need to be solved.

By RogerTheGeek (not verified) on 30 Dec 2009

Good point on the distinction between being a bioinformatician and using bioinformatics tools. One thing that I have noticed is ease of use of a tool is a major factor in whether or not it will find wide adoption with biologists. Tools with a web interface are much preferred. GUI tools will score a lot above command line tools even if the command line tools are better or more powerful. In general, most are not comfortable with tools that are not available as binaries.

Good points, but I'd be careful not to over-emphasize the lack of training in computer science. For some critical applications, you have to have a deep understanding of algorithms and data structures - I'm thinking in particular of genome assembly and of sequence alignment, especially (and recently) short-read alignment.

Not to put too fine a point on it, but my student Ben Langmead could not have come up with the remarkably efficient short-read aligner Bowtie without first knowing about the Burrows-Wheeler Transform, an innovative data structure and algorithm that was developed in the data compression field within computer science. Bowtie allows you to align the output of an Illumina run (50 million reads or more) to the human genome in just a couple of hours on a desktop PC. At the time of its release, the best competing software took 2 days or longer for the same thing. And the competitors soon re-implemented their algorithms using the B-W transform.
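For readers who haven't met the transform, here is a minimal Python sketch of the textbook version. It uses the naive rotation-sorting construction, which is fine for short strings but is not how Bowtie works internally (real aligners build an indexed form far more efficiently). The function names and the `$` end-of-text sentinel are assumptions chosen for illustration.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler Transform of text (naive rotation sort)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    # Build every cyclic rotation of the string, then sort them.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        # Prepend the transformed column to each row, then re-sort.
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row ending in the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    print(bwt("banana"))
    print(inverse_bwt(bwt("GATTACA")))
```

Calling `bwt("banana")` gives `annb$aa`: identical characters tend to end up next to each other, which is what makes the transform attractive for compression, and the transform is fully reversible.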

So sure, for quick and dirty solutions, computer science training is not usually necessary. (We had some outstanding bioinformatics programmers at TIGR, where I used to work, who came out of biology backgrounds.) But for those students who want to make major advances in bioinformatics, I would tell them to get some serious CS education.

Steven

Thanks Roger, Farhat, and Steven!

You've reinforced my points well. There is a difference between the skills and training needed for the people who design the tools and the people who use them, and "bioinformatics" as it's commonly used these days, mostly describes the process of designing and making the tools.

Steven: I agree that computer science is important for the tool builders. I don't think the PLoS article disputes that point. Overall, the best-working tools come from collaborations between people with many different kinds of skills. The biologists know what questions they want answered, and the computer scientists and programmers know how to make tools that answer the questions in a reasonable length of time. I'm biased about this, a bit, since I think some of the coolest tools coming out of Geospiza right now result from a great collaboration between biologists and computer scientists.

I also think Bowtie is pretty cool. I have used it myself, since Bowtie is one of the algorithms that Geospiza employs in one of its next-generation sequencing analysis pipelines. Luckily for me, being a biologist, I can upload my data to Geospiza's system, choose a reference data set, and click a button to run either the Burrows-Wheeler Aligner (BWA) or Bowtie.

I wrote about using the BWA last fall in Current Protocols in Bioinformatics.

The second paper (Mason et al., referenced above) mentions a prototype where Bowtie is integrated with BioHDF. This allows data to be processed and retrieved for viewing more efficiently and quickly than with some other systems (SAM / BAM).

And the reason this works so well, I think, is that Geospiza has been collaborating with The HDF Group to use a system, HDF, that was designed by computer scientists to efficiently handle large-scale scientific data.

While I agree with almost everything you wrote, I'm not so sure the labels are correct or even entirely relevant. As Roger pointed out, there are CS students (and degreed professionals) writing huge amounts of code that are more exercises in clever algorithm development than efforts to produce biologically relevant results. How many papers are published showing a new method gives similar results to another method? I love that PLOS quote!

1. We wouldn't have some of the superb tools we have without Computer Scientists. But the best ones also know biology, or work in a group focused on solving biological problems.

2. Bioinformaticians coming from the biology domain need to have a strong working knowledge of CS, meaning systems, algorithms, and software development; otherwise they are just highly educated tool users. Not that there isn't a place for people like this, but it's not bioinformatics.

3. The best bioinformaticians have the biggest skill sets, meaning being able to understand the biological problem AND having the tool kit to be able to attack the problem in a logical and meaningful way. The more you know about both domains the better. Especially when you have to serve as an intermediary between both camps!

I left the bench in 1993 to go into bioinformatics (we called it computational biology then, but that's a different topic), and my favorite quote to explain what I do to bench scientists is that I traded pipettes and centrifuges for software and computers.

Can someone please give me an example of a widely used bioinformatics software package that has been written by a "real" biologist? Many bioinformaticians and programmers in general have a training in other disciplines. I doubt that people that are doing wetlab work are very successful in writing software that is usable in the medium term. I would appreciate an example.

I do a lot of programming and I find it very difficult to write a really working program that I still understand after a year. Not because it is very complex, but because it takes a lot of time and effort to write something properly. I couldn't do experimental work at the same time. If I did, my software would just be throwaway spaghetti code, like most of the biologist-written code that I've come across. Such programs serve their purpose, but only for one user.

There's one called "BLAST" that I hear is quite popular ;-)

I commented on the Dudley & Butte paper a couple of weeks ago [1]. Atul Butte was kind enough to respond, saying that, "...the ideal scenario is when one's research projects enable one to learn these skills, so that these skills get learned in a practical way outside the classroom too, while doing science." My reply [2] was, "Yeah, but that doesn't work." The good news is, twenty hours of focused training, supplemented by lab sessions, does work [3]; the bad news is, the people who need it most are also least likely to go looking for help (or to fund it) [4].

[1] http://softwarecarpentry.wordpress.com/2009/12/27/dudley-and-butte-on-s…

[2] http://softwarecarpentry.wordpress.com/2009/12/30/osmosis-is-just-a-fan…

[3] http://software-carpentry.org and http://softwarecarpentry.wordpress.com

[4] http://www.cs.utoronto.ca/~gvwilson/articles/cise-will-not-learn-2008.p…

Nice slide on the bioinformatics tool. Can you give me more information about UNIX? Actually, I want to know how it is linked with bioinformatics.

By bioinformatics… (not verified) on 13 Sep 2010

You might as well ask "what is the link between bioinformatics and computers?"