Bio::Blogs 4

Welcome to the fourth edition of Bio::Blogs!

This is the carnival where we explore topics at the intersection of computing, biology, and sometimes a bit of human behavior.

In this edition, we consider issues with annotation, agonize over standards, explore the question of whether or not it's possible to tame those wild and wooly computational biologists and make them laugh their way into writing programs that other people can use, give the Perl fans something to do while waiting for that program to run, and much, much more.

Today, we'll begin on the biology side of the spectrum and work our way over into computing.

Do try this in the classroom, or better yet, at home!
Chris Cotsapas submitted a truly excellent post from Nick Matzke at the Panda's Thumb that describes a wonderful activity and data set that students can use to look at human evolution by graphing the differences between fossils. In Fun with Hominin Cranial Capacity Datasets (Excel), Matzke talks about hunting through the wilds of PubMed to uncover the elusive records of brain size in fossils. I've pasted a smaller version of his graph here, since Matzke gives permission for educational use (you are learning something, right?), but do check out the original graph - or better yet, graph it yourself, since Matzke gives a link to the original data sets.

Pedro Beltrao, from Public Rambling, and one of the fathers of this carnival, shares some interesting ideas about the evolution of transcription factors and the DNA sites where they bind in Evolution of transcription networks. As Mary-Claire King and Allan Wilson proposed, long ago, a little change in regulation can go a long, long way. Pedro observes, in looking at the data, that changes in a single base, in a site where transcription factors bind, are far more common than changes in the transcription factors themselves. This makes sense, since if you changed the specificity of a transcription factor (the protein), you would have more of a global effect (since the same transcription factor binding site is dispersed throughout the genome). Changing a single base in a single copy of a transcription factor binding site, located near a single gene, would be predicted to have a more subtle, and probably less detrimental, effect on phenotype, so I think we would expect changes in individual binding sites to occur more often.

As long as we're considering DNA sequences, let's take a look at Neil Saunders' issues with databases and genome annotations. In an entertaining and thought-provoking article, Genome annotation: who's responsible?, Neil struggles through the Sargasso Sea environmental sequence data and other sections of GenBank in search of DNA sequences for 23S ribosomal RNAs and a protein sequence for monomethylamine methyltransferase. These journeys lead to a lament about the lack of community-wide standards in genome annotation, and some suggestions for improvements. To me, they also emphasize the importance of being able to retrieve and work with the sequence data itself. I don't think the semantic web is going to solve Neil's sort of problem.

Neil's analysis reminds me of an instructive paper by Michael Galperin and Eugene Koonin on systematic errors in genome annotation that's worth reading, even if it's almost ten years old. I heard Galperin give a really funny talk on genome annotation bloopers. He talked about the curious puzzles in biology that can arise from gene annotation. In one case, a gene name became truncated and changed from a "phage head protein" to a "head protein." Now, when biologists think of the word "head," we envision the anatomical structure that contains a brain and sits on top of an animal body, certainly not a brainless virus.

You're right, Neil, we do have a long way to go.

Chris Cotsapas, from Fourth Floor Studio, adds to the call for annotation standards in his post Phenotype: the new standards war?. Although I would disagree with Chris's comment that the HapMap project has collected the vast majority of common genetic variations (I think they've only gotten 1%, but I have to check on this), I do agree that developing standards for describing phenotypes is hard and sometimes contentious. Standards will certainly benefit the community, but getting there will involve many arguments about minutiae and will no doubt be fraught with pain.

Since we've been discussing human behavior and its impact on the problem of annotation standards, it seems like a good time to look at infrastructure. Let's face it, IT is expensive. Between 1985 and 2002, UPS spent over $17 billion on information technology. They continue to spend over $1 billion per year, or 11% of their budget, so you can track your package. I suspect that if we were to add up all the person hours that biologists (like Neil and me) spend sitting at our computers searching for information, we'd find that scientific endeavors spend much, much more (in terms of time) and probably get much, much less.

In Scientific Software and in her paper in PLOS, "Scientific Software Development is not an Oxymoron," smeutaw writes about minimizing the pain by luring computational biologists into adopting ideas from software engineering, and maybe even taking classes in Software Carpentry. This blog post hit home since I work at a company that creates scientific software, so we encounter these issues on a daily basis. smeutaw does a nice job of describing some of the reasons behind Neil's frustrations, i.e. the lack of funding for activities like upgrading infrastructure, porting code to new systems, or keeping up with security concerns.

And speaking of infrastructure, we have a post from mndoci, on Utility computing, web applications and computational science. In this post, he discusses a product called AppLogic, which may help SaaS programmers and help "lower the barrier to scientific developers." Hope you're getting settled in and enjoying Capitol Hill, D.S.!

Last, we have one more treat from Neil Saunders, this time for Perl developers. Have you ever started running a program only to wonder if there's time to get a cup of coffee before it's completed? Wouldn't it be helpful to have a little program that would calculate the size of the computing problem and give you an estimate of the time that it needs? In this post, Neil shares a Perl script for making a progress bar with Term::ProgressBar that will certainly help you watch as time goes by.
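If you'd like to see the general idea without installing anything, here is a minimal sketch in Python. To be clear, this is not Neil's script and it does not use Term::ProgressBar; the names format_progress and run_with_progress are made up for illustration. It assumes you know how many items you need to process, and it guesses the remaining time by assuming that the rest of the items will take about as long, on average, as the ones already finished.

```python
import sys
import time


def format_progress(done, total, elapsed, width=30):
    """Return a one-line text progress bar with a naive time estimate.

    Hypothetical helper for illustration only. 'elapsed' is the number of
    seconds spent so far on the 'done' items out of 'total'.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = min(max(done / total, 0.0), 1.0)
    filled = int(width * fraction)
    bar = "=" * filled + " " * (width - filled)
    if done > 0:
        # Linear extrapolation: assume the remaining items take, on
        # average, as long as the completed ones did.
        remaining = elapsed * (total - done) / done
        eta_text = f"{remaining:.0f}s left"
    else:
        eta_text = "ETA unknown"
    return f"[{bar}] {fraction:4.0%} {eta_text}"


def run_with_progress(items, work, out=sys.stderr):
    """Apply work() to each item, redrawing the bar after every item."""
    items = list(items)
    start = time.monotonic()
    results = []
    for count, item in enumerate(items, start=1):
        results.append(work(item))
        elapsed = time.monotonic() - start
        # "\r" returns to the start of the line so the bar is overwritten.
        out.write("\r" + format_progress(count, len(items), elapsed))
        out.flush()
    out.write("\n")
    return results


if __name__ == "__main__":
    # Stand-in workload: sleep briefly for each of 20 "sequences".
    run_with_progress(range(20), lambda _: time.sleep(0.1))
```

The estimate is only as good as the assumption behind it. If the last few items are much slower than the first few, the bar will be too optimistic, so check it before you commit to that cup of coffee.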

Until next time, "Here's looking at you, kid!"

Next month's Bio::Blogs will be hosted by Chris Cotsapas at the Fourth Floor Studio.


Although I would disagree with Chris's comment that the HapMap project has collected the vast majority of common genetic variations (I think they've only gotten 1%, but I have to check on this)

It's certainly more than 1% (in phase II), but it's difficult to know, especially for regions of the genome without deep resequencing. In this paper, 5 out of 22 common SNPs found in resequencing were in the HapMap. How representative that figure is would be tough to predict.

Could you be thinking of the Encode project for the 1% figure?

I agree that it's hard to know; I'd also point out that recent work showing just how common structural variations (insertion/deletion/duplication events) are is redefining (again!) the way we think of variation. However, it's probably a good bet that many common (i.e., >5% frequency) SNP variants have been identified: ~3-4M in the HapMap, depending on the filtering criteria. It remains to be seen whether the assumption that SNPs make up the bulk of polymorphisms is true.

The main limitation, of course, is that the number of samples used is small (270 total), and drawn from three geographic locations, which may lead to ascertainment bias.