The Future of Bacterial Genomics: It's Not the Sequencing, It's the...

By mikethemadbiologist on May 27, 2010.

...assembly and analysis. From the depths of the Mad Biologist's Archives comes this post.

The Wellcome Trust has a very good (and mostly accurate) article about the 'next-gen' sequencing technologies. I'm going to focus on bacterial genomics because humans are boring (seriously, compared to two bacteria in the same species, once you've seen one human genome, you've seen them all).

Most of the time, when you read articles about sequencing, they focus on the actual production of raw sequence data (i.e., 'reads'). But that's not the rate-limiting step. That is, we have now reached the point where working with the data we generate is far more time-consuming.

Whole genomes don't come flying out of the sequencing machines: we have to take hundreds of thousands or millions of reads and stitch them together--what is known in genomics as assembly. It's pretty easy and fast to get a pretty good genome. By pretty good, I mean that most of the genome (~99%) is assembled into pieces 50,000 - 1,500,000 bases long*. Where the assemblers get hung up with bacteria is on repeated elements--regions of the genome that are virtually identical (they don't have to be completely identical, just close enough that the assembler thinks they're identical reads with sequencing errors). Because the assembler can't figure out where to put these reads (they all look identical), it discards them--that's where the breaks occur*.
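To make the repeat problem concrete, here is a minimal sketch of a toy de Bruijn-style assembler. Everything in it is an assumption for illustration: the sequences are made up, the reads are error-free, and `build_graph` and `extend_right` are hypothetical helpers, not part of any real assembler. Real assemblers differ in the details (this one just stops at the ambiguity rather than discarding reads), but the break lands in the same place: where a repeat begins.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Link each (k-1)-mer to the (k-1)-mers that follow or precede it in a read."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            left, right = read[i:i + k - 1], read[i + 1:i + k]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred

def extend_right(seed, succ, pred):
    """Grow a contig from seed for as long as the next step is unambiguous."""
    contig, node, seen = seed, seed, {seed}
    while True:
        nexts = succ.get(node, set())
        if len(nexts) != 1:
            break  # dead end, or the path forks
        nxt = next(iter(nexts))
        if len(pred.get(nxt, set())) != 1 or nxt in seen:
            break  # two paths merge here (a repeat), or we looped back
        contig += nxt[-1]
        node = nxt
        seen.add(nxt)
    return contig

# Made-up toy genome: unique region, repeat, unique region, same repeat, unique region.
UNIQUE_A = "ATGGCTAGCAAGTCA"
REPEAT = "GGTCTCACGTTGACCTGA"
UNIQUE_B = "CCATTGCGAATAGCT"
UNIQUE_C = "TTAGCGGATCCATAA"
genome = UNIQUE_A + REPEAT + UNIQUE_B + REPEAT + UNIQUE_C

K = 8
READ_LEN = 12  # shorter than the repeat, so no single read spans it
reads = [genome[i:i + READ_LEN] for i in range(len(genome) - READ_LEN + 1)]

succ, pred = build_graph(reads, K)
contig = extend_right(genome[:K - 1], succ, pred)
print(len(contig), "of", len(genome), "bases assembled before the break")
```

The contig starts at the beginning of the genome and stops right at the edge of the first repeat copy, because from there the graph can no longer tell whether it arrived from the first unique region or the second.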

This is a problem because some of the most interesting genes, such as antibiotic resistance genes, are found sandwiched between repeated elements known as insertion sequence elements ('IS elements'). IS elements are one of the major reasons resistance genes move from plasmid to plasmid and from plasmid to chromosome (plasmids are mini-chromosomes that themselves can move from bacterium to bacterium). What this means is that we can assemble an antibiotic resistance gene (or genes) but we might not know if it's found on a plasmid or on the chromosome--that's a pretty critical biological question. To further complicate things, the same IS elements can be found on different plasmids as well as on the bacterial chromosome. Not only will these introduce breaks into the assembly, but they can also lead to accidentally assembling plasmids together or incorrectly incorporating them into the chromosome.

Now, we do have methods to close up these gaps--this process is called finishing, and it involves either targeted sequencing or manually parsing through the existing data. But these are open-ended, slow processes (particularly the targeted sequencing). Worse, this involves thinking, and, relative to computer algorithms, thinking is very slow. This is also really expensive. So we can get a pretty good assembly, but I think a lot of people, thinking back to the Sanger sequencing days, when most bacterial genomes were closed, are going to have to understand that if you want a lot of genomes, they will be 'pretty good' assemblies, not closed, finished ones.

The other area is annotation: now that you have a bunch of sequences, you would like to know what genes are found on those sequences. This involves two things: identifying the open reading frame ('ORF') of the gene (that is, which nucleotides encode proteins), and then identifying what that open reading frame encodes (I'm making this sound like a two-step process; it's actually an iterative process, where each step informs the other).
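For readers who haven't seen one, the first half of that (finding candidate ORFs) can be sketched in a few lines. This is a deliberately naive illustration, not how production gene callers work: it assumes the forward strand only, ATG as the only start codon (real bacterial genes can also start with GTG or TTG), and a hypothetical `find_orfs` helper with a made-up `min_codons` cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs, 0-based and half-open, for ATG..stop runs.

    Scans the three forward reading frames only; the stop codon is included
    in the reported interval and in the codon count.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the candidate and keep scanning
    return sorted(orfs)

toy_sequence = "CCATGAAATTTTAGGG"  # made-up sequence: CC ATG AAA TTT TAG GG
orfs = find_orfs(toy_sequence)
print(orfs)  # → [(2, 14)]
```

Even this toy version has to make a judgment call about which start codon to use, and that is exactly the kind of choice on which real methods end up disagreeing.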

Here too, we have automated gene callers which are very fast. Actually, many different gene calling methods. That's good! However, they will disagree with each other about five to ten percent of the time. By disagree, I don't just mean that two different methods call the same exact region a different protein (e.g., an aldolase versus a dehydrogenase). We could cope with that for a lot of the downstream analyses we do, as long as we have identified the protein correctly**. The problem really arises when two different, overlapping regions of sequence are identified as ORFs (e.g., program A calls nucleotides 1-300 as a gene, and B calls nucleotides 13-360 as a gene). That is not good, because then a human has to go through the output manually and figure out what the actual ORF is (requiring more thinking, which is slow and expensive). I would note that most major sequencing centers do manual annotation, but it is slow.
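Flagging those conflicts is the easy, automatable part; resolving them is what needs a human. As a rough sketch, assuming each program's calls have been reduced to 1-based inclusive (start, end) pairs on the same strand (the `overlap` and `compare_calls` helpers and the second gene at 500-800 are made up for illustration):

```python
def overlap(a, b):
    """Number of bases shared by two inclusive 1-based intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def compare_calls(calls_a, calls_b):
    """Split pairs of calls into exact agreements and overlapping conflicts."""
    agree, conflict = [], []
    for a in calls_a:
        for b in calls_b:
            if a == b:
                agree.append(a)
            elif overlap(a, b) > 0:
                conflict.append((a, b))  # same region, different boundaries
    return agree, conflict

# Program A and program B from the example above, plus one gene they agree on.
program_a = [(1, 300), (500, 800)]
program_b = [(13, 360), (500, 800)]

agree, conflict = compare_calls(program_a, program_b)
print(agree)     # → [(500, 800)]
print(conflict)  # → [((1, 300), (13, 360))]
```

The two conflicting calls share 288 bases, so they are clearly describing the same stretch of DNA; what the code can't tell you is which set of boundaries is right.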

So, from a bacterial perspective, genome sequencing is really cheap and fast--in about a year, I conservatively estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000-$6,000). We are entering an era where the time and money costs won't be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.

Interesting times.

*There are other technical reasons why breaks occur, but, to me, this is the worst offender.


Those pesky insertion elements and their repetitive flanking sequences may have a lot more significance than simply messing up sequencing efforts. Obviously, they can inactivate a gene when inserted into an ORF, which impacts evolution. However, this process may be more than just an intellectual curiosity. It could be the driving force behind rapid evolution (think Cambrian explosion) especially if it is coupled with hierarchical endosymbiosis.


What good review articles are there about the problem of structuring these data? Preferably accessible to people with no detailed biological knowledge, if possible. I have a mathematical/physics background and this stuff interests me.

