Over at The Tree of Life, Jonathan Eisen asks:
What do people think are the potential benefits that could come from finishing?
For those who don’t know what genome finishing is, I’ll let Eisen give the short summary:
Finishing: Using any combination of laboratory, computational and other analyses one can both fill in gaps in the assembly and improve the quality of the assembly. This can generally be called “finishing”
In the context of microbial genomes, here are some of my thoughts about finishing (italics original; boldface mine):
Whole genomes don’t come flying out of the sequencing machines: we have to take hundreds of thousands or millions of reads and stitch them together–what is known in genomics as assembly. It’s pretty easy and fast to get a pretty good genome. By pretty good, I mean that most of the genome (~99%) is assembled into pieces 50,000 – 1,500,000 bases long…. Where the assemblers get hung up on with bacteria are repeated elements–regions of the genome that are virtually identical (they don’t have to be completely identical, just close enough such that the assembler thinks they’re identical reads with sequencing errors). Because the assembler can’t figure out where to put these reads (they’re all identical), it discards them–that’s where the breaks occur…
This is a problem because some of the most interesting genes, such as antibiotic resistance genes, are found sandwiched between repeated elements, known as insertion sequence elements (‘IS elements’; IS elements are one of the major reasons resistance genes move from plasmid to plasmid–plasmids are mini-chromosomes that themselves can move from bacterium to bacterium–and from plasmid to chromosome). What this means is that we can assemble an antibiotic resistance gene (or genes) but we might not know if it’s found on a plasmid or on the chromosome–that’s a pretty critical biological question. To further complicate things, different plasmids can have the same IS elements, along with the bacterial chromosome. Not only will these introduce breaks into the assembly, but they can also lead to accidentally assembling plasmids together or incorrectly incorporating them into the genome.
Now, we do have methods to close up these gaps–this process is called finishing, and it involves either targeted sequencing or manually parsing through the existing data. But these are open-ended, slow processes (particularly the targeted sequencing). Worse, this involves thinking, and, relative to computer algorithms, thinking is very slow. This is also really expensive. So we can get a pretty good assembly, but I think a lot of people, thinking back to the Sanger sequencing days, when most bacterial genomes were closed, are going to have to understand that if you want a lot of genomes, they will be ‘pretty good’ assemblies, not closed, finished ones.
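To make the repeat problem in the passage above concrete, here is a minimal Python sketch. It is a toy illustration, not how any real assembler works: the sequences, the `successors` and `ambiguous_kmers` helpers, and the choice of k are all made up for this example. It asks a simple question: for each stretch of k bases, is there more than one possible next base somewhere in the genome? When k is shorter than the repeat, the end of the repeat is ambiguous, and that ambiguity is roughly where an assembly would break.

```python
from collections import defaultdict

def successors(genome: str, k: int) -> dict:
    """Map each k-mer to the set of bases that follow it anywhere in the genome."""
    nxt = defaultdict(set)
    for i in range(len(genome) - k):
        nxt[genome[i:i + k]].add(genome[i + k])
    return nxt

def ambiguous_kmers(genome: str, k: int) -> list:
    """Return the k-mers that can be extended in more than one way."""
    return sorted(kmer for kmer, bases in successors(genome, k).items()
                  if len(bases) > 1)

# Hypothetical toy genome: one 8-base "IS element" present in two copies,
# each copy surrounded by different unique sequence.
REPEAT = "ACGTTGCA"
genome = "GGATCT" + REPEAT + "TTAGGA" + REPEAT + "CATGAA"

# k shorter than the repeat: the last 4 bases of the repeat are followed
# by T in one copy and C in the other, so extension is ambiguous there.
print(ambiguous_kmers(genome, 4))  # ['TGCA']

# k longer than the repeat: every k-mer spans into unique flanking
# sequence, so nothing is ambiguous.
print(ambiguous_kmers(genome, 9))  # []
```

The second call shows why longer reads (or targeted sequencing across the repeat) close the gap: once a single piece of evidence spans the whole repeat plus some unique sequence on either side, the two copies can be told apart.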
To return to Eisen’s question, I think finishing microbial genomes is important if you really have to localize genes to plasmids (or circularizing prophage). In infectious disease, that’s pretty important. However, from this perspective, finishing might become a moot point if the new technologies (454 pyrosequencing and Illumina) improve to the point where genes of interest can be reliably localized to plasmids*. Likewise, if you’re interested in the biology of repetitive elements, you’ll need finished genomes.
So, regarding finishing, I think that in about a year we’ll have very little need for complete finishing unless the biological question requires it (e.g., the study of repetitive elements).
*To get technical, as long as I can link a gene to a plasmid scaffold–a set of smaller sequences that I know are tied together, even though I lack some of the intervening regions–I’m happy.