A couple of weeks ago, I came across a discussion thread titled “Will you stop using 454?” It’s a pretty good thread–not much to disagree with there, although, from my perspective, it missed a key point (I’ll get to that).
But my answer is simple: I already have. My work focuses primarily on microbial genomics–that is, whole bacterial genomes. And 454 just isn’t getting it done. Before I get to that, let’s review very briefly how we assemble a genome (I’m simplifying greatly and leaving out a whole bunch of molecular biology and chemistry here–this is for the uninitiated).
We don’t actually sequence a genome–we sequence little pieces of it. With standard 454, the pieces wind up being about 350 – 450 base pairs (bp) long. To put this in perspective, a bacterial genome is typically two to five million bp. What we do is tile or ‘stack’ reads on each other and build a sequence. Something like this toy example:
AGCT
GCTC
CTCA
Becomes: AGCTCA (although we typically have much more ‘coverage’–many more reads confirming each base).
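For anyone who would rather see the tiling as code, here is a minimal Python sketch of the toy example above. It assumes the reads are error-free and already in left-to-right order, which a real assembler can't assume; the function names `overlap` and `tile_reads` are made up for illustration.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def tile_reads(reads):
    """Stack reads (assumed error-free and already ordered) into one sequence."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read)
        contig += read[k:]  # append only the bases the contig lacks
    return contig


print(tile_reads(["AGCT", "GCTC", "CTCA"]))  # AGCTCA
```

Real assemblers also have to work out the order of the reads and tolerate sequencing errors, which is where most of the difficulty lies.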
This works fine until you have repetitive content–identical (or nearly so) sequences that occur throughout the genome. Suppose our genome has unique sequences A, B, and C, with a repetitive region X between them. It’s impossible to figure out from tiling reads if we have A-X-B-X-C or A-X-C-X-B. We solve this by using what are called ‘jumps’*. These reads are composed of two regions of the genome that are a certain distance apart (lots of molecular biology is done here), enabling us to jump over the repetitive stuff. In the ABC example above, if we have jumps that contain parts of A and B, and parts of B and C, and no A and C jumps, we conclude the sequence is A-X-B-X-C.
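The A, B, C reasoning can be sketched in a few lines of Python. This is only an illustration of the logic, not how an assembler actually does scaffolding: it assumes we already know which pairs of unique regions are linked by at least one jump, and it holds the first region fixed so we don't count mirror-image orderings. The name `supported_orders` is hypothetical.

```python
from itertools import permutations


def supported_orders(regions, jump_links):
    """Return every ordering of the unique regions (first region held fixed)
    in which each pair of neighbours is linked by at least one jump."""
    first, rest = regions[0], regions[1:]
    orders = []
    for perm in permutations(rest):
        order = (first, *perm)
        neighbours = zip(order, order[1:])
        if all(frozenset(pair) in jump_links for pair in neighbours):
            orders.append(order)
    return orders


# Jumps link A with B and B with C; there are no A-C jumps.
jump_links = {frozenset(("A", "B")), frozenset(("B", "C"))}
print(supported_orders(["A", "B", "C"], jump_links))  # [('A', 'B', 'C')]
```

With no jumps at all, neither ordering is supported, which is the ambiguity that tiling reads alone leave us with.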
There’s a catch though: the jumps have to be larger than your repetitive regions, otherwise you’re right back where you started. Which brings us back to 454.
If you use ‘standard’ 454 sequencing, with ~350-400 bp fragments and 3 kb (3,000 bp) jumps, the jump sizes are insufficient to span the repetitive regions typically found in bacteria. Some bacteria are riddled with these repetitive regions, and the ‘interesting genes’, such as antibiotic resistance genes and virulence genes, are flanked by these repetitive regions. So we’re often able to capture these genes, but we have no idea where they are found in the genome. That’s actually a critical biological and medical question: if these genes are found on plasmids–mini-chromosomes that can be transferred from bacterium to bacterium–that’s really important to know (many of the ‘problem’ resistance genes are found on plasmids).
Simply put, a ‘bag of genes’ genome is inadequate. While 454 can use 8 kb jumps, they’re ‘twitchy’ (high failure rate) and not amenable to high-throughput production. Compared to Illumina assemblies that use 5 kb jumps (and are thus able to span most of the repetitive regions), 454 isn’t even in the same league. Add in the much lower cost of generating Illumina sequence data, and 454 is really out of the bacterial genome game.
Does this mean 454 is ‘dead’? Perhaps not, as Nick Loman notes:
My understanding is that 454 long reads are for whole-genome shotgun only, not amplicon protocol. At least this is the case for the initial release. Which means they won’t be useful for 16S diversity studies which is a shame as that would be the biggest win for us (for accurate species level determination)….
454 long reads (i.e. 700 vs 500) may be useful for a limited number of de novo assembly projects but I don’t think they are critical, and it is certainly true that the cost is prohibitive for many users.
I could see 454 being useful if you need long reads and not that many of them (compared to Illumina). For human microbiome studies, 454 is still widely used for sequencing 16S genes (the 16S gene in bacteria can be thought of as a universal barcode that allows us to count bacteria without having to isolate the bacterium).
But for single bacterial genomes, Illumina is the near future (although who knows what PacBio holds?**).
*Technically, these are ‘paired ends’, but since they perform the same function as Illumina jumping libraries, I’m calling them jumps to hopefully reduce confusion.
**The major concern I have with PacBio de novo assemblies, one that could turn out to be completely unfounded, is that PacBio introduces a lot of indels, which makes identifying genes in a genome sequence difficult. If we assemble the genome completely but miss lots of genes due to these errors, is that an improvement? We’ll have to see the data….