"Will You Stop Using 454?" Um, I Already Kinda Did

A couple of weeks ago, I came across this discussion thread, "Will you stop using 454?" It's a pretty good thread--not much to disagree with there, although, from my perspective, it missed a key point (I'll get to that).

But my answer is simple: I already have. My work focuses primarily on microbial genomics--that is, whole bacterial genomes. And 454 just isn't getting it done. Before I get to that, let's review very briefly how we assemble a genome (I'm simplifying greatly and leaving out a whole bunch of molecular biology and chemistry here--this is for the uninitiated).

We don't actually sequence a genome--we sequence little pieces of it. With standard 454, the pieces wind up being about 350 - 450 base pairs (bp) long. To put this in perspective, a bacterial genome is typically two to five million bp. What we do is tile or 'stack' reads on each other and build a sequence. Something like this toy example:

AGCT
GCTC
CTCA

Becomes: AGCTCA (although we typically have much more 'coverage'--many more reads confirming each base).
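
For readers who think in code, here is a minimal sketch of the tiling idea using the toy reads above. It assumes the reads are already in left-to-right order and are error-free, which real assemblers cannot assume; the function names overlap and tile are made up for this illustration.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def tile(reads):
    """Stack reads (assumed already ordered left to right) into one sequence."""
    sequence = reads[0]
    for read in reads[1:]:
        k = overlap(sequence, read)
        # Append only the part of the read that extends past the overlap.
        sequence += read[k:]
    return sequence


print(tile(["AGCT", "GCTC", "CTCA"]))
```

Each read shares three bases with the sequence built so far, so each one adds a single new base.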

This works fine until you have repetitive content--identical (or nearly so) sequences that occur throughout the genome. Suppose our genome has unique sequences A, B, and C, with a repetitive region X between them. It's impossible to figure out from tiling reads if we have A-X-B-X-C or A-X-C-X-B. We solve this by using what are called 'jumps'*. These reads are composed of two regions of the genome that are a certain distance apart (lots of molecular biology is done here), enabling us to jump over the repetitive stuff. In the ABC example above, if we have jumps that contain parts of A and B, and parts of B and C, and no A and C jumps, we conclude the sequence is A-X-B-X-C.
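
The jump logic can be sketched the same way. This is a toy model, not how an assembler actually scaffolds: it assumes every pair of neighboring unique regions is linked by at least one jump and that no jump links non-neighbors. The function name consistent_orders and the input format are invented for the example.

```python
from itertools import permutations


def consistent_orders(regions, jumps):
    """Return the orderings of unique regions whose neighbors match the jumps.

    `jumps` is a list of pairs of region names seen together in a jump.
    Assumption for this sketch: jumps link neighboring unique regions only.
    """
    links = {frozenset(pair) for pair in jumps}
    orders = []
    for perm in permutations(regions):
        # An order and its mirror image describe the same molecule; keep one.
        if perm[0] > perm[-1]:
            continue
        neighbors = {frozenset(pair) for pair in zip(perm, perm[1:])}
        if neighbors == links:
            # X stands for the repeat sitting between unique regions.
            orders.append("-X-".join(perm))
    return orders


print(consistent_orders(["A", "B", "C"], [("A", "B"), ("B", "C")]))
```

With A-B and B-C jumps and no A-C jumps, only A-X-B-X-C survives.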

There's a catch though: the jumps have to be larger than your repetitive regions, otherwise you're right back where you started. Which brings us back to 454.

If you use 'standard' 454 sequencing, with ~350-400 bp fragments and 3 kb (3,000 bp) jumps, the jump sizes are insufficient to span the repetitive regions typically found in bacteria. Some bacteria are riddled with these repetitive regions, and the 'interesting genes', such as antibiotic resistance genes and virulence genes, are flanked by these repetitive regions. So we're often able to capture these genes, but we have no idea where they are found in the genome. That's actually a critical biological and medical question: if these genes are found on plasmids--mini-chromosomes that can be transferred from bacterium to bacterium--that's really important to know (many of the 'problem' resistance genes are found on plasmids).

Simply put, a 'bag of genes' genome is inadequate. While 454 can use 8 kb jumps, they're 'twitchy' (high failure rate) and not amenable to high-throughput production. Compared to Illumina assemblies that use 5 kb jumps (and are thus able to span most of the repetitive regions), 454 isn't even in the same league. Add the much lower cost of generating Illumina sequence data, and 454 is really out of the bacterial genome game.

Does this mean 454 is 'dead'? Perhaps not, as Nick Loman notes:

My understanding is that 454 long reads are for whole-genome shotgun only, not amplicon protocol. At least this is the case for the initial release. Which means they won't be useful for 16S diversity studies which is a shame as that would be the biggest win for us (for accurate species level determination)....

454 long reads (i.e. 700 vs 500) may be useful for a limited number of de novo assembly projects but I don't think they are critical, and it is certainly true that the cost is prohibitive for many users.

I could see 454 being useful if you need long reads and not that many of them (compared to Illumina). For human microbiome studies, 454 is still widely used for sequencing 16S genes (the 16S gene in bacteria can be thought of as a universal barcode that allows us to count bacteria without having to isolate the bacterium).

But for single bacterial genomes, Illumina is the near future (although who knows what PacBio holds?**).

*Technically, these are 'paired ends', but since they perform the same function as Illumina jumping libraries, I'm calling them jumps to hopefully reduce confusion.

**The major concern I have with PacBio de novo assemblies, and which could turn out to be completely unfounded, is that PacBio introduces a lot of indels, which makes identifying genes in a genome sequence difficult. If we assemble the genome completely but miss lots of genes due to these errors, is that an improvement? Will have to see data....

Hi Mike

I tend to agree with your conclusion. Although I would say we've had success in the past with 454 Titanium combined with 8kb libraries to produce single scaffold assemblies, but it is both expensive and labour intensive.

What is interesting is that Ion Torrent is going to give you very similar data to 454 FLX (at least to start off with), and so the same issues of dealing with draft, fragmented genomes will arise.

It might also be worth noting that 5kb Illumina mate-pair is not yet trivial on the wet side and also has issues with variability.

PacBio sounds very promising, but the cost of the machine will be prohibitive, so we'll have to get our data from service providers.

We sure do live in interesting times!

Cheers

Nick.

Hi,
Our company, BlackBio, in Spain, has solved that question, the universal identification of bacteria by genome sequencing, using a simpler and cheaper method based on Qiagen pyrosequencing (formerly from Biotage), improved by some patented enhancers: we sequence three small fragments of 16S and cross-reference the sequence information. You can get more info at www.blackbio.eu. The platform is validated and tested in some hospitals in Europe and the Middle East.

I think machines for HT-sequencing, like 454, Ion Torrent, and PacBio, don't solve (and maybe never will solve) clinical problems, which need simpler, faster, easier and cheaper solutions.

Cheers

Pedro

Interesting that you complain about PacBio indels, yet don't mention the 454 indels!

454 has a well-known problem with homopolymer regions, i.e. stretches of 2 or more of the same base in a row, which it often counts as being one base longer or shorter. With 454, this is a *systematic* error, so resequencing the same region multiple times won't necessarily get rid of those errors. This causes frameshifts and incorrect gene annotations, and a huge headache for both genomic and metagenomics studies.

PacBio admittedly has more errors, but they seem to be randomly distributed. So, unlike 454 indel errors, they average out very quickly as you increase the read coverage.
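
A toy illustration of that difference, under heavy assumptions: the reads below are invented, already aligned to each other, and a lost base is written as '-'. The consensus function is a plain per-column majority vote, not any real platform's base caller.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per column; '-' marks a missing base and is dropped."""
    called = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            called.append(base)
    return "".join(called)


truth = "ACGTTTTGCA"

# Random errors: each read loses a base at a different position.
random_errors = ["AC-TTTTGCA", "ACGTTTTG-A", "ACGTT-TGCA",
                 "ACGTTTTGCA", "A-GTTTTGCA"]

# Systematic errors: most reads undercount the same TTTT homopolymer.
systematic_errors = ["ACGTTT-GCA", "ACGTTT-GCA", "ACGTTT-GCA",
                     "ACGTTTTGCA", "ACGTTT-GCA"]

print(consensus(random_errors) == truth)
print(consensus(systematic_errors) == truth)
```

With scattered errors, every column still has a correct majority, so the vote recovers the true sequence. When most reads make the same mistake in the same place, the vote confirms the mistake and the homopolymer comes out one base short.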

Suppose our genome has unique sequences A, B, and C

So, being one of the uninitiated :D ... this might confuse some people.

Specifically, that we're talking nucleotides ... and then 'placeholders' for sequences.