Genetic Future

Pushkarev, D., Neff, N., & Quake, S. (2009). Single-molecule sequencing of an individual human genome. Nature Biotechnology. DOI: 10.1038/nbt.1561



There is a new twist, though: this is the first genome to be sequenced using single molecule sequencing technology – also known as “third-generation” sequencing, to distinguish it from first-generation Sanger sequencing, and from the newer second-generation platforms 454, Illumina and SOLiD that have been responsible for seven of the eight individual genomes published so far*. 
The technology in question is the HeliScope, brought to you by Helicos BioSciences; and the genome in question belongs to Helicos co-founder Stephen Quake.
Single molecule sequencing is clearly the future of genome analysis, so this should be an exciting announcement – but while this paper is a promising taste of things to come, the genome sequence itself is in many ways a disappointment. Let’s take a look at what Helicos have achieved, and at just how far the company has to go before it can hope to compete with established second-gen platforms.

The challenges: short reads and a high error rate
Let’s start with some numbers. Like both Illumina and SOLiD, the HeliScope generates DNA sequence data as a massive collection of very short reads – but while the Illumina platform is now routinely generating reads over 100 bases long, the HeliScope generates reads on average just 32 bases long, with only a tiny fraction exceeding 50 bases in length. In fact, the reads are deliberately filtered to exclude any extending for over 70 bases, as these are highly enriched for technical artefacts.

Stitching together a genome sequence with such short reads is a substantial challenge, especially in regions where the sequence is repetitive – and indeed the technology can only cover 90% of the reference genome compared to 99.9% for a genome recently sequenced to similar depth with Illumina.

To be fair, Illumina achieves this in part by generating reads in non-independent pairs separated by a known distance (so-called paired-end reads). These can be generated on the HeliScope but weren’t used in this study, which was performed six months ago, so genomic coverage has presumably improved already as Helicos bring paired-end runs online.
The short read length of the HeliScope limits its application, but the most worrying problem with the technology is its error rate: 3.6% of the bases in its raw reads are wrong, a substantially higher error rate than current second-generation platforms. The high error rate results largely from so-called “dark bases” – bases that don’t produce the fluorescent signal the HeliScope requires to read a sequence – which result in an apparent deletion in the read.
As a result of the short reads and high error rate, the Helicos team had to throw away 37% of the reads they generated since they couldn’t be effectively mapped to the reference genome. 
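To get a feel for why mapping these reads is so hard, here is a back-of-the-envelope sketch in Python. It takes the two figures quoted above (32-base average reads, 3.6% raw error rate) and adds a simplifying assumption of my own, not the paper's: that errors hit each base independently. The function name is purely illustrative.

```python
def error_free_read_fraction(read_length: int, per_base_error: float) -> float:
    """Fraction of reads expected to contain no errors at all,
    assuming errors strike each base independently (a simplification)."""
    return (1.0 - per_base_error) ** read_length


READ_LENGTH = 32        # average HeliScope read length quoted in the post
RAW_ERROR_RATE = 0.036  # raw per-base error rate quoted in the post

clean = error_free_read_fraction(READ_LENGTH, RAW_ERROR_RATE)
expected_errors = READ_LENGTH * RAW_ERROR_RATE

print(f"error-free reads: {clean:.1%}")
print(f"expected errors per read: {expected_errors:.2f}")
```

Under that assumption fewer than a third of reads come out error-free, and the average read carries just over one error. Real errors are unlikely to be perfectly independent, so treat this as an order-of-magnitude illustration only.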
Calling genetic variants
Despite the challenges of mapping their short, error-prone reads, the team generated enough reads to cover the mappable 90% of the genome an average of 28 times per base, and that level of coverage (comparable to the depth seen in recent Illumina-based papers) meant that the errors in their raw reads could be largely cancelled out by the addition of more reads in the same place.
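The "cancelling out" can be made concrete with a rough sketch. The Python below asks how likely it is that a strict majority of the reads covering one position are wrong, treating each read as an independent coin flip with the 3.6% raw error rate. Both the independence and the majority-vote rule are assumptions for illustration; this is not the variant-calling method used in the paper.

```python
from math import comb


def majority_wrong_probability(depth: int, per_read_error: float) -> float:
    """Probability that a strict majority of `depth` reads are wrong at one
    position, assuming independent errors (upper tail of a binomial)."""
    threshold = depth // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(depth, k) * per_read_error**k * (1.0 - per_read_error) ** (depth - k)
        for k in range(threshold, depth + 1)
    )


RAW_ERROR_RATE = 0.036  # raw per-base error rate quoted in the post

for depth in (1, 3, 10, 28):
    print(depth, majority_wrong_probability(depth, RAW_ERROR_RATE))
```

The probability collapses quickly as depth grows, which is the intuition behind the 28-fold coverage. Two caveats: the sketch pessimistically counts every erroneous read as voting for the same wrong answer, and 28 is an average, so some positions will be covered far more thinly than that.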
As a result of this depth of coverage and the generally low rate of base-swapping errors (as opposed to deletion errors), their accuracy for calls of single-base variants (SNPs) seems quite reasonable. They could call 97% of SNPs with 99% accuracy, which is still worse than second-generation approaches but not terrible for a rough draft genome.
However, the potential for the HeliScope to call small insertion/deletion variants remains untested – the authors didn’t even attempt it here, and I can only assume that it will be non-trivially complicated by the prominence of deletion errors in the reads. Calls for larger insertions/deletions (copy number variants, or CNVs) are seriously restricted by the technique’s inability to extend into repetitive regions – the very same regions that are most enriched for these important variations.
Democratising genomics?
In the media flurry around this article (see links below), Quake and his team appear to be pushing the line that the HeliScope is a feasible alternative to established second-generation platforms for smaller labs:

“This is the first demonstration that you don’t need a genome center to sequence a human genome,” Quake said in a statement. “This can now be done in one lab, with one machine, at a modest cost.” [GenomeWeb]

In the supplementary information the authors go so far as to compare the size of the author list in their study (a genuinely remarkable number: three) with previous published genomes (e.g. 196 authors for the first Illumina genome), apparently to demonstrate that the HeliScope takes less effort to run than its competitors – in the table legend they state that “the number of authors is an estimate of labor”. 
This is rather silly, of course: the length of an author list on a genome paper has no necessary correlation with the ease of operating a technology. In Kevin Davies’ excellent article on the announcement in Bio-IT World, Clive Brown from third-generation competitor Oxford Nanopore has a trenchant response:

Brown, who was formerly with Solexa and Illumina, said it was misleading to compare the three co-authors on the Stanford paper with the 250 or so on the landmark 2008 Illumina publication in Nature on the first African genome, because “that paper was the culmination of eight years work.” He noted that an earlier 2008 Helicos publication had more than 20 co-authors to sequence a tiny viral genome.

(As an aside, in the same article Brown also delivers an entertainingly back-handed compliment on the Helicos technology: “They’ve stuck with it, and they’ve made it work about as good as it can work with single-molecule fluorescence and the camera they have. [...] That’s not trivial.”)
It’s unclear to me that the work involved in generating data on the HeliScope is actually that much less than that involved in using Illumina or SOLiD machines. Certainly the cost difference in terms of reagents is marginal at best; the authors estimate that this genome cost them $48,000 in reagents, which is exactly the price that Illumina is now offering for a retail genome sequence, and over twice the price that Complete Genomics is currently charging genomics facilities. And given the non-trivial up-front cost of a HeliScope – close to a million dollars, last I heard – this is hardly an infrastructure investment that most small labs will be able to consider in the near future.
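To see how much the up-front cost matters, here is a small Python sketch that spreads the instrument price evenly over the genomes a lab sequences. The $48,000 and roughly $1 million figures come from the paragraph above; the genome counts are hypothetical, and the function ignores staff, informatics and maintenance costs entirely.

```python
def cost_per_genome(reagents: float, instrument: float, genomes_run: int) -> float:
    """Reagent cost plus an equal share of the instrument price per genome."""
    if genomes_run <= 0:
        raise ValueError("genomes_run must be positive")
    return reagents + instrument / genomes_run


REAGENTS_USD = 48_000       # authors' reagent estimate quoted above
INSTRUMENT_USD = 1_000_000  # rough HeliScope price ("close to a million dollars")

for genomes in (10, 50, 100):  # hypothetical numbers of genomes sequenced
    print(genomes, cost_per_genome(REAGENTS_USD, INSTRUMENT_USD, genomes))
```

Even at a hundred genomes the instrument adds $10,000 to each one, on top of a reagent bill that already matches Illumina's retail price.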
One final point here: one of the requirements of next-gen sequencing that is frequently under-played is the need for informatics support and infrastructure. Very few small labs are equipped to deal with the sudden influx of terabytes of short-read sequence data; most lack both the hardware and the expertise to cope with such an onslaught. If Helicos or any other next-gen sequencer is to push into the small lab market it will need to invest heavily in the provision of powerful hardware and extremely user-friendly software to potential customers, to ensure that the people who receive their machines don’t find themselves completely unable to do anything with the resulting data.
Where to now?
This paper sets the bar pretty low for other third-generation sequencing contenders: it appears that formal entry into the human genome sequencing race merely requires generating a genome sequence of the standard that second-generation sequencers were achieving in early 2008, at the same price that they’re charging right now. That’s a fairly uninspiring goal.
I anticipate more exciting offerings in the near future from other third-gen providers such as Pacific Biosciences and Oxford Nanopore (long-term readers will know that I am a particular fan of Oxford Nanopore’s approach). The long-read, single-molecule approaches being developed by these companies will have a massive impact on the completeness and accuracy of human genome sequencing once they achieve the necessary cost and throughput milestones.
Basically, stay tuned: single molecule sequencing is the future, but the future isn’t quite here yet.
Links for further reading
* For an excellent summary of second-generation sequencing see this article on the Wellcome Trust website by Mun-Keat Looi.

Comments

  1. #1 Kevin Davies
    August 10, 2009

    Daniel,
    Excellent summary as always. While this doesn’t set the bar so high, neither did 454 when they debuted 2nd-gen sequencing in 2005.
    I’m just impressed (and pleased) how far Helicos has come in a short time. Do you remember at AGBT last February, John Todd asked Bill Efcavitch why the HeliScope cost so damn much? Efcavitch replied tersely: “We’re still cheaper than the Large Hadron Collider!”
    Quake told me over the weekend they already have three cancer genomes sequenced and being analyzed, so we should know soon if they can raise that bar.

  2. #2 Daniel MacArthur
    August 11, 2009

    Hey Kevin,

    Fair point about the debut of 454 (although I’m guessing Helicos is desperately hoping it won’t end up matching the historical precedent of 454 too closely, i.e. getting there first before being eclipsed by substantially better technologies).

    It will be interesting to see what they get out of the cancer genomes, particularly if they’ve applied paired-end in those cases – with their current single-end approach they’ll miss a lot of the more complex structural rearrangements you get in cancer cells.

    I agree that it’s good to see the progress that Helicos has made, but I’m still pretty skeptical about the long-term prospects of this platform. As I understand it the optics of the system place some pretty fundamental limits on it, and I suspect the need for architectural changes to support a 900-kilogram machine will put a lot of labs off. We’ll see…

  3. #3 Edward Winstead
    August 11, 2009

    Daniel,

    Thank you for this fascinating overview.
    Some of your readers may be interested to know that the genome of a second cancer patient has now been sequenced by the same Wash U group that did the first. The results were reported in the New England Journal of Medicine August 5.

    http://www.ncbi.nlm.nih.gov/pubmed/19657110

    Edward Winstead

  4. #4 Daniel MacArthur
    August 11, 2009

    Hi Edward,

    Oops… the second cancer patient completely slipped my mind. I’ve added him to the list in the first paragraph – thanks for the reminder!

  5. #5 Alejandro
    August 13, 2009

    Hi Daniel,

    Nice post. Very comprehensive and critical.
    I selected your post as one of my “picks of the week” in molecular biology over at my blog (http://amontenegro.blogspot.com).
    Cheers,
    -A

  6. #6 Keith Robison
    August 15, 2009

    One other point about the HeliScope worth noting, though sequencing a typical genome doesn’t show it off. The sample prep for Helicos is much simpler and involves no amplification (this was demonstrated in their recent yeast RNA profiling paper). For a lot of applications that’s probably not significant, but in those where you either want to analyze a lot of samples or have minuscule amounts of starting material or badly degraded starting material, these could be advantages.

  7. #7 Anon
    August 18, 2009

    The HeliScope has the ability to sequence small strands of DNA faster as well as large strands. Smaller strands tend to be more common in labs anyhow and serve as a means to check progress. Illumina’s service may require sending DNA to their facilities, which will take time. If the accuracy is not too important (e.g. seeing if a DNA insert was correctly ligated into a vector or counting the number of tandem repeats), Helicos’s machine should perform fine.
