Let’s start with some numbers. Like both Illumina and SOLiD, the HeliScope generates DNA sequence data as a massive collection of very short reads – but while the Illumina platform is now routinely generating reads over 100 bases long, the HeliScope generates reads on average just 32 bases long, with only a tiny fraction exceeding 50 bases in length. In fact, the reads are deliberately filtered to exclude any extending for over 70 bases, as these are highly enriched for technical artefacts.
Stitching together a genome sequence with such short reads is a substantial challenge, especially in regions where the sequence is repetitive – and indeed the technology can only cover 90% of the reference genome compared to 99.9% for a genome recently sequenced to similar depth with Illumina.
To be fair, Illumina achieves this in part by generating reads in non-independent pairs separated by a known distance (so-called paired-end reads), which are possible to generate on the HeliScope but weren’t used in this study, which was performed six months ago. Clearly genomic coverage will already have improved as Helicos brings paired-end runs online.
The short read length of the HeliScope limits its application, but the most worrying problem with the technology is its error rate: 3.6% of the bases in its raw reads are wrong, a substantially higher error rate than current second-generation platforms. The high error rate results largely from so-called “dark bases” – bases that don’t produce the fluorescent signal the HeliScope requires to read a sequence – which result in an apparent deletion in the read.
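To make the dark-base failure mode concrete, here’s a toy sketch (this is purely illustrative – it’s not Helicos’s actual base-calling pipeline): a template base that produces no fluorescent signal is simply absent from the observed read, so the read looks like it carries a deletion relative to the reference.

```python
# Toy illustration of a "dark base": a base that fails to fluoresce is
# simply skipped during sequencing, so the observed read is the template
# with that position deleted.

def read_with_dark_bases(template: str, dark_positions: set) -> str:
    """Return the read observed if the bases at `dark_positions` are dark."""
    return "".join(b for i, b in enumerate(template) if i not in dark_positions)

template = "ACGTACGTAC"
observed = read_with_dark_bases(template, dark_positions={3})

print(template)  # ACGTACGTAC
print(observed)  # ACGACGTAC -- the T at position 3 is missing, so an
                 # aligner sees an apparent single-base deletion
```

An aligner comparing the observed read back to the reference has no way to tell this apart from a genuine deletion in the sample – which is why dark bases feed directly into the error rate.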
As a result of the short reads and high error rate, the Helicos team had to throw away 37% of the reads they generated since they couldn’t be effectively mapped to the reference genome.
Calling genetic variants
Despite the challenges of mapping their short, error-prone reads, the team generated enough reads to cover the mappable 90% of the genome an average of 28 times per base, and that level of coverage (comparable to the depth seen in recent Illumina-based papers) meant that the errors in their raw reads could be largely cancelled out by the addition of more reads in the same place.
As a result of this depth of coverage and the generally low rate of base-swapping errors (as opposed to deletion errors), their accuracy for calls of single-base variants (SNPs) seems quite reasonable. They could call 97% of SNPs with 99% accuracy, which is still worse than second-generation approaches but not terrible for a rough draft genome.
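A rough back-of-the-envelope sketch shows why depth rescues accuracy. If we assume (unrealistically – real errors are dominated by deletions and are not independent) that each of 28 overlapping reads mis-calls a given base independently with probability 3.6%, the chance that a simple majority vote gets the consensus wrong is vanishingly small:

```python
from math import comb

def majority_error_prob(depth: int, per_read_error: float) -> float:
    """P(a majority of `depth` independent reads are wrong at one site)."""
    k_min = depth // 2 + 1  # smallest number of erroneous reads that wins the vote
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(k_min, depth + 1)
    )

p = majority_error_prob(depth=28, per_read_error=0.036)
print(f"{p:.2e}")  # vanishingly small under these idealised assumptions
```

Real variant callers are more sophisticated than a majority vote (they must handle heterozygous sites, where half the reads legitimately disagree), but the basic intuition holds: independent substitution errors wash out with depth, which is why it’s the systematic deletion errors that are the real worry.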
However, the potential for the HeliScope to call small insertion/deletion variants remains untested – the authors didn’t even attempt it here, and I can only assume that it will be non-trivially complicated by the prominence of deletion errors in the reads. Calls for larger insertions and deletions (copy number variants, or CNVs) are seriously restricted by the technique’s inability to extend into repetitive regions – the very same regions that are most enriched for these important variations.
In the media flurry around this article (see links below), Quake and his team appear to be pushing the line that the HeliScope is a feasible alternative to established second-generation platforms for smaller labs:
“This is the first demonstration that you don’t need a genome center to sequence a human genome,” Quake said in a statement. “This can now be done in one lab, with one machine, at a modest cost.” [GenomeWeb]
In the supplementary information the authors go so far as to compare the size of the author list in their study (a genuinely remarkable number: three) with previous published genomes (e.g. 196 authors for the first Illumina genome), apparently to demonstrate that the HeliScope takes less effort to run than its competitors – in the table legend they state that “the number of authors is an estimate of labor”.
This is rather silly, of course: the length of an author list on a genome paper has no necessary correlation with the ease of operating a technology. In Kevin Davies’ excellent article on the announcement in Bio-IT World, Clive Brown from third-generation competitor Oxford Nanopore has a trenchant response:
Brown, who was formerly with Solexa and Illumina, said it was misleading to compare the three co-authors on the Stanford paper with the 250 or so on the landmark 2008 Illumina publication in Nature on the first African genome, because “that paper was the culmination of eight years work.” He noted that an earlier 2008 Helicos publication had more than 20 co-authors to sequence a tiny viral genome.
(As an aside, in the same article Brown also delivers an entertainingly back-handed compliment on the Helicos technology: “They’ve stuck with it, and they’ve made it work about as good as it can work with single-molecule fluorescence and the camera they have. [...] That’s not trivial.”)
It’s unclear to me that the work involved in generating data on the HeliScope is actually that much less than that involved in using Illumina or SOLiD machines. Certainly the cost difference in terms of reagents is marginal at best; the authors estimate that this genome cost them $48,000 in reagents, which is exactly the price that Illumina is now offering for a retail genome sequence, and over twice the price that Complete Genomics is currently charging genomics facilities. And given the non-trivial up-front cost of a HeliScope – close to a million dollars, last I heard – this is hardly an infrastructure investment that most small labs will be able to consider in the near future.
One final point here: one of the requirements of next-gen sequencing that is frequently under-played is the need for informatics support and infrastructure. Very few small labs are equipped to deal with the sudden influx of terabytes of short-read sequence data; most lack both the hardware and the expertise to cope with such an onslaught. If Helicos or any other next-gen sequencer is to push into the small lab market it will need to invest heavily in the provision of powerful hardware and extremely user-friendly software to potential customers, to ensure that the people who receive their machines don’t find themselves completely unable to do anything with the resulting data.
Where to now?
This paper sets the bar pretty low for other third-generation sequencing contenders: it appears that formal entry into the human genome sequencing race merely requires generating a genome sequence of the standard that second-generation sequencers were achieving in early 2008, at the same price that they’re charging right now. That’s a fairly uninspiring goal.
I anticipate more exciting offerings in the near future from other third-gen providers such as Pacific Biosciences and Oxford Nanopore (long-term readers will know that I am a particular fan of Oxford Nanopore’s approach). The long-read, single-molecule approaches being developed by these companies will have a massive impact on the completeness and accuracy of human genome sequencing once they achieve the necessary cost and throughput milestones.
Basically, stay tuned: single molecule sequencing is the future, but the future isn’t quite here yet.
Links for further reading