I wrote last week about the dramatic presentation here at AGBT by Clifford Reid, CEO of new DNA sequencing company Complete Genomics. Reid made grand promises – entire human genome sequencing for $5000 available this year, and the sequencing of a million complete human genomes within the next five years – and presented some impressive data on the sequencing of their first human genome, from an anonymous American male.
Reid’s promises and data certainly caught the attention of the genomics community, and received decent media interest – the story was covered by New Scientist, Bio-IT World, Nature News and Bloomberg. The reason for the interest is simple: the $5000 genome that Complete is promising is dirt cheap by the current standards of genomics, and suddenly puts a lot of extremely valuable research projects – and even personal genome sequencing of individuals – within affordable reach.
Complete also appears to have caught the eye of major genome sequencing facilities; the Nature News article states that “[a] few centres have now signed on for pilot projects in which Complete Genomics will sequence five genomes at $20,000 apiece”. Only one of these (the Broad Institute) has so far been formally announced, but there are more on the way, and Complete also has a deal with the Institute for Systems Biology to sequence a further 100 genomes this year (announced last October).
So, can Complete deliver an accurate, complete human genome sequence at the promised price? While Reid’s presentation was impressive, I was left with a number of questions about the company’s technical approach and business model. I put these questions to Complete’s CEO Clifford Reid and CSO Rade Drmanac on Saturday morning.
Repetitive DNA and structural variation
Complete’s platform, like the current sequencing technologies from Illumina and ABI, employs “short read” sequencing – the genome is read as a series of tiny fragments that are then stitched back together informatically. Short read platforms pose major challenges when it comes to sequencing across highly repetitive DNA, and also in resolving large-scale structural variation (i.e. variable insertions and deletions of DNA).
Complete uses a “paired-end” approach, similar to those also adopted by Illumina and ABI, to help resolve these challenges. Basically, this means generating short reads from either end of a fragment of DNA of known length; this approach allows short-read platforms to walk their way across repetitive regions, and to single out chunks of DNA that are missing or repeated relative to the reference sequence.
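As a rough illustration of the paired-end logic described above, here is a minimal Python sketch. It is not Complete's pipeline: the `ReadPair` class, the `classify_pair` function, the 500-base fragment length and the 100-base tolerance are all hypothetical, chosen only to show how a mismatch between the known fragment length and the mapped distance hints at missing or extra DNA.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    left_pos: int   # reference coordinate where the left read starts
    right_pos: int  # reference coordinate where the right read ends


def classify_pair(pair, expected_len, tolerance):
    """Compare the mapped span of a read pair with the known fragment length."""
    span = pair.right_pos - pair.left_pos
    if span > expected_len + tolerance:
        # The reads map further apart on the reference than the fragment is
        # long, so the sample is probably missing DNA the reference contains.
        return "possible deletion"
    if span < expected_len - tolerance:
        # The reads map closer together than expected, so the sample probably
        # carries extra DNA between them.
        return "possible insertion"
    return "consistent"


# Hypothetical pairs from fragments assumed to be about 500 bases long.
pairs = [ReadPair(1000, 1500), ReadPair(2000, 3200), ReadPair(5000, 5150)]
calls = [classify_pair(p, expected_len=500, tolerance=100) for p in pairs]
print(calls)
```

In practice a single odd pair proves little; real callers look for many pairs that agree on the same anomaly before reporting a structural variant.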
The paired-end approach helps, but it’s not perfect – in the data Reid presented around 8% of the test genome could not be sequenced by their platform, and Drmanac told me that their current approach has a theoretical maximum coverage of around 95% of the genome.
Resolving the remaining 5% will require the application of a supplementary technology, called Long Fragment Reads (LFR). This approach first smashes a small amount of genomic DNA up into large fragments (around 100,000 bases each) and then partitions it randomly into 384 separate wells. After amplifying the DNA, you are left with wells that contain a random subset of the genome; sequencing each of those sub-sets separately (using a unique label) means that areas of the genome that are highly similar to one another (such as segmental duplications) usually end up in separate partitions, and can thus be resolved from one another.
The LFR approach won’t solve everything – it will struggle to separate small duplicated regions very close together, and sometimes duplicated regions will end up in the same partition by chance – but it should help to dig into the evasive 5% of the genome. As an added bonus, the approach would allow Complete to distinguish between the two copies of a chromosome present in an individual, effectively separating the copy you inherited from your mother from the one you inherited from your father. That’s something none of the current sequencing technologies can do right now, and it will be helpful – if it works – for hunting disease genes and performing population genetic analysis.
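To get a feel for how often similar regions might share a partition by chance, here is a small Python sketch. It assumes, purely for illustration, that each large fragment lands in one of the 384 wells uniformly and independently, and it considers only the copies of a single duplicated region; the function names are hypothetical and the real protocol may behave differently.

```python
import random

WELLS = 384  # number of partitions quoted in the text


def same_well_probability(copies, wells=WELLS):
    """Chance that at least two of `copies` similar fragments share a well,
    assuming uniform, independent assignment of fragments to wells."""
    p_all_distinct = 1.0
    for i in range(copies):
        p_all_distinct *= (wells - i) / wells
    return 1.0 - p_all_distinct


def simulate(copies, trials=100_000, wells=WELLS, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(trials):
        chosen = [rng.randrange(wells) for _ in range(copies)]
        if len(set(chosen)) < copies:
            collisions += 1
    return collisions / trials


for n in (2, 4, 8):
    print(n, same_well_probability(n), simulate(n))
```

Under these assumptions two copies collide with probability 1/384, and the chance grows as the number of near-identical copies increases, which fits the caveat that LFR will not separate everything.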
In addition, Complete has plans to develop paired-end reads using a number of different fragment sizes. This is an approach that has been tried with some success on Illumina’s platform, and I can’t see any good technical reason why it wouldn’t work with Complete’s technology; this approach may help resolve some of the larger repetitive regions.
Neither LFR nor the multiple-fragment-size approach has been worked into Complete’s production platform yet, so it will be some time before it’s clear exactly how much of the genome can actually be captured by this technology. However, a more pressing concern comes from another area – error rates.
[Note: section edited 11/2/09 to correct calculation errors.]
Reid’s presentation included some sequencing accuracy statistics that sounded quite impressive – but even a low error rate can cause major problems when you’re sequencing an entire genome.
Based on Complete’s data (available here), there was 99.94% concordance between the sequencing and chip-based genotyping data of the same individual; on examination, only around 18% of discordant sites represent sequencing errors (the remainder are errors made by the SNP chip). That gives Complete an overall accuracy of just under 99.99% – meaning one in every ten thousand variants was called incorrectly. It’s hard to say exactly how many errors might accumulate over an entire genome sequence, but rough calculations would suggest somewhere on the order of 80,000-100,000 false positives and perhaps 1000 or so missed variants.
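The accuracy arithmetic can be spelled out in a few lines of Python. The two inputs are the figures quoted above; the variable names are just labels for this sketch, and the calculation does not attempt to reproduce the genome-wide false positive estimate, which depends on details not given here.

```python
concordance = 0.9994      # agreement between sequencing and SNP chip (from the text)
seq_error_share = 0.18    # share of discordant sites attributed to sequencing

discordance = 1 - concordance
seq_error_rate = discordance * seq_error_share
accuracy = 1 - seq_error_rate

print(f"sequencing error rate: {seq_error_rate:.4%}")
print(f"accuracy: {accuracy:.4%}")
print(f"about one error per {1 / seq_error_rate:,.0f} compared sites")
```

This works out to an error rate of about 0.0108% and an accuracy of about 99.989%, or roughly one sequencing error per 9,300 compared sites, in line with the "one in ten thousand" figure.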
These errors exist despite the fact that each base in the test genome was covered by an average of over 90 separate reads, suggesting a substantial error rate in the raw reads (which may explain why 60% of the reads generated in the test run couldn’t be aligned successfully to the reference genome).
Of course, I need to emphasise that the error rate in Complete’s final product will almost certainly be much better than in this test data-set; Reid assured me that a substantial proportion of this error would likely be corrected once the company had a better handle on the types of systematic errors their platform creates. An accurate error model would allow them to adjust (at least most of the time) for the more common types of mistake.
However, it’s also worth bearing in mind that the test data-set had an average depth of coverage of over 90X (meaning each base in the genome was sequenced with over 90 independent reads, on average), whereas Complete is talking about offering commercial genome sequences with a coverage of just 40X. With a lower depth of coverage, the platform may require considerable improvements in accuracy in order to have a signal-to-noise ratio high enough for applications like finding a single mutation in a severe disease patient.
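One way to see why the drop from 90X to 40X matters is a simple Poisson model of coverage. This is an idealisation, not Complete's data: it assumes reads land uniformly at random, which real sequencing does not do, and the threshold of 20 reads is a hypothetical cut-off chosen only for illustration.

```python
import math


def prob_depth_below(threshold, mean_depth):
    """P(depth < threshold) when per-base depth follows Poisson(mean_depth)."""
    term = math.exp(-mean_depth)  # P(depth == 0)
    total = 0.0
    for k in range(threshold):
        total += term                 # add P(depth == k)
        term *= mean_depth / (k + 1)  # step to P(depth == k + 1)
    return total


# Fraction of bases expected to fall under the hypothetical 20-read cut-off.
for depth in (40, 90):
    print(f"{depth}X: {prob_depth_below(20, depth):.3e}")
```

Even under this optimistic model, thinly covered bases are far more common at 40X than at 90X, and real coverage is more uneven than a Poisson distribution suggests.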
I would certainly expect this level of error to be substantially diminished by the time Complete’s product hits the market. Still, this is a cautionary tale for anyone looking forward to getting their complete genome sequence – all of the existing platforms have a high enough error rate to cause substantial error on a genome-wide level, so sequencing error will add an extra layer of complexity to the task of deciphering a human genome sequence. This will be improved with better chemistry, refined algorithms and high coverage, but it’s important to bear in mind that if you get your genome sequenced within the next few years you will almost certainly not be receiving a complete, error-free final product.
A couple of readers expressed interest in whether Complete intended to increase its read length in the near future. This is a difficult question to answer, due to the rather convoluted process by which Complete’s system reads DNA (put simply, by stitching together a series of 10 base pair reads of known distance from one another). Drmanac told me there are plans in the works to extend their 10-base probes out to 15 bases, but it was unclear whether this would be ready in time for their commercial launch in June. This won’t actually have a huge impact on their effective read length, but I guess it will help to improve their accuracy by allowing some bases in each fragment to be sequenced multiple times.
Format of returned data
Like many potential customers, I was very interested in finding out how Complete is planning to return their clients’ sequence data. The answer, apparently, will be as a list of differences from the reference genome. If the LFR technology is used (and Complete is still not sure whether this will be default or optional), the variants will be “haplotype-sorted” – in other words, it will be clear which of the two sets of chromosomes each difference is located on.
Drmanac later told me by email that the data will also include quality scores – measures of confidence that a particular difference is actually real. I can’t emphasise enough how important accurate quality scores will be for interpreting a genome sequence: these scores, along with functional predictions, will play a major role in downstream algorithms for finding likely disease-causing variants for further validation and analysis.
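To illustrate how such a list might be used downstream, here is a hypothetical Python sketch. Complete has not described its file format in any detail, so the `VariantCall` fields, the quality scale and the threshold below are all invented for the example; only the general idea (differences from the reference, each with a quality score and optionally a haplotype) comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantCall:
    chrom: str
    pos: int
    ref: str                  # base(s) in the reference genome
    alt: str                  # base(s) observed in the sample
    quality: float            # confidence that the difference is real (made-up scale)
    haplotype: Optional[int]  # 1 or 2 if haplotype-sorted, None otherwise


def confident_calls(calls, min_quality):
    """Keep only the differences whose quality score clears the threshold."""
    return [c for c in calls if c.quality >= min_quality]


# Invented example records, not real data.
calls = [
    VariantCall("chr1", 12345, "A", "G", quality=55.0, haplotype=1),
    VariantCall("chr1", 67890, "C", "T", quality=8.0, haplotype=None),
    VariantCall("chr2", 13579, "G", "A", quality=40.0, haplotype=2),
]
kept = confident_calls(calls, min_quality=30.0)
print([(c.chrom, c.pos) for c in kept])
```

The point of the filter is that a poorly calibrated quality score would either discard real variants or let errors through, which is why the accuracy of those scores matters so much.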
Complete will need to demonstrate a strong commitment to data security, both in terms of maintaining patient anonymity and reassuring potential industry customers (e.g. biotech and pharma) that their industrial secrets are safe.
Reid told me that Complete would initially be offering its service completely blinded to the nature of the samples sent by customers, which is some reassurance. Still, that won’t be enough for many customers, and Reid said there were plans to develop “bank-level” security over the storage and transfer of data to customers.
Products on offer
Reid was very clear in his presentation that Complete intends to offer only a single product: complete human genome sequences. During my meeting with Reid and Drmanac I tried to clarify exactly where the boundaries lay.
For the moment, Reid told me, the “human” part is absolute – Complete won’t even consider sequencing chimpanzees, despite the fact that from a technical point of view a chimp genome is basically the same as a human genome. However, there are plans in the works to look at applying large-scale sequencing to human tissue in different ways (e.g. transcriptomics, epigenomics), so there is some flexibility on that front. In addition, Complete is very interested in looking at cancer genomes, which are often much further diverged from a normal human genome than a chimp is.
Why the curious choice of boundaries? Keith Robison is spot on: focusing only on large-scale human -omics will allow Complete to avoid the worst complexities of the service model (i.e. receiving many types of sample that require processing in many different ways), but still focus on the area where the market is the strongest.
Reid says that the goal of Complete is to create “a stream-lined factory” producing complete human genomes; by focusing on just one application (unlike any other genome facility) they can hone this process down to the point that they can do it cheaper and better than anyone else.
Other short-read platform providers (Illumina and ABI) claimed at the meeting that their technologies would be able to sequence complete human genomes for around $10,000 by the end of 2009. Reid argued that this price covered only reagents, and assumed a lower depth of coverage (e.g. 25X for Illumina).
Right now there’s no-one on the immediate horizon that can offer a whole genome sequence for as little as $5000, and certainly not with the convenience of the service model that Complete is looking to build. If Complete can deliver on its promises it will have at least a few months of breathing space before competitors start to close in – unless, of course, there are other companies out there in stealth mode doing the same thing as Complete. We’ll have to wait and see.
Complete has demonstrated an impressive ability to convince venture capitalists about their potential, but to make real money they will need to convince their potential customers – researchers, biotech and pharmaceutical companies and DTC genetic testing providers – that their product is solid.
It will take a lot more than one presentation and a single genome sequence to convince people to buy in; people will be following the first few collaborations with sequencing centres like the Broad and the Institute for Systems Biology very closely. If the Broad is happy with the quality and price of the sequence they get back, you can expect to see orders start to come in fast from other labs.
Reid told me that although the precise mix of customers is still (understandably) unclear, he expected somewhere around 50% of Complete’s business to come from researchers, and the rest from industry.
Most of the researchers I spoke to were cautious but interested in Complete’s product. There was very little excitement from a technical point of view – essentially, Complete’s product is just a faster, cheaper version of the other short-read platforms out there, not a potentially transformative technology like the long-read platforms of Pacific Biosciences or Oxford Nanopore – but if Complete really can offer an accurate, near-complete human genome sequence for $5000, it seems likely that there will be plenty of potential customers in the genomics community.
Still, can Complete’s business model result in a profitable empire, given the looming competition and the expense of constructing massive genome sequencing facilities? We’ll just have to wait and see. In the meantime, I’m enjoying the sensation of the cost of my own genome sequence dropping gradually towards the “affordable” category.