Guest post: Luke Jostins on the twice-sequenced genome

While I continue my work-induced blog coma, here's a guest post from Luke Jostins, a genetic epidemiology PhD student and the author of the blog Genetic Inference, delivering a fairly scathing critique of a recent whole-genome sequencing paper based on Life Technologies' SOLiD platform.

McKernan et al. (2009). Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research. DOI: 10.1101/gr.091868.109


In prepublication at the moment is a paper from the labs of ABI, makers of the SOLiD sequencing system. It is the first published human genome to be sequenced entirely by SOLiD, the lowest-coverage non-454 second-generation genome, and, interestingly, the second whole-genome publication not to come out in either Science or Nature. There is a lot of interesting stuff going on with this paper, and with the discussions going on around it, and there are a lot of stories to tell about it.


Genome Blogging

This is a case of the genomics blogging community breaking a very interesting story hot off the press. Dan Koboldt, in a post at his blog MassGenomics, reported the pre-publication of the SOLiD genome very rapidly after its appearance. He also had a very interesting insight: the individual sequenced was the SAME INDIVIDUAL as in Bentley et al., the Illumina genome! The SOLiD paper makes no comparison with that work, and does not even mention the overlap, despite reviewing previous work done on this individual:


The NA18507 genome has not been Sanger sequenced to more than 0.5× sequence coverage (Kidd et al. 2008). However, it has been extensively genotyped as part of the HapMap project (Frazer et al. 2007) and some regions shotgun sequenced to higher depth as part of the ENCODE project (Birney et al. 2007).

Score one for blog-based reporting. Score another one for the fact that Kevin McKernan, lead author of the paper, spoke up in the comments of Dan's post, saying that they had not done a comparison due to the lack of genotype information published by Illumina. Why they don't mention this justification in the paper, or even comment on the existence of an earlier, higher-coverage second-generation sequencing project on the same individual, is far from clear (and it is a bit distressing that the reviewers failed to pick up on this).

Anyway, a quick-and-dirty comparison is not too hard, so let's do one now.

Illumina vs SOLiD

A quick word on the relevant differences between Illumina and SOLiD in terms of sequencing: the Illumina system reads single nucleotides, each dyed a different colour, so the sequence ACATCA would give green for A, blue for C, green for A, yellow for T, and so on; in contrast, the SOLiD system reads pairs of bases as single colours (giving 16 colours in total); the sequence ACATCA would give purple for AC, orange for CA, red for AT, and so on.

Note that the SOLiD method determines each base pair from two different bits of information: this makes it less prone to sequencing errors, since an incorporation error has less of an effect on the call. SOLiD's claim is that this higher accuracy allows them to call bases accurately using only two aligned reads, meaning you need lower sequence coverage: the SOLiD genome has only 17.9-fold coverage, a lot less than the 40-fold coverage used by Illumina.


SNP and indel performance

The SOLiD paper calls a similar number of SNPs to the Illumina paper: just under 4M, of which about 0.8M are new. They validated these both by genotyping about a hundred independently and by checking their homozygous calls against the HapMap genotypes. Despite the low coverage, SOLiD had accuracy similar to Illumina's: 0.76% false positives, 99.1% of HapMap sites called, 99.16% concordance (this compares to about 0.62%, 99.3% and 99.07% from Bentley et al.).

They called about half as many indels as Illumina (0.23M versus 0.41M), but across a far wider range of sizes: they report insertions in all size ranges up to 1 kb, and deletions all the way up to 100 kb. However, I expect these differences are as much down to the algorithms used to call the indels as to the sequencing technology; it would be interesting to re-analyse the Illumina and SOLiD data using the same alignment and calling software. This would be pretty easy to do; the data is all available online: Illumina and SOLiD.
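
In the meantime, here's a flavour of what the downstream end of that comparison could look like: a minimal Python sketch that checks two sets of SNP calls for concordance at shared positions. The file names and the three-column format (chromosome, position, genotype) are hypothetical, purely for illustration; neither group released their calls in exactly this form.

    # Compare two hypothetical SNP call sets at shared positions.
    def load_calls(path):
        calls = {}
        for line in open(path):  # assumed columns: chrom, pos, genotype
            chrom, pos, gt = line.split()[:3]
            calls[(chrom, int(pos))] = "".join(sorted(gt))  # normalise allele order
        return calls

    illumina = load_calls("illumina_snps.txt")  # hypothetical file names
    solid = load_calls("solid_snps.txt")

    shared = set(illumina) & set(solid)
    concordant = sum(1 for site in shared if illumina[site] == solid[site])
    print("%d shared sites, %.2f%% concordant"
          % (len(shared), 100.0 * concordant / len(shared)))
    print("%d Illumina-only, %d SOLiD-only"
          % (len(illumina) - len(shared), len(solid) - len(shared)))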


Going Deeper

The SOLiD genome paper is, as far as I know, the most in-depth study of variation within a single human genome, and they try out a lot of pretty impressive tricks. Beyond indels, they also find 91 inversions and 565 copy-number variants; when combined with the indel data, they effectively have information on structural variants ranging in size from 1 bp to 1 Mb.

They manage to do phasing [that is, inferring whether nearby variants were inherited on the same copy of a chromosome - DM] using paired-end read data; when you are very confident of every base call in every read, you can start assembling those reads into individual haplotype contigs [i.e. long stretches of sequence that can be said with high confidence to have been inherited as a block - DM]. They manage to produce blocks of phased haplotypes up to 215 kb long, and confirm their phasings are about 99% accurate compared to HapMap SNPs.
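
To make the idea concrete, here is a toy Python sketch of read-backed phasing: whenever one fragment (a mate pair, say) reports alleles at two heterozygous sites, those alleles must sit on the same chromosome copy, so the sites can be chained into one phased block. This is just the core idea with invented data, not the paper's actual algorithm, and it ignores conflicting observations entirely.

    # Toy read-backed phasing: fragments seeing two het sites link them into a block.
    fragments = [
        {100: "A", 250: "T"},    # one fragment covering het sites 100 and 250
        {250: "T", 900: "G"},    # shares site 250, so extends the same block
        {5000: "C", 5600: "A"},  # no shared site: starts a separate block
    ]

    blocks = []
    for frag in fragments:
        for block in blocks:
            if set(frag) & set(block):  # shared site => same haplotype
                block.update(frag)
                break
        else:
            blocks.append(dict(frag))

    for block in blocks:
        print("phased block:", sorted(block.items()))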

Along with this, they make an in-depth study of the functional implications of the variants they see, including looking for OMIM variants, potential gene disruptions and even a handful of potential gene fusions. They also use functional information to produce lists of protein families and gene functions that have been under different types of selection (though I have my doubts about how reliable this is).


The fudges

So, SOLiD wins, yes? They have done a much lower-coverage sequence of the same genome as Illumina, matched their SNP calling for quality, beaten them in range (if not quantity) of indel sizes, and done a load of other fancy tricks, all for less than half the coverage. Right?

Well, no, not really. SOLiD cheated, in a number of ways.

Firstly, they fudged their coverage. While their abstract declares 17.9-fold coverage, their Results section reports that they aligned over 87 Gb of sequence to the genome. They generated a total of 89 Gb of sequence, but the only way to learn that is to look at their NCBI submission; this is a 31X genome. They got the 17.9-fold coverage by throwing out 41% of their reads through various quality-control criteria. Now, there is nothing wrong with QC, but there IS something wrong with taking the best 18X of coverage from a 31X genome and pretending that you have an 18-fold genome. They will have seriously enriched their sequence data for high-quality reads, and thus have much higher quality sequence than what you or I would have if we ordered 17.9-fold coverage from a sequencing facility.
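
The arithmetic is easy to check. Assuming roughly 2.85 Gb of mappable reference sequence (my assumption; the paper doesn't state the denominator it used):

    # Back-of-envelope fold-coverage check (2.85 Gb mappable genome is an assumption).
    MAPPABLE_GENOME_GB = 2.85

    for label, gb in [("raw (NCBI submission)", 89.0),
                      ("aligned (Results section)", 87.0),
                      ("kept after QC (59% of aligned)", 87.0 * 0.59)]:
        print("%s: %.1f Gb -> %.1f-fold" % (label, gb, gb / MAPPABLE_GENOME_GB))

That comes out at roughly 31-fold raw and 18-fold retained, consistent with the figures above.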

Secondly, they fudged their genotype calling. Their HapMap validation was only performed on homozygous SNPs - but homozygous SNPs are the strength of SOLiD's low-coverage, high-accuracy approach. It is heterozygous SNPs that are the weakness, since no amount of read quality can compensate for not seeing both alleles. And the authors completely fail to report their heterozygous genotype concordance. They do, however, give a graph, and by sitting down with a ruler we can estimate the quality of the het calls: something like 96% called, with 91.2% concordance. When averaged over all HapMap SNPs, SOLiD quality drops to something like 98.1% called, 97.5% concordance. This is not good performance, and this fact is nowhere indicated in the SOLiD paper.
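
For the record, the averaging here is just a weighted mean of the homozygous and heterozygous performance over all HapMap sites. A sketch of the mechanics, using an assumed 70:30 hom:het split (an illustrative figure, not from the paper, so it won't reproduce the exact numbers above):

    # Weighted average of hom and het call performance (the 70:30 split is assumed).
    hom_frac = 0.70
    hom_called, hom_concord = 0.991, 0.9916  # reported homozygous performance
    het_called, het_concord = 0.960, 0.912   # het performance, read off the graph

    called = hom_frac * hom_called + (1 - hom_frac) * het_called
    concord = hom_frac * hom_concord + (1 - hom_frac) * het_concord
    print("overall: %.1f%% called, %.1f%% concordant" % (100 * called, 100 * concord))

With this split the called fraction lands very close to my estimate; the concordance comes out a little lower, since the exact figure depends on the true hom:het site counts.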

Finally, a relatively minor gripe, but one that kicked me while I was down. SOLiD claim that you could replicate their study in "just one or two 30-50Gb runs from a SOLiD instrument at an estimated reagent cost of under $30,000". Really? To generate 89 Gb of sequence? It took them 3 runs, and while SOLiD have at times had 50 Gb runs, those are still the exception. I expect it would cost around $40-50k to repeat this sequencing experiment, based on it taking 3-4 runs.


What is going on here?

What follows is idle speculation, and thus (hopefully) not libel. My guess is that SOLiD is attempting to reposition itself as the Low Coverage Sequencing Company. They have still failed to topple Illumina as the market leader, and I expect they think that if enough people start thinking of SOLiD as "the guys who did a good-quality 18X genome", all the people who found the low-coverage 1000 Genomes Project work sexy will start looking to SOLiD. But to do this, they had to fudge a few things: get the coverage down by cutting out lots of reads, obfuscate the low quality of heterozygous calls, and give an overenthusiastic estimate of how cheap their technology is.

It would be interesting to know why this paper did not end up in Nature or Science. This is only the second whole-genome sequencing project not to do so [after the SJK Korean genome, also published in Genome Research - DM], and people are saying that this is because whole genomes just aren't new enough any more. But I don't think that is the case; this isn't just any genome, it is both the first SOLiD genome and by far the most detailed analysis of a newly sequenced individual we've yet seen, with a lot of new and interesting stuff in it. I think the reason it is not in one of the big two might have something to do with the methodological flaws in the paper; Nature or Science reviewers might have asked awkward questions. But perhaps Genome Research was desperate enough to grab a genome and not ask too many questions about what SOLiD had left out. It is distressing that it took a blogger like Dan Koboldt to start blowing this thing open.

Comments

NA18507 is available from both GA2 and ABI SOLiD, but unlike NA07022 from Complete Genomics, I can't use it. I want the rs#s and genotypes.

Does anyone have a way to cross-reference SAM/BAM (or another sequence format) against dbSNP?
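
One possible route, sketched below with pysam (a suggestion, not something I've run against these particular files; the BAM and site-list names are hypothetical): pile up the reads at each dbSNP position and tally the bases you see.

    # Tally observed bases at dbSNP positions in a BAM file using pysam.
    import pysam

    # assumes the BAM is coordinate-sorted and indexed
    bam = pysam.AlignmentFile("na18507.bam", "rb")  # hypothetical BAM
    for line in open("dbsnp_sites.txt"):            # rsid, chrom, 1-based pos
        rsid, chrom, pos = line.split()[:3]
        pos = int(pos)
        counts = {}
        for column in bam.pileup(chrom, pos - 1, pos, truncate=True):
            for read in column.pileups:
                if read.is_del or read.is_refskip:
                    continue  # skip deletions and reference skips
                base = read.alignment.query_sequence[read.query_position]
                counts[base] = counts.get(base, 0) + 1
        print(rsid, chrom, pos, counts)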

Yes, well done, and an excellent read. I hope somebody jumps all over the two sequence trace archives you link to and does some more comparison. It would be nice to compare coverage variability from region to region, or coverage gaps (size and frequency), for example.

But... I think you've forgotten the Venter genome, which was also a whole-genome sequencing project (albeit not using NGS), and was not published in either Nature or Science (for reference, it was in PLoS Biology).

@Richard

Yeah, the Venter genome was a big oversight on my part (you are right that I forgot about it because it was not an NGS genome). That actually makes 3/7 whole-genome sequencing papers that weren't published in Nature/Science.

Though I suspect that the Venter genome was published in PLoS Biology as a show of support for Open Access, rather than because it was not big enough to go for Nature/Science.

Luke,

First, thank you for the post, this is great.

You point out that 41% of the reads were not included in the AB count, explaining the difference between 30x and 18x. Is it because they do not map? If so, I suspect other groups have done the same thing, reporting in the typical 30x figure only the reads that map. Are we sure that the same thing does not happen with Illumina when they report a 30x genome? I have no idea, but I would like to know.

A key question for comparing both technologies: if Illumina says they generate x reads, what % is usable? And what is this percentage for AB? I am looking for some sort of normalization factor to compare both technologies fairly.

Vincent

By Vincent Plagnol on 04 Aug 2009

Excellent post, but let's get one thing straight. With all respect to Dan Koboldt, it's not been a secret that ABI and Illumina were sequencing the same Yoruban HapMap genome. That's been well known since the Marco Island meeting earlier this year.

That said, the ABI paper was submitted 10 weeks after the Illumina paper(s) in Nature (November 2008), so even if there wasn't time or opportunity to do the comparison, the precedence should have been acknowledged and discussed by the authors, if necessary with the prompting of the reviewers and the editor.

@Vincent

In both the Illumina and SOLiD papers, about 97% of reads aligned. The 41% figure is the proportion of ALIGNED reads that SOLiD threw out. This doesn't happen with Illumina, because Illumina has tended to use Maq for alignment, which has a fairly strict 'don't throw reads out' policy, instead using mate-pair alignment correction and good error modelling to make every read count.

Either way - if we use like-for-like comparisons, Illumina generated 40X of alignable reads, SOLiD 30X.

Luke,

Disclaimer: I have been out of date on the Solexa/Illumina system since June of last year, so what I say here may no longer be accurate.

I think you are barking up the right tree. I was concerned (at Solexa) about comparing runs from machine evolutions under different conditions over time, and also about identifying genuine errors in chemistry as distinct from detection artefacts - important for platform optimization.

The Solexa platform filters out 'overlapping clusters', i.e. clusters overlapping in 2D on the chip surface, having been randomly arrayed; these have convolved signals from more than one DNA template. They are clearly not the same as errors, and not of much practical use either. They are 'PF' filtered away. What is left is data from single templates.
The philosophy was then to force everything from this pure fraction to align to its best location(s), then count the errors. PhageAlign was written to do this. By isolating single templates and forcing them all to align, your 'error' count is thus comparable between data sets and representative of your platform's behavior. Unfortunately, on larger runs PhageAlign was too slow; Eland was then used, but this only dealt with 0, 1 or 2 mismatches in the sequence. As it turns out, typically over 96% of the reads would align under this constraint, because the error rates were quite low anyway (perhaps this has improved since). What didn't align often turned out to be contaminants, imaging artefacts and other cruft, including some high-error reads. Solexa error rates and yields were calculated in this way, and I attempted to give a talk about this very subject right the way back at AGBT 2005 (I think).

We were a bit concerned by a paper that came out around that time where data from another NGS platform had been 'filtered by alignment' - effectively using a survivorship bias to cream off good data; error rates were then calculated from this data and quoted as representative of the platform's behavior as a whole, in isolation from any statement of what fraction of the total data this represented (can't remember the authors). It seemed a bit odd, and we didn't want to do it that way. Again, I covered this in my AGBT talk, perhaps badly.

Perhaps with beads, which don't visibly overlap but may have more than one template amplified on them, you can't do an equivalent upfront PF filtering, and you have to throw everything at the target and do it that way? But I agree that when measuring your platform's overall efficiency, you have to state used data as a fraction of total data per unit time.

c.

By Clive G. Brown on 05 Aug 2009

Hi,

Nice article, but I'd argue with this point:

-it would be interesting to re-analyse the Illumina and SOLiD data using the same alignment and calling software.

You can align the reads using the same software, but you cannot call SNPs from Illumina and SOLiD using the same algorithm, because SOLiD has dibase encoding. In Illumina data, a one-base difference from the reference could be a SNP, while in AB data a single colour change will always be a sequencing error - two adjacent colour changes are required for a SNP.

This is probably why they got roughly the same SNP rate with fewer aligned reads.

Also, I would not get too hung up on coverage stats. Illumina also produces a lot of crappy reads that get filtered out in the pipeline before you try to align them. AB gives you more reads, but you may get a lower alignment rate.

Also, you can argue that AB cheated by cherry-picking its best reads, but the final SNP-calling algorithm is going to look at the quality score of the base called anyway, effectively doing the same thing. So Illumina reads may not have been filtered before SNP calling, but the SNP caller itself will give less weight to low-quality reads.

I think the most important stat would be to say Illumina can call SNPs with X accuracy on a human genome for Y dollars, vs AB's accuracy and cost.

The stats in the middle may be losing sight of the forest for the trees.

Thanks for the post!

That's an exciting post!

Definitely, comparing to what is out there already should be compulsory... same with the NGS tools that keep coming out :)

Going back to the supplementary data, Solexa had 55x coverage looking at their passing-filter (useful) reads, and 40x coverage looking at 'Non duplicate PF reads aligning to genome', which I am sure involves some QC. So I would not say 'cheating' for SOLiD's 18x from 31x raw coverage.

It's unfortunate, though: even with a genome re-sequenced using competing technology, I am still unable to find out how much SOLiD's 2-base encoding really helps compared to Illumina/Solexa.

By Sumit Middha on 05 Aug 2009

"it would be interesting to re-analyse the Illumina and SOLiD data using the same alignment and calling software. This would be pretty easy to do; the data is all available online: Illumina and SOLiD"

Which sort of begs the question as to why you didn't do that before taking random potshots at this paper.

"What follows is idle speculation, and thus (hopefully) not slander. My guess is that SOLiD is" not your sequencing platform of choice.

What a great read. I skimmed through the paper and missed the coverage issue and the fact it was the same genome, well spotted.

I would very much like everyone reporting what they are getting out of these instruments to be much more open about exactly what they did. It is a pain to have to refer to online supplementary material just to find out that data came from three runs and not one.

As this technology evolves so rapidly, any comparison made today is almost useless. We have seen about a tenfold increase in like-for-like sequencing on Illumina from GA1 to GA2x (1x36bp runs).

Another gripe is the marketing of 100 Gbp by the end of 2009, and similar claims, without making it clear that this comes at X additional cost and Y additional time. At the same time, the marketing people from Illumina and ABI take pot shots at 454's emphasis on read length as being so important.

Unfortunately this is the latest biology cash cow for investors. If Google can get evil, will Illumina follow ABI's lead from the past ten years?

By cancerboy on 06 Aug 2009

You may want to correct your explanation of 2-base encoding on the SOLiD. There are only 4 colors, not 16; the color-space code is redundant. For example, AA, TT, CC and GG are all blue; AC, CA, TG and GT are all green, and so on. Each base is queried twice: once with the base before and once with the base following. Taking your example: ACATCA would give AC=green, CA=green, AT=red, TC=yellow, CA=green. This sequence of colors could equally represent ACATCA or its complement TGTAGT (or CACGAC or GTGCTG). If you know the first base is A, there is only a single DNA sequence possible. A single SNP always affects 2 adjacent color calls, so it can be distinguished from an instrument error that changes only one color call.
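
[That encoding is easy to play with in code. A minimal sketch of the scheme described in the comment above, with colour numbers 0-3 standing in for blue, green, yellow and red - DM]

    # SOLiD-style 2-base colour encoding, per the comment above (0=blue ... 3=red).
    COLOUR = {}
    for colour, pairs in enumerate([("AA", "CC", "GG", "TT"),   # blue
                                    ("AC", "CA", "GT", "TG"),   # green
                                    ("AG", "GA", "CT", "TC"),   # yellow
                                    ("AT", "TA", "CG", "GC")]): # red
        for pair in pairs:
            COLOUR[pair] = colour

    def encode(seq):
        # one colour per overlapping base pair, so each base is read twice
        return [COLOUR[seq[i:i + 2]] for i in range(len(seq) - 1)]

    def decode(first_base, colours):
        # the known first base anchors the otherwise ambiguous colour string
        seq = first_base
        for c in colours:
            seq += [b for b in "ACGT" if COLOUR[seq[-1] + b] == c][0]
        return seq

    assert encode("ACATCA") == [1, 1, 3, 2, 1]   # green, green, red, yellow, green
    assert encode("TGTAGT") == encode("ACATCA")  # the complement gives the same colours
    assert decode("A", encode("ACATCA")) == "ACATCA"

[The last assertion shows why the anchoring base is needed: without it, the colour string decodes equally well to any of the four sequences listed above - DM]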

87 Gb was raw sequence, not mappable sequence. Of the 87, 59% mapped to the reference (i.e. 41% did not map) and was used for downstream analysis. This is in contrast to Illumina, which uses Chastity to filter reads before even mapping to the genome.

A common misunderstanding, but misleading when used to argue that SOLiD is "enriching" for good reads. Not the case.

@JStuart

That isn't what the paper says. From the 'Results' section:

We generated 76.53 Gb of mate-paired reads that align to the reference human genome (hg18)

and

In addition, we generated "fragment" reads (45–50 bp), of which 10.5 Gb [...] align to hg18

The SRA entry says that 89 Gb of raw sequence were generated.

The 59% that they (well, you) used was sequence that passed a number of mapping-based filtering steps. This is in contrast to the Illumina paper, which reported its coverage for the 97% of reads that mapped at all.

But you (and @Sumit Middha) are right that Illumina is also guilty of a similar 'hidden QC' in the Bentley et al. paper: they 'purity filtered' out 18% of their reads without indicating so in the text, giving the impression that they generated 135 Gb of sequence data when they in fact generated 165 Gb. I should have picked up on that when writing the post. I suppose this is generally a problem of attempting to get a fair assessment of a technology from research published by the manufacturer.

Actually, the paper says that "we generated 76.53 Gb of mate-paired reads that align to the reference human genome (hg18), of which 49.07 Gb have both tags mapping to the genome as uniquely placed reads."

So it appears to me that they used only uniquely mapped reads that could be paired for the downstream analysis (at least for mate-pairs), and then brought in some of the fragment reads to complement the analysis. Can someone explain to me how this is cherry-picking? I would think that if one were performing an analysis, one would want only paired reads that mapped uniquely.

Someone please explain.

@AltaS

In general, people don't use such strict mapping criteria. Maq and BWA (the most widely used aligners), for example, map the vast majority of reads (97%) and assign each a mapping quality based on how well it maps. The caller then combines read and mapping quality scores to make calls from the full depth, without throwing out any reads.
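
To illustrate the two philosophies (hard-filtering reads up front versus keeping everything and down-weighting by mapping quality), here is a toy sketch with invented reads at a single site; MAPQ is on the standard -10*log10(error) scale:

    # Toy comparison: hard filtering vs quality-weighted evidence at one site.
    reads = [("A", 50), ("A", 40), ("G", 5), ("A", 3)]  # (observed base, MAPQ)

    # philosophy 1: discard anything below a MAPQ threshold
    kept = [base for base, mapq in reads if mapq >= 30]
    print("hard-filtered depth:", len(kept))

    # philosophy 2: keep every read, weighted by P(mapping is correct)
    weights = {}
    for base, mapq in reads:
        weights[base] = weights.get(base, 0.0) + (1.0 - 10.0 ** (-mapq / 10.0))
    print("weighted evidence:", weights)

The hard filter reports a depth of 2; the weighted approach keeps all four reads but lets the two dodgy ones contribute very little.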

There is nothing inherently wrong with the way ABI did it; setting a high criterion for mapping is fine, if that is the way you want to go. The problem is that what they have then is not an 18X genome, it is a 31X genome that they have chosen to process in a particular way. The problem I have is that, by claiming an 18X genome in the abstract, they are inviting comparison to all the 30-40X genomes that have already been published; this is not a fair comparison, because by setting a very high criterion for mapping, relative to these other genomes, they are cherry-picking just the best 18X of reads.

Does that make more sense?