…in Europe. I’ll get to that in a moment. You’ve probably heard of the E. coli outbreak sweeping through Germany and now other European countries that has caused over one thousand cases of hemolytic uremic syndrome (‘HUS’). What’s odd is that the initial reports are calling this a novel hybrid or some new strain of E. coli.
BGI has sequenced one of these isolates using Ion Torrent, and Nick Loman assembled the data. Without getting too technical, the genome is actually in about 3,000 pieces, but with those data (and thanks to Nick for assembling them and releasing them) I was able to perform multilocus sequence typing (‘MLST’). Basically, we look at the partial sequences of several genes (in this case, seven) to identify the strain’s sequence type: think of it as a molecular barcode (for the scheme and details, see here).
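To make the barcode idea concrete, here is a minimal Python sketch of how an MLST lookup works in principle. Everything in it is an assumption for illustration: the toy sequences, the allele numbers, the sequence types, and the function names `call_alleles` and `sequence_type` are all made up, and none of it comes from the real database or from the tools I actually used.

```python
# Toy illustration of MLST. Every sequence, allele number, and ST below is
# invented for this example; none comes from a real MLST database.

LOCI = ["adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA"]

# Hypothetical allele tables: locus -> {partial gene sequence: allele number}
ALLELE_DB = {
    "adk":  {"ACGTAC": 1, "ACGTAT": 2},
    "fumC": {"GGCATT": 1, "GGCATA": 2},
    "gyrB": {"TTGACC": 1},
    "icd":  {"CAGGTA": 1, "CAGGTC": 2},
    "mdh":  {"ATGCCA": 1},
    "purA": {"GATTCA": 1},
    "recA": {"CCAAAAGT": 1},
}

# Hypothetical profile table: allele numbers in LOCI order -> sequence type
ST_DB = {
    (1, 1, 1, 1, 1, 1, 1): 100,
    (2, 1, 1, 2, 1, 1, 1): 101,
}


def call_alleles(observed):
    """Return the allele number per locus, or None for an unknown sequence."""
    return {locus: ALLELE_DB[locus].get(observed[locus]) for locus in LOCI}


def sequence_type(profile):
    """Return the ST for a complete profile, or None if it is not in the table."""
    key = tuple(profile[locus] for locus in LOCI)
    return ST_DB.get(key)


# A made-up isolate: one partial sequence per locus.
sample = {
    "adk": "ACGTAT", "fumC": "GGCATT", "gyrB": "TTGACC", "icd": "CAGGTC",
    "mdh": "ATGCCA", "purA": "GATTCA", "recA": "CCAAAAGT",
}

profile = call_alleles(sample)
print("-".join(f"{locus}{profile[locus]}" for locus in LOCI))
print(sequence_type(profile))  # 101 in this toy table
```

A sequence that is not in the allele table comes back as `None`, and so does the sequence type. That is the "novel allele" situation: the barcode doesn't match anything on file, even if it is only one base away from something that is.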
So what did I find?
This EHEC strain is most likely a very close relative of ST678 (details in a bit). In fact, according to the mlst.net strain database, there is a strain, “Jan-91”, isolated in 2001* from Europe (no further geographic information is provided). That strain belongs to phylogroup D, and is associated with HUS…just like the outbreak strain. And the older strain also has the exact same serotype as the outbreak strain, O104:H4.
Now, the outbreak strain sequence isn’t identical (the data are at the end of this post). “Jan-91” has an allele profile of adk6-fumC6-gyrB5-icd136-mdh9-purA7-recA7 (to orient you, adk is the gene, and 6 is the particular variant, or allele, of adk). There are three differences:
1) In the outbreak strain, adk is a novel allele that differs from adk6 by one point mutation at position 30 (if I counted correctly; it’s late as I write this…)
2) In the outbreak strain, the icd allele matches icd136 exactly; however, the genome sequence lacks the last two bases. Given that the genome assembly is in over 3,000 pieces (‘contigs’), I think this is missing data, not biology.
3) In the outbreak strain, the recA allele differs from recA7 by one insertion. “Jan-91” has a sequence of “AAAA”, while the outbreak strain has a sequence of “AAAAA” (below, it’s recorded as “aAAAA” to indicate the difference). With Ion Torrent (and other high throughput sequencing technologies), when you have ‘runs’ of the same nucleotide, such as “AAAA”, it’s not unusual for a base to be added or deleted, which could yield a ‘false’ “AAAAA.” This could be sequencing error, but I can’t rule out a real insertion (i.e., an extra A that’s real).
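The three kinds of difference above (a point mutation, missing bases at the end, and an extra base in a run) can be told apart mechanically. Here is a rough Python sketch, again with made-up toy sequences rather than the real alleles; the function name `classify_difference` is hypothetical, and this is not the procedure I used (I compared the sequences by eye).

```python
def classify_difference(ref, obs):
    """Describe how an observed sequence differs from a reference allele.

    Only handles the simple cases discussed in the post.
    """
    if obs == ref:
        return "exact match"

    # Observed sequence stops early: likely missing data in a fragmented assembly.
    if len(obs) < len(ref) and ref.startswith(obs):
        return f"truncated: last {len(ref) - len(obs)} base(s) missing"

    # Same length: count substitutions and report 1-based positions.
    if len(obs) == len(ref):
        positions = [i + 1 for i, (r, o) in enumerate(zip(ref, obs)) if r != o]
        if len(positions) == 1:
            return f"point mutation at position {positions[0]}"
        return f"{len(positions)} substitutions"

    # One base longer: find a base whose removal restores the reference.
    if len(obs) == len(ref) + 1:
        for i in range(len(obs)):
            if obs[:i] + obs[i + 1:] == ref:
                base = obs[i]
                # Treat the insertion as "in a run" if a neighbour is the same base.
                in_run = (i > 0 and obs[i - 1] == base) or (
                    i + 1 < len(obs) and obs[i + 1] == base
                )
                note = " in a homopolymer run (possible sequencing error)" if in_run else ""
                return f"single-base insertion of {base}{note}"

    return "more complex difference"


# Toy sequences, not the real adk, icd, or recA alleles.
print(classify_difference("ACGTACGT", "ACGAACGT"))
print(classify_difference("ACGTACGT", "ACGTAC"))
print(classify_difference("GCTAAAAGCT", "GCTAAAAAGCT"))
```

Note that the code can only flag an insertion in a run as suspicious. It can't tell a sequencing artifact from a real extra A any better than I can; that takes more reads or a different sequencing technology.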
While this is obviously a very preliminary analysis of a very preliminary assembly, I don’t understand why this strain is being called ‘new’, ‘mutant’, or anything else. It’s not a bolt from the blue: it looks like a nearly identical strain that caused HUS a decade ago in Europe. I would add the obvious qualifier that there very well could be massive gene gain and loss (I haven’t looked at that yet). I’m guessing that the reports of this strain being very different were based on comparisons to the genomes of other HUS strains, which are pretty divergent. But we have seen this MLST sequence type before, associated with this serotype and this disease syndrome.
All that being said, this is a very serious outbreak. I don’t mean to downplay the seriousness of this as a public health and agricultural crisis by raising this issue. And it will be very interesting to see how different this strain is from other HUS strains. If we’re lucky, the “Jan-91” E. coli strain still exists in someone’s freezer, and we can see how it’s evolved over the last decade. It’s especially disconcerting that this strain is resistant to so many antibiotics.
An aside: Many kudos to BGI for publicly releasing the data.
Update: There’s a new assembly using a different method. I haven’t checked that yet.
Update II: Others are doubting that this is a novel strain:
Quoting scientists at the University of Münster, the institute rebutted earlier reports that the newest strain of E. coli had never been previously identified, calling it a “hybrid clone” that drew together the virulent properties of other strains. “Reports that this is a completely new type of pathogen are not accurate,” the institute said.
Update IV: The second outbreak isolate genome sequence has the identical MLST sequence (ST678).
*2001 might be the sequencing date, not the isolation date. I can’t tell.
MLST data (on some browsers, this is getting cut off, so you can download it here):