Barcoding and classification, again

Duck and cover, folks. I'm about to upset somebody.

I have previously been fairly critical of DNA barcoding, the proposal to use a small fragment of the COI gene - the mitochondrial gene for cytochrome c oxidase, subunit I - as a surrogate marker for species. That is, in simple terms, the use of COI sequences to "barcode" species, so that straightforward molecular sequencing will tell you whether you have a species or not, and how many there are in a given area.

Now I'm going to defend a new paper that proposes a barcoding method, for philosophical reasons. Here's the paper:

Inferring Species Membership Using DNA Sequences with Back-Propagation Neural Networks



Authors: A. B. Zhang (a, b); D. S. Sikes (c); C. Muster (d); S. Q. Li (a)

Affiliations: (a) Institute of Zoology, Chinese Academy of Sciences, Beijing, P. R. China; (b) Albanova University Center, Royal Institute of Biotechnology, Stockholm, Sweden; (c) University of Alaska Museum, Fairbanks, Alaska, USA; (d) Molecular Evolution and Animal Systematics, University of Leipzig, Leipzig, Germany

DOI: 10.1080/10635150802032982

Published in: Systematic Biology, Volume 57, Issue 2, April 2008, pages 202-215

Subjects: Animal Taxonomy; Bioinformatics;



Abstract

DNA barcoding as a method for species identification is rapidly increasing in popularity. However, there are still relatively few rigorous methodological tests of DNA barcoding. Current distance-based methods are frequently criticized for treating the nearest neighbor as the closest relative via a raw similarity score, lacking an objective set of criteria to delineate taxa, or for being incongruent with classical character-based taxonomy. Here, we propose an artificial intelligence-based approach—inferring species membership via DNA barcoding with back-propagation neural networks (named BP-based species identification)—as a new advance to the spectrum of available methods. We demonstrate the value of this approach with simulated data sets representing different levels of sequence variation under coalescent simulations with various evolutionary models, as well as with two empirical data sets of COI sequences from East Asian ground beetles (Carabidae) and Costa Rican skipper butterflies. With a 630- to 690-bp fragment of the COI gene, we identified 97.50% of 80 unknown sequences of ground beetles, and 95.63%, 96.10%, and 100% of 275, 205, and 9 unknown sequences of the neotropical skipper butterfly, to their correct species, respectively. Our simulation studies indicate that the success rates of species identification depend on the divergence of sequences, the length of sequences, and the number of reference sequences. Particularly in cases involving incomplete lineage sorting, this new BP-based method appears to be superior to commonly used methods for DNA-based species identification.

Keywords: Back-propagation; DNA barcoding; incomplete lineage sorting; neural networks; species identification

OK, so why do I like this paper over previous proposals? The authors point out the primary concern I have about existing barcoding techniques:

... an a priori similarity cut-off is needed to determine species status using these methods. It remains questionable whether such universal cut-off values exist, even among congeneric species ... . ... information is inevitably lost when differences among sequences are converted into genetic distances.... ... these non–character-based methods are also criticized as being incompatible with classical character-based taxonomy....

In other words, existing techniques rely on arbitrary cut-offs, discard information, and are incompatible with how we do taxonomy based on homologues ("character-based").

As has been pointed out many times, this makes these species identifications formally identical to phenetics: classification based on overall similarity. But what these authors propose is rather different, though they may not realise it themselves. What they suggest is an artificial neural network (ANN), specifically a back-propagation network. Backprop networks of this kind "learn" from a training set: you give the ANN exemplary cases and it "inductively" extends those lessons to new cases.
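To make the idea concrete, here is a minimal sketch of that training-and-generalising loop. This is not the authors' implementation: the sequences, "species", and network sizes are all invented for illustration, and a real barcoding network would train on hundreds of COI sites, not eight.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(seq):
    """Encode a DNA string as a flat one-hot vector (A, C, G, T per site)."""
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    v = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        v[i, table[base]] = 1.0
    return v.ravel()

# Hypothetical training exemplars: two "species" differing at two diagnostic sites.
species_a = ["ACGTACGT", "ACGTACGA", "ACGTACGC"]
species_b = ["ACTTACAT", "ACTTACAA", "ACTTACAC"]
X = np.array([one_hot(s) for s in species_a + species_b])
y = np.array([[1, 0]] * 3 + [[0, 1]] * 3)  # one-hot class labels

# One hidden layer, trained by plain back-propagation (full-batch gradient descent).
W1 = rng.normal(0, 0.1, (X.shape[1], 8))
W2 = rng.normal(0, 0.1, (8, 2))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(3000):
    h = sigmoid(X @ W1)                       # forward pass
    out = sigmoid(h @ W2)
    delta2 = (out - y) * out * (1 - out)      # backward pass (squared-error gradient)
    dW2 = h.T @ delta2
    dW1 = X.T @ ((delta2 @ W2.T) * h * (1 - h))
    W2 -= 0.2 * dW2
    W1 -= 0.2 * dW1

def classify(seq):
    """Assign an unknown sequence to whichever class scores higher."""
    h = sigmoid(one_hot(seq) @ W1)
    out = sigmoid(h @ W2)
    return "A" if out[0] > out[1] else "B"

print(classify("ACGTACGG"))  # an unseen variant close to species A
```

The point of the sketch is that the decision boundary is learned from the exemplars themselves, rather than being set by a pre-chosen similarity cut-off.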

Note that the system is not using an arbitrary degree of similarity here. The ANN doesn't include or exclude based on an a priori threshold, but on one taken from actual cases. This effectively means two things: first, that the similarity is empirical rather than conventional, and likely to be more accurate, since it is learned from close relatives; second, that the relations here are relations of identity, not similarity. An ANN of this kind learns to classify the way human taxonomists and specialists (in this case, of beetles) learn to classify: by training on exemplars and dealing with encountered exceptions. The sameness here is the sameness of informative characters, which the system comes to "know". These are, as I said, effectively homologies (although one might not expect that the homologies used by the system will be the same ones humans use - ANNs can behave quite surprisingly).

What is also nice about this paper is that it tests the accuracy of the system against known data. They offer, in their words, rigorous tests of the methodology. This disposes of my remaining objection to barcoding. It still (as the authors say) isn't a replacement for alpha taxonomy, but there's a possibility it may deliver on the other promises of barcoding.


I think because it falls back into cluster principal component analysis, even though it is done in a self-organising way, and hence is a form of phenetics itself. But I will have to think about that for a decade or so. I'll get back to you...

By John S. Wilkins (not verified) on 03 May 2008 #permalink

I love this typo!

"...such universal cut-off value sexist."

It's about to disappear!

By John S. Wilkins (not verified) on 03 May 2008 #permalink

John - a threshold could be chosen empirically from the standard methods too: you get your tree, and then draw a line across it and declare that anything branching below the line is within a species. Then you move the line up or down until you maximise the concordance in species attribution.
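That tune-the-line idea can be sketched in a few lines. Everything here is invented for illustration (toy sequences, made-up species labels, and simple p-distance in place of a tree), and the concordance measure is just pairwise agreement, not any published criterion.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical reference library: sequences with known species assignments.
reference = [
    ("ACGTACGT", "sp1"), ("ACGTACGA", "sp1"),
    ("ACTTACAT", "sp2"), ("ACTTACAA", "sp2"),
    ("GGTTCCAA", "sp3"), ("GGTTCCAT", "sp3"),
]

def concordance(threshold):
    """Fraction of pairs where 'distance < threshold' agrees with 'same species'."""
    pairs = list(combinations(reference, 2))
    hits = sum(
        (p_distance(s1, s2) < threshold) == (l1 == l2)
        for (s1, l1), (s2, l2) in pairs
    )
    return hits / len(pairs)

# Sweep candidate cut-offs and keep the one that best matches known taxonomy.
candidates = [i / 16 for i in range(1, 16)]
best = max(candidates, key=concordance)
print(best, concordance(best))
```

The threshold that comes out is still a single number, but it is calibrated against existing species assignments rather than fixed in advance.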

I'm generally sceptical of AI methods, because they're black boxes. For classification where there isn't a strong structure to the problem, they're probably as good or bad as any other system, but here they're throwing out a lot of knowledge (e.g. patterns of common descent). In this sense they are arbitrary: the model they produce can have no relationship to the mechanism that produced the data.

It's difficult for me to see how ANN methods can work better on sequence data, unless the evolutionary model being used is wrong, and not particularly robust. They might work better for morphological data, where the evolutionary model is less clear.

Lassi - I haven't read the paper, but I think the problem with using all of the data might be one of checking the ability of the NN. It sounds like they're using cross-validation to see if the NN is finding anything useful. You really have to do this, because the intention is to predict for new data, but in using the full data you just parrot what's there. It's a problem with black-box techniques. Model-based classification doesn't have to do this because it forces the classifier to use the similarities that the model thinks are important.
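The cross-validation point can be illustrated without any neural network at all. The sketch below is hypothetical: invented sequences, and a nearest-neighbour classifier standing in for whatever black box is being validated. Each sequence is held out in turn, and we ask whether the remaining data alone would have assigned it correctly.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Invented reference data with known species labels.
data = [
    ("ACGTACGT", "sp1"), ("ACGTACGA", "sp1"), ("ACGTACGC", "sp1"),
    ("ACTTACAT", "sp2"), ("ACTTACAA", "sp2"), ("ACTTACAC", "sp2"),
]

def loo_success_rate(data):
    """Leave-one-out: train on all but one sequence, test on the one left out."""
    hits = 0
    for i, (seq, label) in enumerate(data):
        train = data[:i] + data[i + 1:]                      # hold one out
        nearest = min(train, key=lambda r: p_distance(seq, r[0]))
        hits += nearest[1] == label                          # correct assignment?
    return hits / len(data)

print(loo_success_rate(data))
```

Scoring on held-out sequences is what distinguishes genuine predictive success from merely memorising the training set.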

The paper you're discussing compared their method against a previous attempt to analyze data in a way that differed from the usual method:

Abdo, Z. and Golding, G. B. (2007) A step toward barcoding life: A model-based, decision-theoretic method to assign genes to preexisting species groups. Syst. Biol. 56: 44-56.

Brian Golding was a co-PI on the grant that funded the first national barcoding network (which Paul Hebert and I co-wrote), which as far as I know paid for this study. For the past two years I have been a member of an advisory committee of a student whose project looked at analyzing barcode data using specific SNP-like diagnostics depicted as networks. My former postdoc advisor (Rob DeSalle) has promoted a character-based approach. Looking at more effective means of delivering species identification is important -- the point is to make it work, not to push a preferred analytical philosophy.

The aim is to use the simplest approach that can recover existing species names and, using some criterion (e.g., distance) calibrated against existing taxonomy, to identify probable new species in need of alpha taxonomic description. If current methods (one gene, NJ clustering) are not enough, then they will be updated as required. I hate to say it, but you are not even in orbit around the loop on this subject.

In summary, I don't think any DNA barcoders will be upset by this study (unless it is somehow erroneous) because their focus is on maximizing the effectiveness and efficiency of the tool -- these aspects are decided empirically.

I'm arriving at the issues associated with barcoding from a different perspective, as one who is aiding the identification of materials used for barcoding, based upon knowledge derived from the study of morphological variation and matching it to that found in type specimens (the essential step in determining validity, given that most types cannot be barcoded themselves). Hence, I hope I can be forgiven for displaying a certain degree of ignorance. Nonetheless, I am interested in improving my understanding of the assumptions those involved in barcoding make regarding the interpretation and use of sequence data for phylogenetic inference and for making identifications.

This paper is interesting in that it seems to provide yet another approach for establishing how similar a sequence (or subsequence) must be before it is regarded as identical. However, I am confused by the apparent "bias" in the above posts toward rejecting methods because they are somehow "phenetic" in determining topological order (how similar does a particular sequence have to be before it is regarded as identical?). Either sequences are identical with respect to some criterion that defines the fundamental topology of the neighborhood of relevant elements, or they are not.

The thrust of the criticism above is that some methods use an "arbitrary" degree of similarity. Since one might assume that mutation, presumably largely random, is sifted through selection that may be fluctuating or of uncertain direction, and reproduced with potentially varying degrees of exactness to produce the observed sequence, I find it hard to understand why all methods wouldn't be expected to be "arbitrary".

I also don't fully understand, whether using ANN methods or "NJ methods" (or other presumptively suitable definitions used to establish topological order relations), why one would not expect somewhat "arbitrary" outcomes, since they all assume that matching, and ultimately phylogenetic relationships, can be inferred through correspondence of nucleotides. In cases where a base substitution at a particular position has occurred and then reverted to its original form, one would not be able to distinguish change (evolution) from non-change, even though in one case there are two changes at that position and in the other there are none. There is no unique solution, although it is understandable that workers would tend to seek Hausdorff as opposed to non-Hausdorff criteria simply to make the problem simpler.
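The reversal problem raised above is easy to demonstrate by simulation. This is a hypothetical toy model (uniform random substitutions along one lineage, no selection): compare the true number of substitutions applied with the pairwise differences that remain observable afterwards.

```python
import random

random.seed(1)
BASES = "ACGT"

def evolve(seq, n_subs):
    """Apply n_subs random substitutions; a site may be hit more than once,
    and a later hit can revert an earlier one."""
    seq = list(seq)
    for _ in range(n_subs):
        i = random.randrange(len(seq))
        seq[i] = random.choice([b for b in BASES if b != seq[i]])
    return "".join(seq)

# A random 100-bp "ancestor" accumulating 60 substitutions.
ancestor = "".join(random.choice(BASES) for _ in range(100))
true_changes = 60
descendant = evolve(ancestor, true_changes)

# Observed differences undercount the true changes: multiple hits at one
# site collapse to at most one visible difference, and reversals to none.
observed = sum(a != b for a, b in zip(ancestor, descendant))
print(true_changes, observed)
```

This undercounting is exactly why model-based distance corrections exist, and it applies regardless of whether the downstream method is "phenetic" or not.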

Hence, it seems odd, to me at least, that some would suggest that "non-phenetic" methods are in some sense more appropriate than "phenetic" ones (or vice versa) with respect to the fundamental criterion of establishing how different one sequence (or subsequence) has to be before it is regarded as different (or its converse, how similar it has to be in order to be regarded as identical); or, as a topologist might ask, what defines the neighborhoods relevant to the structures in question? Both approaches suffer the same problem, as neither can distinguish the changed/not-changed outcome discussed above. If one assumes equiprobability of change over several hundred presumptively homologous positions, this effect is potentially not insignificant. The problem becomes more intractable if one considers the potential for non-independent positional change (due to differential mutation or differential selection), leading to an inherent uncertainty in any such estimated matching (ignoring the complications that might arise from assuming chaotic behavior at potential points of bifurcation).

Are there papers in the theoretical molecular literature discussing the issues associated with the problems I describe here?

As primarily a morphologist/alpha-taxonomist, I am keen to understand better the fundamental assumptions made by molecular biologists in attempting to infer taxonomic validity (the appropriate name for a given taxon under the ICZN) and in making inferences about the propinquity of descent based on barcode data.

It is a shame that the COI gene tells us little about what is going on in the nuclear genome that might be used to relate studies of morphology and molecular biology, thereby studying cause and effect. Nonetheless, such correlations provide interesting proxies and an alternative perspective and source of data, assuming of course that one understands the assumptions inherent in the methods used to make one's inferences.

By turkeyfish (not verified) on 04 May 2008 #permalink

Bob O'H (#5): "You really have to do this, because the intention is to predict for new data, but in using the full data you just parrot what's there."

I see no need to predict new data. By doing the analysis with an ANN you can get another "opinion" on the taxonomy. If the results are the same as humans have concluded, it is an interesting observation about ANNs. If the results are different... that is interesting, too.

"Model-based classification doesn't have to do this because it forces the classifier to use the similarities that the model thinks are important."

If you force the result to fit an a priori model, the biases of the model just get reinforced. I'm curious what the results would be if there were no model, i.e. if they came from an unsupervised learning algorithm.

By Lassi Hippeläinen (not verified) on 04 May 2008 #permalink