A Sequence Like Poison

New Scientist reports on research to identify DNA sequences that cannot be found in any nucleotide database. These sequences are short -- so as to decrease the probability that they are missing due to chance alone -- and the researchers from Boise State University have identified over 60,000 15-nucleotide stretches of DNA that are not present in any known sequenced region from any species. They also found 746 sequences of 5 amino acids that are not present in any known polypeptide. The article does not indicate whether the scientists utilized any hook and ladder or Statue of Liberty approaches in their analysis.

The researchers postulate that some of the sequences cannot be tolerated by living organisms. Many of them are probably missing simply due to chance -- and who knows how many will be found as our DNA sequence databases continue to expand exponentially. But these guys are specifically interested in those sequences which may act like genomic poison. They plan to test some of the amino acid sequences in bacteria to see if they can be tolerated.
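For readers who want to see what such a search looks like in principle, here is a minimal sketch. It is not the researchers' method (the article does not describe their pipeline); the function name `absent_kmers` and the toy sequences are made up for illustration, and a real search would have to handle both DNA strands, ambiguous bases, and databases far too large for this brute-force approach.

```python
from itertools import product

def absent_kmers(sequences, k, alphabet="ACGT"):
    """Return the sorted k-mers over `alphabet` that occur in none of `sequences`."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence and record each k-mer.
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    # Enumerate every possible k-mer and keep those never observed.
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in seen)

# Toy "database": invented sequences, not real data.
toy_db = ["ACGTACGT", "TTGACCA"]
missing = absent_kmers(toy_db, k=2)
print(missing)
```

The same function works for peptides if you pass the 20 one-letter amino acid codes as `alphabet`, though enumerating all 15-mers of DNA this way (about a billion strings) would need something more memory-efficient.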


I checked some short sequences a few years ago, when ego-BLASTing was in vogue. The most interesting sequence that doesn't exist, IMO, is SATAN, which does not turn up on a BLAST search of the protein databases (even though it contains four of the commonest amino acids). It's the most powerful evidence I know of for ID!*

*(Of course, it's piss-poor evidence for ID, but I stand by my statement.)

Would there necessarily be any sort of correlation to viral DNA?

Also, can you elaborate on the Hook and Ladder or Statue of Liberty approaches?

From the New Scientist article it isn't clear if they've considered simple steric effects for the missing peptide sequences -- some series of amino acids may simply not fold well due to sidechain crowding and have never made it into a functional protein for that reason.

Interesting suggestion, Jonathan - I'm not sure if you could draw any conclusions regarding energy requirements for folding or sidechain crowding from the five amino acid sequences that were provided - as the tertiary folding of the peptide would be influenced greatly by the flanking amino acids. Regardless, I don't have any alternate suggestions.

Dave,

Surely you could at least bootstrap a la Tanford to get a rough estimate for free energy of folding, no?

My guess (sight unseen) is that many of these peptides are laden with polar residues, and are relatively expensive energetically to maintain.

The article does not indicate whether the scientists utilized any hook and ladder or Statue of Liberty approaches in their analysis.

Can you say a little more about these? Google wasn't much help

.. or am I displaying my newbish-ness by having missed a joke?

'Hook and Ladder' and 'Statue of Liberty' refer to trick plays rarely used in American football, but which were used to great effect in Boise State's recent victory over the much-favored team from Oklahoma University. These are highly humorous, but purely North American-centric references.

About "hook and ladder" and "statue of liberty": what Tom wrote. They're references to American football -- specifically a game involving Boise State.

I'm sure there are lots of reasons why certain peptide sequences are avoided -- Jonathan's being one of them, and Brian presenting another. Regarding steric effects, it would be interesting to have more info on the secondary/tertiary structures of more proteins. I'd bet certain peptide sequences would have different steric effects depending on the region of the protein in which they are involved.

What about simple combinatorics? With 20 amino acids, there would seem to be over 3 million possible sequences of 5. With DNA there would be more, especially if you allow non-coding sequences. Of course, real genomes can be pretty big, but even so, real DNA sequences aren't even close to random. In fact, there are plenty of genes that are conserved (or nearly so) even across multiple species. Given all that, the numbers given look pretty low to me....

By David Harmon (not verified) on 03 Jan 2007
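The combinatorics in David Harmon's comment can be checked with a few lines of Python. The counts of possible sequences are exact. The expected-missing estimate is only a rough sketch under a loudly stated assumption: that every position in the database is an independent, uniformly random draw, which (as the comment itself notes) real sequences are not. The function name `expected_missing` and the database sizes passed to it are hypothetical, not figures from the article.

```python
import math

# Exact counts of possible sequences.
peptide_5mers = 20 ** 5   # 20 amino acids, length 5
dna_15mers = 4 ** 15      # 4 nucleotides, length 15

def expected_missing(alphabet_size, k, n_positions):
    """Expected number of k-mers never seen among n_positions random windows.

    Assumes each window is an independent, uniform draw from all k-mers;
    this is a simplification, not a model of real genomes.
    """
    total = alphabet_size ** k
    # P(a given k-mer is never drawn) = (1 - 1/total) ** n_positions
    p_never = math.exp(n_positions * math.log1p(-1.0 / total))
    return total * p_never

print(peptide_5mers)  # 3200000
print(dna_15mers)     # 1073741824

# Hypothetical database sizes, chosen only to show how the estimate behaves.
for n in (10 ** 7, 10 ** 8, 10 ** 9):
    print(n, expected_missing(20, 5, n))
```

Under the random model the number of missing 5-mers drops off exponentially as the database grows, so any large gap between that estimate and the observed 746 reflects how far real proteins are from random.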

Assuming neutrality, DNA sequence conservation does not persist very long. Aligning noncoding regions across mammals or across Drosophila (let alone between these two taxa) is quite difficult. Amino acid conservation lasts fairly long, but finding the same five consecutive amino acids between two diverse taxa within animals (not even out to non-animal eukaryotes or non-eukaryotes) will be rare.

The fact that these sequences are non-independent (they share a common ancestor) is important for determining the probability of finding a particular sequence at random amongst all available sequences from all species. But by sampling eukaryotes, archaea, and bacteria, that lack of independence shouldn't be as important.

Allow me to ask the layperson's dumb question: what's to be gained from identifying sequences that do not occur in DNA or finding some that may be genomic poison?

Besides the fact that it's an interesting question, it would probably be useful to find sequences to avoid when transforming (genetic engineering) organisms. But the New Scientist article suggests a couple other possibilities:

1. Tags for DNA samples in forensic analysis. If you use a unique sequence to tag a sample from a suspect it reduces the chance it will get mixed up with a sample from the crime scene.

2. Using the "poisonous" amino acid sequences as "self-destruct" buttons for genetically engineered organisms. If you ever wanted to destroy all the organisms you'd turn on a gene with that sequence (apply an environmental stimulus that activates expression of the gene).

My one complaint ... they didn't even test their hypothesis ... and such an easy one to test at that. So now tons of money will be poured into this (from the DOD) before their ideas are validated ONCE? WTF?

apalazzo: The grant they already have, by the article, is on the development of the use of "primes" to tag DNA samples without contaminating them meaningfully.

This seems reasonable, and seems at least moderately supported already; after all, if those sequences were found in any known human, well, then they wouldn't be on the list. ;)

By Michael Ralston (not verified) on 05 Jan 2007

It would be interesting to see if the 746 sequences of 5 amino acids that are not present in any known polypeptide are also missing from synthetic phage display libraries containing random heptamers.