 I'm always learning something from the readers of the Loom. Yesterday, I wrote about how scientists had inserted their names into a synthetic genome, and how such signatures would erode away like graffiti inside real organisms. But how about the opposite case--what if evolution has produced sequences of DNA that happen to form words?
In the comment thread, Peter Ellis asked,
What actually is the longest word (in any language) encoded by the reference human genome? If I had the time and computer power I'd have a look...
Guesstimate - it'll be somewhere in the 4-5 letter range, depending on letter frequency in the target language.
Bear in mind the rules of this game...the letters are the amino acids specified by codons (three bases of DNA). There are 20 amino acids in most living things, so you can't spell every word--or you can use alternatives, like using V for U. (Here's a table.)
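For anyone who wants to check a word before searching, the rules above can be sketched in a few lines of Python. This is only an illustration: the function name `to_peptide` and the substitution table are hypothetical, and the only swap taken from the post is V for U.

```python
# The 20 standard one-letter amino acid codes (no B, J, O, U, X or Z).
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical substitution table; only the U -> V swap comes from the post.
SUBSTITUTIONS = {"U": "V"}


def to_peptide(word, allow_substitutions=True):
    """Spell `word` in amino acid letters, or return None if it can't be done."""
    letters = []
    for ch in word.upper():
        if ch in STANDARD:
            letters.append(ch)
        elif allow_substitutions and ch in SUBSTITUTIONS:
            letters.append(SUBSTITUTIONS[ch])
        else:
            # This letter has no standard amino acid and no allowed stand-in.
            return None
    return "".join(letters)


print(to_peptide("guesstimate"))
print(to_peptide("Zimmer"))
```

Purists who reject substitutions can pass `allow_substitutions=False`, in which case any word containing a U is simply unspellable.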
Ron then replied:
Just wander over to NCBI and blast to your heart's content. Taking "gvesstimate" (note the classical spelling) and checking against the protein refseq database finds:
>ref|NP_939322.1| Putative peptide ABC transport system ATP-binding protein [Corynebacterium diphtheriae NCTC 13129]
Length=560
GENE ID: 2649530 DIP0959 | protein coding [Corynebacterium diphtheriae NCTC 13129] (10 or fewer PubMed links)
Score = 26.1 bits (54), Expect = 215, Method: Composition-based stats.
Identities = 9/11 (81%), Positives = 10/11 (90%), Gaps = 0/11 (0%)
Query 1 GVESSTIMATE 11
GVESS I+ATE
Sbjct 278 GVESSEILATE 288
(sorry about the lack of proper formatting)
Knock yourself out. I do have vague recollections of someone doing something similar a long time ago, when the database was much, much smaller.
I had not heard about anyone trying this before, but it sounds like a lot of fun. I'm a complete novice when it comes to reading genomes with BLAST, so I won't try. But if anyone wants to post the longest word they can find, let's see what you get. (Maybe I'll get my word-guru brother to team up with a geneticist...that would be interesting.)
If you think about it, life on Earth is probably coming up with stray words in its many genomes, which then turn to gibberish (to our eyes), only to produce new words for us to find. The four-billion-year word search, as it were.
Update: Stephen Matheson offers easy step-by-step instructions. Thanks! Without a Z in the genetic code, I can't make an egotistic search for Zimmer. But here's Darwin lurking in bacteria.
 
ROFL! Oh gawd, don't tell Dembski!!
After his little infatuation with 'The Bible Code', he will be predicting the Apocalypse from Mus musculus sequence LOL!!
I don't know how to link to it directly, but people unfamiliar with the single letter amino acid abbreviations can consult the table here: http://en.wikipedia.org/wiki/Amino_acid
Better yet is IUPAC's version (as it has a few more letters and it is the official list):
http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
Unfortunately, no official "O" or "J" (although hydroxyproline is often symbolized O).
Dembski is in the Drosophila genome! (I know, some people were hoping for rat)
>ref|NP_649329.2| CG7177 CG7177-PA, isoform A [Drosophila melanogaster]
Length=2352
GENE ID: 40391 CG7177 | CG7177 [Drosophila melanogaster] (Over 10 PubMed links)
Score = 24.4 bits (50), Expect = 476, Method: Composition-based stats.
Identities = 6/7 (85%), Positives = 7/7 (100%), Gaps = 0/7 (0%)
Query 1 DEMBSKI 7
DEM+SKI
Sbjct 170 DEMDSKI 176
(It's actually a perfect match because B is either aspartic acid or asparagine)
well, if I did it right, "EVOLUTION" is there (well, actually "EUOLUTION")
though I'm not really sure what I'm doing
sorry I meant "well, actually 'EVOLVTION'"
Well, who said that you have to BLAST against actual proteins? So why not just use TBLASTN instead, this should give a lot more results!
For the peace of mind of Alex Palazzo of The Daily Transcript, I checked both RESHAVEN and UNRESHAVEN - no hit.
Well, I'm just getting started on this, but I wanted to make sure I don't get scooped on my first big finding: CARL yields 100 hits in the human proteome, the vast majority of which are in the "immunoglobulin heavy chain variable region." How are your allergies, Carl?
(Oh, and STEVE is there too. But not STEPHEN. Bummer.)
I posted about this on our blog (though the server is down today ARGH), I used to play this game all the time.
Haven't found any in the human genome yet, but SEARCH and CHANGE are in the other genomes (zebrafish and a bacterium).
I'm a purist though, I believe you can ONLY use letters in the code. NO substitutions.
When I did my Ph.D. research, I found an 8-letter word once in my sequences (I had to do them the old-fashioned, early-1990s way).
Oh, and I just remembered. We found an entire sentence once, well it was only four words I think and they were separated by other AA code letters.. but I read it!
But it's been over a decade. I really should go back to my stuff and see if I can find my notes.
Interesting point about words fading in and out of the genolexicon. With the advent of synthetic genomes, I am wondering if "they" will select/design ultra-high fidelity DNA polymerases to avoid mutation. Proof-reading is pretty desirable if you have a (myco-)bacterium that is already optimized to produce ethanol or pharmaceutical compounds.
Then you would have your static dictionary and built-in language police.
Found:
Darwin, Haldane, Fisher, Calvin
Not found:
Wallace, Dawkins, Dennett
Hmmm.
Fun! I found my last name in the human gene for DiGeorge syndrome, and in a couple of bacterial things. I tried my full name (first+last) but that is nowhere to be found. Although I found it with some conserved mutations in one of the bacterial genomes, so one could probably mutate that to contain my name without much effect.
Nice game!
Found SATAN in Leishmania, Trypanosoma, Vibrio cholerae, Trichomonas vaginalis, Yersinia, Listeria, Nicotiana tabacum, Vitis vinifera, Salmonella, Schistosoma, Anopheles gambiae, Aedes aegypti, Plasmodium vivax and ...
Homo sapiens!
(No sign of JESVS though)
Just thought to mention (and I'm sure this has been done elsewhere) that Yale's new biochem building has the structure of tyrosine-alanine-leucine-glutamate across the front portal: YALE.
Sorry Jesus, there is of course no J in the code, and IESVS is also almost omnipresent.
Here's an interesting aside about the message in the Jurassic Park sequences...
http://www.ncbi.nlm.nih.gov/Class/FieldGuide/problem_set.html#BLAST
Everyone knows there is Elvis in the database, right? There is Presley, too, but they aren't together.
This game has been around for a long time. I remember reading a short article in Trends in Biochemical Sciences about 15 years ago, in which someone had compared the protein database SwissProt with the Oxford English Dictionary. If I remember correctly, the longest word they came up with was "ENSILISTS" (people who make silage).
There are more (outdated) examples at this link.
Hauke - there's no J in the standard 20-letter code, but J is sometimes used for (leucine OR isoleucine). In fact all 26 letters are used: B=(asparagine OR aspartate), O=pyrrolysine, U=selenocysteine, X=unknown, Z=(glutamate OR glutamine). However, you won't find these in protein sequence databases.
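Since the ambiguity letters don't appear in the databases themselves, one way to use them in a query is to expand each into the residues it stands for. The sketch below does that with a regular expression; `word_to_pattern` is a made-up helper name, only B, Z and J are expanded (per the definitions above), and every other letter is matched literally.

```python
import re

# Ambiguity codes and the standard residues they can stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "J": "IL"}


def word_to_pattern(word):
    """Compile a regex in which B, Z and J match either of their residues."""
    parts = []
    for ch in word.upper():
        if ch in AMBIGUOUS:
            parts.append("[" + AMBIGUOUS[ch] + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = word_to_pattern("DEMBSKI")
print(pattern.pattern)
print(bool(pattern.search("AAADEMDSKIAAA")))
```

Under this reading, the DEMDSKI hit reported earlier in the thread counts as an exact match for DEMBSKI.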
And let's not forget non-English words...
I'm glad there's no O in the standard single letter amino acid code. If there were, most likely the creationists would emphasize how many times "GOD" appears in the human genome, selectively ignoring all the other names that also appear with statistical predictability.
Hmmm..on the Jesus ones, you might want to try YSUA since that is the Hebrew for His name, more or less....
I think this game would be far more entertaining if we could divorce the specific amino acids from the letters already assigned to them, since it is sort of arbitrary that they would have certain names in English/Latin... and do some sort of statistical search for how many times certain words would show up under all possible correlations of the various amino acids with all possible letters....
Then you could search for SENTENCES and stuff, because the playing field would be much, much larger. And waste lots of processing time on it too :)
I found a pretty conserved GAYPIMP motif in one of my alignments once. Hilarity ensued.
All this is indeed old hat! Even more fun was generating things that could be encoded in DNA, but just hadn't been found ... yet. Over 20 years ago when I was at Berkeley I made a poster for the Molecular Biology department's annual Follies describing PLAV (Pinko Liberal Associated Virus), which encoded a protein with the sequence "SATANLIVESALLHAILSATAN". This 22 amino acid string was encoded by a 66 nucleotide motif, which was repeated 6 times and was therefore referred to as the 6(66) motif.
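The arithmetic is easy to check by back-translating the peptide, three bases per amino acid. The sketch below picks one valid codon per residue from the standard genetic code; that choice is arbitrary and is an assumption of this example, not the sequence from the original poster.

```python
# One codon per amino acid from the standard genetic code.
# Most residues have several synonymous codons; this pick is arbitrary.
CODON_FOR = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
}


def back_translate(peptide):
    """Return one DNA sequence that encodes `peptide` (raises KeyError on non-standard letters)."""
    return "".join(CODON_FOR[residue] for residue in peptide.upper())


motif = "SATANLIVESALLHAILSATAN"
dna = back_translate(motif)
print(len(motif), len(dna))
```

Any other choice of synonymous codons would give a different 66-base sequence encoding the same 22 letters.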
Still no hits, but "Life is too short, and DNA too long."
As has been noted, this is an old game indeed. You can read more about it in these articles and their descendants...
More Protein Talk
David Jones
PMID 8441464
Stephen Harvey had a letter to Nature on the topic in 1993.
There is also a brief section about it in:
Origin of Life: The 5th OPTION by Bryant M. Shiller.
In the same thematic space, Clelland had a Nature paper about hiding messages in DNA microdots.
"LAMARCK" is in there.
Just happened upon www's citing my book "Origin of Life: The 5th OPTION" in post #23 - The Genome as Word Puzzle... My book is the first systems engineering solution to the origin-of-life challenge. In reverse engineering the system of bio-life on our planet, I conclude that not only was bio-life a "rationally" designed system (as opposed to ID's creationist intelligent design), but that the only possible purpose it could serve is in the capacity of information time capsule. I suggest that a library of extraneous information (not derived from or having anything to do with genomics) is hidden somewhere within intron DNA waiting for some geneticist with an information theory background to discover it. Ironically, this signifies the primacy of genotype over phenotype, i.e. - the chicken is an egg's way of producing another egg - serving nothing more than as a sophisticated xerox machine, if you get my drift. The physical phenotype is only the "MATRIX" housing the information medium DNA in the service of the extraneous information encoded therein.