Pharyngula

The genome is not a computer program

The author of All-Too-Common Dissent has found a bizarre creationist on the web; this fellow, Randy Stimpson, isn’t at all unusual, but he does represent well some common characteristics of creationists in general: arrogance, ignorance, and projection. He writes software, so he thinks we have to interpret the genome as a big program; he knows nothing about biology; and he thinks his expertise in an unrelated field means he knows better than biologists. And he freely admits it!

I am not a geneticist or a molecular biologist. In fact, I only know slightly more about DNA than the average college educated person. However, as a software developer I have a vague idea of how many bytes of code is needed to make complex software programs. And to think that something as complicated as a human being is encoded in only 3 billion base pairs of DNA is astounding.

Wow. I know nothing about engine repair, but if I strolled down to the local garage and tried to tell the mechanics that a car was just like a zebrafish, and you need to throw a few brine shrimp in the gas tank now and then, I don’t think I would be well-received. Creationists, however, feel no compunction about expressing comparable inanities.

I actually have some background as a software developer — I wrote some lab automation and image processing software that was marketed by Axon Instruments for several years — and I can tell you as someone with feet in both worlds that the genome is nothing like a program. The hard work of cellular activity is done via the chemistry of molecular interactions in the cytoplasm, and the genome is more like a crudely organized archive of components. It’s probably (analogies are always dangerous) better to think of gene products as like small autonomous agents that carry out bits of chemistry in the economy of the cell. There is no central authority, no guiding plan. Order emerges in the interactions of these agents, not by an encoded program within the strands of DNA.

I’d also add that the situation is very similar in multicellular organisms. Cells are also semi-independent automata that interact through a process called development in the absence of any kind of overriding blueprint. There is nothing in your genome that says anything comparable to “make 5 fingers”: cells tumble through coarsely predictable patterns of interactions during which that pattern emerges. “5-fingeredness” is not a program, it is not explicitly laid out anywhere in the genome, and it cannot be separated from the contingent chain of events involved in limb formation.

That’s a difficult and abstract concept that’s hard to get across to students who are seriously studying the subject, let alone ignorant creationists who have no awareness of the biology. This guy, though, knows one thing and one thing only — how to write software — and digs his hole deeper and deeper.

To be more specific, since DNA alphabet consists of 4 nucleobases, we can represent a nucleobase with 2 bits data. This means that 4 base pairs can be represented by a byte of data and approximately 4 million base pairs can be represented by a megabyte of data. This means that the entire human genome can be represented by only 750MB of code. From my experience as a software developer, this would have to be highly efficient code. To suggest that 97% of DNA is junk implies the implausible — that less than 23MB of DNA is not junk. By comparison, Microsoft Word has a size of 12MB.
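The storage arithmetic in that passage is easy enough to reproduce. What follows is only a sketch of the tally as quoted: the constant and function names are mine, and I’m assuming decimal megabytes, since that is what gives the 750MB figure. It measures how much disk space the sequence occupies and nothing more.

```python
# Back-of-the-envelope storage tally using the figures from the quoted passage.
# Names below are illustrative; decimal megabytes are an assumption.
BASE_PAIRS = 3_000_000_000   # "3 billion base pairs", as quoted
BITS_PER_BASE = 2            # 4 possible bases -> 2 bits each
BYTES_PER_MB = 1_000_000     # decimal megabytes


def genome_megabytes(base_pairs=BASE_PAIRS):
    """Megabytes needed to store the raw sequence at 2 bits per base."""
    total_bytes = base_pairs * BITS_PER_BASE / 8
    return total_bytes / BYTES_PER_MB


total_mb = genome_megabytes()
non_junk_mb = total_mb * (1 - 0.97)   # the "97% junk" figure from the quote

print(f"whole sequence: {total_mb:.0f} MB")
print(f"3% of it:       {non_junk_mb:.1f} MB")
```

The numbers come out as the quote says, which tells you the division was done correctly and nothing about whether a sequence archive behaves like compiled code.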

The genome is not code, efficient or otherwise. Sure, you can tally up the bits needed to store the sequence in a database, but that is not the same as saying you’ve got the complete information for an organism, or that you have captured the “code” that can be executed to build it. Rather than realizing that maybe his analogy is faulty because it leads to conclusions he finds unlikely, this creationist is so convinced of the accuracy of his analogy that when he finds it leads to incomprehensible results, he decides that biology and the reality of the genome must be wrong.

I think it’s more probable that the human DNA which we have discovered so far doesn’t contain all the information required to produce humans. I wouldn’t be suprised if more DNA, or some other kind of information, is discovered some time in the future.

Many of you may have seen this infamous creationist quote, which is a perfect example of an oblivious ignoramus overlooking the obvious.

One of the most basic laws in the universe is the Second Law of Thermodynamics. This states that as time goes by, entropy in an environment will increase. Evolution argues differently against a law that is accepted EVERYWHERE BY EVERYONE. Evolution says that we started out simple, and over time became more complex. That just isn’t possible: UNLESS there is a giant outside source of energy supplying the Earth with huge amounts of energy. If there were such a source, scientists would certainly know about it.

Stimpson hasn’t said anything quite that stupid, but it’s only because biology and developmental biology are so much more subtle and harder to observe and understand than the existence of a giant thermonuclear furnace burning furiously 93 million miles away. There is no significant source of extra DNA, but there is additional information generated by the activities of cells during ontogeny. This concept, that the starting material is not the complete final product, but that it requires ongoing input from the environment and from continuing negotiation and activity within the starting material to generate novel features, has only been around for about 2300 years, since at least Aristotle, so I guess I shouldn’t be surprised that a creationist would be a few millennia behind. The concept is called epigenesis. It’s essential to understanding how a genome generates an organism, and you shouldn’t try to force your analogy onto biology if you don’t understand it.

But wait! Ignorance is no obstacle to a devout creationist, and Mr Stimpson continues his headlong descent into unchecked failure in another post, in which he tries to claim that there is negligible junk DNA.

Now there are 210 know cell types in the human body. I’ll assume that each cell type requires at least 1MB of information. These cell types share a lot of common features so I’ll assume there is a lot of common information. Just how much of the information is shared between these cell types is a guess. I am going to assume that 90% of the information in each cell type is shared and 10% is unique. This means that 210 cell types require 1MB + 209 * .1MB of information. Rounding this implies that there is at least 22MB of information in the human genome.

None of this makes any sense whatsoever.

Where does Mr Stimpson get this magic number of 1MB of information for a cell type? He seems to have pulled it out of his butt.

What does he mean by “information”? He blithely equates the information in a cell with a measure of the number of nucleotides in its DNA. This is not valid. Cells have developmental histories that are essential elements in describing their state.

This “210 cell types” number is a widely used value that was taken from a 1960s paper that itself was making only a broad guess from descriptions in histology texts. I’ve griped about this oft-used and ultimately bogus number before. I can’t blame Stimpson for using it, but it really needs to be purged from the literature.

I don’t know what the hell he’s babbling about when he tries to partition subsets of the genome into unique stuff for different cell types. It doesn’t work that way! The entire genome is present in every cell (with some narrow exceptions), and genes get reused in multiple functions in multiple cell types. His whole conclusion is a beautiful example of garbage in, garbage out.
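To make the garbage-in, garbage-out point concrete, here is a sketch of the calculation from the quoted passage with its guessed inputs exposed as parameters. The function name and the alternative values in the loop are mine, chosen only for illustration; none of them is a measurement of anything.

```python
def stimpson_estimate(cell_types=210, mb_per_type=1.0, unique_fraction=0.10):
    """The quoted formula: one full cell type plus the 'unique' share of the rest.

    All three defaults are the guesses from the quoted passage, not data.
    """
    return mb_per_type + (cell_types - 1) * unique_fraction * mb_per_type


# The quoted inputs reproduce the quoted answer (which he rounds to 22MB).
print(f"as quoted: {stimpson_estimate():.1f} MB")

# Swap in other equally unsupported guesses and the "result" moves with them.
for size in (0.1, 1.0, 10.0):
    for unique in (0.01, 0.10, 0.50):
        estimate = stimpson_estimate(mb_per_type=size, unique_fraction=unique)
        print(f"size={size:>4} MB  unique={unique:.0%}  ->  {estimate:8.2f} MB")
```

The output spans several orders of magnitude depending on which numbers you make up, which is what you would expect from a formula whose every input is a guess.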

Let’s see how much deeper into the muck this guy can sink…

But this is just the information needed to construct the different cell types. More information is needed for spatial orientation and to coordinate activity among cells to perform complex functions like vision, motor control, digestion and tissue repair. Since the most efficient algorithms to just sort n objects have an order of nlog(n) I am tempted to guesstimate by multiplying 22MB by log(210) to get a lower bound. But that would be bad applied math and just plain lazy. But then again I am not exactly getting paid to do this (wink).

This is a transparent revelation of his biases. He thinks there needs to be some nuclear authority that specifies higher level activities like tissue repair and vision — there isn’t! There is no map. There is no boss. There is no blueprint. Vision is an emergent property of cells and proteins interacting in development, using tools shaped by four billion years of history. Your preconceptions are not data.

It isn’t a Stimpson post without some statement that is jaw-droppingly obvious.

I can think of two other approaches that could be taken. For one of them I need some data points. In particular I need size data about genomes of the simplest multicellular life forms that are well studied and believed not to have junk.

You mean, you ought to have some data underlying your speculations? Whoa. Who could have imagined that.

This is exactly what biologists have been doing for the past century: gathering data and building explanations from the evidence. Wouldn’t anyone with any sense recognize that this is how science proceeds?

Oh, and this whole series of posts has been written because Stimpson doesn’t like the idea of junk DNA, a concept that creationists commonly anguish over. Again, it’s the evidence that supports the idea of junk DNA, and creationist ignorance does not counter it.

Let’s start with something relatively simple for Mr Stimpson. Look up LINEs and SINEs. Long Interspersed Nuclear Elements (LINEs) are pieces of DNA that code for an enzyme that copies RNA (including the RNA for itself) back into the genome. SINEs (Short Interspersed Nuclear Elements) are shorter sequences that don’t code for a functional protein, but their RNA is recognized by the LINEs and gets copied back. These sequences do not play a specific role in the economy of the cell, although they do certainly represent a generic drain on cellular metabolism. Mr Stimpson should try to explain the function of these sequences.

Then he can try to explain why his DNA contains 870,000 copies of LINEs (taking up 20% of his genome) and 1.5 million copies of SINEs. I’d also like to see his software development analogy to these. Back when I was writing software, I don’t recall writing small segments of self-modifying, recursive code and sprinkling them throughout my program…or perhaps more accurately, writing software that was one-third auto-loading noisemakers, with a few words of functional code sprinkled among them.
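For a sense of scale, the LINE figures can be converted into the same megabyte units Stimpson likes. This is a rough sketch using only the rounded numbers given in this post (3 billion base pairs, 870,000 copies, 20% of the genome) and his own 2-bits-per-base encoding. The variable names are mine, and the per-copy length is just a derived average, not a claim about how long any individual element is.

```python
# Rounded figures taken from this post; variable names are illustrative.
GENOME_BP = 3_000_000_000   # "3 billion base pairs"
LINE_COPIES = 870_000       # copies of LINEs
LINE_FRACTION = 0.20        # share of the genome occupied by LINEs
STIMPSON_MB = 22            # his rounded estimate of necessary "information"

line_bp = GENOME_BP * LINE_FRACTION       # base pairs occupied by LINEs
avg_copy_bp = line_bp / LINE_COPIES       # derived average length per copy
line_mb = line_bp * 2 / 8 / 1_000_000     # 2 bits per base, decimal MB

print(f"LINE sequence:       {line_bp / 1e6:.0f} million bp")
print(f"average per copy:    about {avg_copy_bp:.0f} bp")
print(f"in his encoding:     {line_mb:.0f} MB")
print(f"vs. his 22MB figure: {line_mb / STIMPSON_MB:.1f}x")
```

By his own bookkeeping, the LINEs alone take up several times more space than everything he estimates a human needs.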