Basics: What is a gene?
Category: Genetics • Molecular Biology • Science
Posted on: January 16, 2007 1:28 PM, by PZ Myers
I mulled over some of the suggestions in my request for basic topics to cover, and I realized that there is no such thing as a simple concept in biology. Some of the ideas require a lot of background in molecular biology, others demand an understanding of the philosophy of science, and what I am interested in is teetering way out at the edge of what we know, where definitions often start to break down. Sorry, I have to give up.
Seriously, though, I think that what does exist are simple treatments of complex subjects, so that is what I'm aiming for here: I talk a lot about genes, so let's just step way back and give a useful definition of a gene. I admit right up front, though, that there are two limitations: I'm going to give a very simplified explanation that fits with a molecular genetics focus (pure geneticists define genes very differently), and I'm going to talk only about eukaryotic/metazoan genes. I tell you right now that if I asked a half dozen different biologists to help me out with this, they'd rip into it and add a thousand qualifiers, and it would never get done. So let's plunge in and see what a simple version of a gene is.
First, let me cite a single source that I used to pull this out: Modern Genetic Analysis: Integrating Genes and Genomes by Anthony J.F. Griffiths, Richard C. Lewontin, Jeffrey H. Miller, and William M. Gelbart. It's an excellent genetics textbook, well worth the $116 if you've got the loose change jangling about. Here's their definition of a gene:
A gene is an operational region of the chromosomal DNA, part of which can be transcribed into a functional RNA at the correct time and place during development. Thus, the gene is composed of the transcribed region and adjacent regulatory regions.
So we have long strings of DNA organized into chromosomes in each of our cells, and certain portions of that DNA will be copied or transcribed into RNA strands by various proteins in the nucleus. Which parts will be transcribed will depend in part on what proteins are present in a particular cell; the proteins have to bind to specific regions in the DNA to initiate the protein machinery to do the work of copying, and that machinery also recognizes certain regions of the DNA as places to stop copying. We have approximately 25,000 genes; the emphasis is on the "approximately" because one of the ways we identify genes is by looking for the punctuation marks of the start and stop regions, and there's a lot of random punctuation scattered throughout the genome. The hypothetical designer must be a very poor copy editor.
Here's a simple picture of a eukaryotic gene.

It has a few general parts. It's on a strand of DNA, which you'll have to imagine going off the screen to the left and right for a few miles in either direction. There is a regulatory region for transcription initiation (more about that in a little bit) which, if we include various enhancers and repressors, may stretch for many thousands of base pairs, with important short areas for regulation scattered throughout; one serious flaw with this diagram is its scale, since in a real gene the regulatory regions comprise roughly twice as much DNA as the coding regions.
The part of the gene that is actually transcribed is broken up into regions called introns and exons. Introns aren't going to be part of the final gene product, usually; enzymes are going to cut them out of the RNA and splice together those dark green exons to make the final functional RNA.
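If it helps to see the cut-and-paste step as code, here is a minimal Python sketch of splicing. The toy transcript and the exon coordinates are invented for illustration and come from no real gene, and the cell finds its splice sites from sequence signals with a large molecular machine, not from a handy list of coordinates.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices of a transcript, discarding everything between them.

    `exons` holds hypothetical (start, end) pairs, 0-based with an exclusive end.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcript (made up): uppercase = exons, lowercase = introns.
# The introns begin with "gu" and end with "ag", as most real introns do.
pre_mrna = "AUGGCCguaaguuucagGAUUACguaagcccuagUGA"
exons = [(0, 6), (17, 23), (34, 37)]

mature = splice(pre_mrna, exons)
print(mature)  # AUGGCCGAUUACUGA
```

The introns are simply never copied into the result, which is all "cut them out and splice together the exons" means at this level of simplification.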
Let's look at a specific example of a gene. The Online Mendelian Inheritance in Man database makes it easy to look up human genes with known functional roles, and I arbitrarily picked CFTR, the cystic fibrosis transmembrane conductance regulator. Follow that link, and you'll learn far more than you ever wanted to about this gene that transports ions across cell membranes, and which is responsible for cystic fibrosis when it fails to work. It's not basic, I'm afraid.
One of the things you can do from that gene entry, though, is take a look at a graphic portrayal of the gene on the chromosome. CFTR is one gene among many on the long arm of chromosome 7. You can also find the nucleotide sequence for the coding region here. Isn't this informative?
1 aattggaagc aaatgacatc acagcaggtc agagaaaaag ggttgagcgg caggcaccca
61 gagtagtagg tctttggcat taggagcttg agcccagacg gccctagcag ggaccccagc
121 gcccgagaga ccatgcagag gtcgcctctg gaaaaggcca gcgttgtctc caaacttttt
181 ttcagctgga ccagaccaat tttgaggaaa ggatacagac agcgcctgga attgtcagac
241 atataccaaa tcccttctgt tgattctgct gacaatctat ctgaaaaatt ggaaagagaa
301 tgggatagag agctggcttc aaagaaaaat cctaaactca ttaatgccct tcggcgatgt
361 tttttctgga gatttatgtt ctatggaatc tttttatatt taggggaagt caccaaagca
421 gtacagcctc tcttactggg aagaatcata gcttcctatg acccggataa caaggaggaa
481 cgctctatcg cgatttatct aggcataggc ttatgccttc tctttattgt gaggacactg
541 ctcctacacc cagccatttt tggccttcat cacattggaa tgcagatgag aatagctatg
601 tttagtttga tttataagaa gactttaaag ctgtcaagcc gtgttctaga taaaataagt
661 attggacaac ttgttagtct cctttccaac aacctgaaca aatttgatga aggacttgca
721 ttggcacatt tcgtgtggat cgctcctttg caagtggcac tcctcatggg gctaatctgg
781 gagttgttac aggcgtctgc cttctgtgga cttggtttcc tgatagtcct tgcccttttt
841 caggctgggc tagggagaat gatgatgaag tacagagatc agagagctgg gaagatcagt
901 gaaagacttg tgattacctc agaaatgatt gaaaatatcc aatctgttaa ggcatactgc
961 tgggaagaag caatggaaaa aatgattgaa aacttaagac aaacagaact gaaactgact
1021 cggaaggcag cctatgtgag atacttcaat agctcagcct tcttcttctc agggttcttt
1081 gtggtgtttt tatctgtgct tccctatgca ctaatcaaag gaatcatcct ccggaaaata
1141 ttcaccacca tctcattctg cattgttctg cgcatggcgg tcactcggca atttccctgg
1201 gctgtacaaa catggtatga ctctcttgga gcaataaaca aaatacagga tttcttacaa
1261 aagcaagaat ataagacatt ggaatataac ttaacgacta cagaagtagt gatggagaat
1321 gtaacagcct tctgggagga gggatttggg gaattatttg agaaagcaaa acaaaacaat
1381 aacaatagaa aaacttctaa tggtgatgac agcctcttct tcagtaattt ctcacttctt
1441 ggtactcctg tcctgaaaga tattaatttc aagatagaaa gaggacagtt gttggcggtt
1501 gctggatcca ctggagcagg caagacttca cttctaatgg tgattatggg agaactggag
1561 ccttcagagg gtaaaattaa gcacagtgga agaatttcat tctgttctca gttttcctgg
1621 attatgcctg gcaccattaa agaaaatatc atctttggtg tttcctatga tgaatataga
1681 tacagaagcg tcatcaaagc atgccaacta gaagaggaca tctccaagtt tgcagagaaa
1741 gacaatatag ttcttggaga aggtggaatc acactgagtg gaggtcaacg agcaagaatt
1801 tctttagcaa gagcagtata caaagatgct gatttgtatt tattagactc tccttttgga
1861 tacctagatg ttttaacaga aaaagaaata tttgaaagct gtgtctgtaa actgatggct
1921 aacaaaacta ggattttggt cacttctaaa atggaacatt taaagaaagc tgacaaaata
1981 ttaattttgc atgaaggtag cagctatttt tatgggacat tttcagaact ccaaaatcta
2041 cagccagact ttagctcaaa actcatggga tgtgattctt tcgaccaatt tagtgcagaa
2101 agaagaaatt caatcctaac tgagacctta caccgtttct cattagaagg agatgctcct
2161 gtctcctgga cagaaacaaa aaaacaatct tttaaacaga ctggagagtt tggggaaaaa
2221 aggaagaatt ctattctcaa tccaatcaac tctatacgaa aattttccat tgtgcaaaag
2281 actcccttac aaatgaatgg catcgaagag gattctgatg agcctttaga gagaaggctg
2341 tccttagtac cagattctga gcagggagag gcgatactgc ctcgcatcag cgtgatcagc
2401 actggcccca cgcttcaggc acgaaggagg cagtctgtcc tgaacctgat gacacactca
2461 gttaaccaag gtcagaacat tcaccgaaag acaacagcat ccacacgaaa agtgtcactg
2521 gcccctcagg caaacttgac tgaactggat atatattcaa gaaggttatc tcaagaaact
2581 ggcttggaaa taagtgaaga aattaacgaa gaagacttaa aggagtgctt ttttgatgat
2641 atggagagca taccagcagt gactacatgg aacacatacc ttcgatatat tactgtccac
2701 aagagcttaa tttttgtgct aatttggtgc ttagtaattt ttctggcaga ggtggctgct
2761 tctttggttg tgctgtggct ccttggaaac actcctcttc aagacaaagg gaatagtact
2821 catagtagaa ataacagcta tgcagtgatt atcaccagca ccagttcgta ttatgtgttt
2881 tacatttacg tgggagtagc cgacactttg cttgctatgg gattcttcag aggtctacca
2941 ctggtgcata ctctaatcac agtgtcgaaa attttacacc acaaaatgtt acattctgtt
3001 cttcaagcac ctatgtcaac cctcaacacg ttgaaagcag gtgggattct taatagattc
3061 tccaaagata tagcaatttt ggatgacctt ctgcctctta ccatatttga cttcatccag
3121 ttgttattaa ttgtgattgg agctatagca gttgtcgcag ttttacaacc ctacatcttt
3181 gttgcaacag tgccagtgat agtggctttt attatgttga gagcatattt cctccaaacc
3241 tcacagcaac tcaaacaact ggaatctgaa ggcaggagtc caattttcac tcatcttgtt
3301 acaagcttaa aaggactatg gacacttcgt gccttcggac ggcagcctta ctttgaaact
3361 ctgttccaca aagctctgaa tttacatact gccaactggt tcttgtacct gtcaacactg
3421 cgctggttcc aaatgagaat agaaatgatt tttgtcatct tcttcattgc tgttaccttc
3481 atttccattt taacaacagg agaaggagaa ggaagagttg gtattatcct gactttagcc
3541 atgaatatca tgagtacatt gcagtgggct gtaaactcca gcatagatgt ggatagcttg
3601 atgcgatctg tgagccgagt ctttaagttc attgacatgc caacagaagg taaacctacc
3661 aagtcaacca aaccatacaa gaatggccaa ctctcgaaag ttatgattat tgagaattca
3721 cacgtgaaga aagatgacat ctggccctca gggggccaaa tgactgtcaa agatctcaca
3781 gcaaaataca cagaaggtgg aaatgccata ttagagaaca tttccttctc aataagtcct
3841 ggccagaggg tgggcctctt gggaagaact ggatcaggga agagtacttt gttatcagct
3901 tttttgagac tactgaacac tgaaggagaa atccagatcg atggtgtgtc ttgggattca
3961 ataactttgc aacagtggag gaaagccttt ggagtgatac cacagaaagt atttattttt
4021 tctggaacat ttagaaaaaa cttggatccc tatgaacagt ggagtgatca agaaatatgg
4081 aaagttgcag atgaggttgg gctcagatct gtgatagaac agtttcctgg gaagcttgac
4141 tttgtccttg tggatggggg ctgtgtccta agccatggcc acaagcagtt gatgtgcttg
4201 gctagatctg ttctcagtaa ggcgaagatc ttgctgcttg atgaacccag tgctcatttg
4261 gatccagtaa cataccaaat aattagaaga actctaaaac aagcatttgc tgattgcaca
4321 gtaattctct gtgaacacag gatagaagca atgctggaat gccaacaatt tttggtcata
4381 gaagagaaca aagtgcggca gtacgattcc atccagaaac tgctgaacga gaggagcctc
4441 ttccggcaag ccatcagccc ctccgacagg gtgaagctct ttccccaccg gaactcaagc
4501 aagtgcaagt ctaagcccca gattgctgct ctgaaagagg agacagaaga agaggtgcaa
4561 gatacaaggc tttagagagc agcataaatg ttgacatggg acatttgctc atggaattgg
4621 agctcgtggg acagtcacct catggaattg gagctcgtgg aacagttacc tctgcctcag
4681 aaaacaagga tgaattaagt ttttttttaa aaaagaaaca tttggtaagg ggaattgagg
4741 acactgatat gggtcttgat aaatggcttc ctggcaatag tcaaattgtg tgaaaggtac
4801 ttcaaatcct tgaagattta ccacttgtgt tttgcaagcc agattttcct gaaaaccctt
4861 gccatgtgct agtaattgga aaggcagctc taaatgtcaa tcagcctagt tgatcagctt
4921 attgtctagt gaaactcgtt aatttgtagt gttggagaag aactgaaatc atacttctta
4981 gggttatgat taagtaatga taactggaaa cttcagcggt ttatataagc ttgtattcct
5041 ttttctctcc tctccccatg atgtttagaa acacaactat attgtttgct aagcattcca
5101 actatctcat ttccaagcaa gtattagaat accacaggaa ccacaagact gcacatcaaa
5161 atatgcccca ttcaacatct agtgagcagt caggaaagag aacttccaga tcctggaaat
5221 cagggttagt attgtccagg tctaccaaaa atctcaatat ttcagataat cacaatacat
5281 cccttacctg ggaaagggct gttataatct ttcacagggg acaggatggt tcccttgatg
5341 aagaagttga tatgcctttt cccaactcca gaaagtgaca agctcacaga cctttgaact
5401 agagtttagc tggaaaagta tgttagtgca aattgtcaca ggacagccct tctttccaca
5461 gaagctccag gtagagggtg tgtaagtaga taggccatgg gcactgtggg tagacacaca
5521 tgaagtccaa gcatttagat gtataggttg atggtggtat gttttcaggc tagatgtatg
5581 tacttcatgc tgtctacact aagagagaat gagagacaca ctgaagaagc accaatcatg
5641 aattagtttt atatgcttct gttttataat tttgtgaagc aaaatttttt ctctaggaaa
5701 tatttatttt aataatgttt caaacatata taacaatgct gtattttaaa agaatgatta
5761 tgaattacat ttgtataaaa taatttttat atttgaaata ttgacttttt atggcactag
5821 tatttctatg aaatattatg ttaaaactgg gacaggggag aacctagggt gatattaacc
5881 aggggccatg aatcaccttt tggtctggag ggaagccttg gggctgatgc agttgttgcc
5941 cacagctgta tgattcccag ccagcacagc ctcttagatg cagttctgaa gaagatggta
6001 ccaccagtct gactgtttcc atcaagggta cactgccttc tcaactccaa actgactctt
6061 aagaagactg cattatattt attactgtaa gaaaatatca cttgtcaata aaatccatac
6121 atttgtgtga aa
That sequence will be translated into a protein in the cytoplasm of the cell, which will then go on to be incorporated into the membrane, where it will work to regulate the secretion of chloride and other ions. I confess, I'm often not so much interested in what the coding region of the gene does as I am in how the gene is turned on or off in the first place, so let's look at how that's done.
Here's another cartoon of a gene. The green part is the piece of DNA that is to be copied into an RNA transcript, and the piece of protein machinery that is going to do that job is called RNA polymerase, the pink rectangle. RNA polymerase is going to advance sequentially along the DNA, matching each DNA nucleotide on one strand with a complementary RNA nucleotide, and catalyzing the linkage of each RNA nucleotide to its neighbor. RNA polymerase needs to know where to start, though — it doesn't just land on a random part of the genome and start copying away — and it looks for a region called a promoter (in red).
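The copying step itself is simple enough to sketch in a few lines of Python. This is only a cartoon of the base-pairing rule: the function name is made up, and it assumes the template strand is handed over in the order the polymerase reads it, ignoring promoters, stop signals, and everything else that makes real transcription interesting.

```python
# DNA template base -> complementary RNA base.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Build the RNA one nucleotide at a time, as the polymerase would.

    The template strand is given in the order the polymerase reads it (3' to 5'),
    so the RNA comes out in its usual 5' to 3' order.
    """
    rna = []
    for base in template_3_to_5.upper():
        rna.append(PAIRING[base])  # match the base, then link it to the growing chain
    return "".join(rna)


print(transcribe("TACGTCTCC"))  # AUGCAGAGG
```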

Part of the promoter is a relatively simple sequence called the TATA box, because it contains lots of A and T nucleotides. The TATA box is bound by a whole constellation of transcription initiation proteins, though, building up a complex that promotes the binding and activity of RNA polymerase. The DNA itself has a 3-dimensional structure that folds around and allows sequences called enhancers and silencers to play a role in controlling transcription by way of intermediary proteins called activators and repressors. Turning on a gene is a family affair, requiring the participation of many proteins.
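To give a feel for how short and simple that signal is, here is a small Python sketch that scans a stretch of DNA for TATA-like motifs. The pattern is only an approximate consensus, the example sequence is invented, and finding a match in real DNA does not by itself mean you have found a working promoter.

```python
import re

# Approximate TATA-box consensus: TATA, then A or T, then A, then A or T.
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(dna: str) -> list[tuple[int, str]]:
    """Return (0-based position, matched text) for each TATA-like motif."""
    return [(m.start(), m.group()) for m in TATA_BOX.finditer(dna.upper())]


print(find_tata_boxes("GGCTATAAAAGGCGCGTACC"))  # [(3, 'TATAAAA')]
```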

The molecular apparatus controlling transcription in human cells consists of four kinds of components. (The numbered proteins are the names of subunits of RNA Polymerase II. Each subunit is named according to its molecular mass in kilodaltons.) Basal transcription factors (labeled A, B, F, E, H) are essential for transcription but cannot by themselves increase or decrease its rate. That task falls to regulatory molecules known as activators and repressors. Activators, and possibly repressors, communicate with the basal factors through coactivators, proteins that are linked in a tight complex to the TATA-binding proteins, the first of the basal transcription factors to land on the core promoter.
The really complicated part of the diagram above, of course, is that each of those colored blobs is a protein, which is in turn the product of expression of a gene elsewhere in the genome, which has in turn its own promoter and enhancers and silencers. The coding region in this cartoon could, for instance, be for one of the components of that RNA polymerase complex in action here. Genes can make gene products that affect the expression of other genes by binding to the regulatory regions or to the proteins that are involved in the regulatory complex.
One last thing: I also took a look at the other common web source for definitions of basic concepts, Wikipedia. Here's the first line of the Wikipedia entry for "gene":
A gene is the unit of heredity, with each gene determining one inherited feature of an organism.
That is completely wrong. "One gene, one character" is a false idea of the relationship of genes to inheritance, since many genes contribute to the appearance of a single feature, and one gene will play a role in many different features. Apparently, the next basic ideas I should summarize are polygeny and pleiotropy.





Comments
"One gene, one character" is a false idea of the relationship of genes to inheritance
Perhaps we need to compare it to a different sort of language. Anyone know of one where individual characters "do not" denote a specific sound, but only in combination? Can't think of any, unless you included something complicated like Unicode. But, in principle, I mean dealing with concrete meanings, i.e.:
a = means something alone.
e = doesn't.
ae = still doesn't.
ea = still doesn't.
But in each case the "sound" may change and each word that uses those has unique meanings. Sorry, closest I can think of. Though.. Maybe Mayan... I understand someone is finally breaking their knot code. It's a numerical system, sort of binary like, but with patterned knots and reversals, which change the specifics of the meaning and definition, though what some of those are is still unknown. A knot, in and of itself, lacks context, so doesn't even necessarily qualify as a number.
Posted by: Kagehi | January 16, 2007 1:43 PM
One interesting point about gene structure and regulation is that metazoans do things differently than the other large groups of eukaryotic organisms (plants and fungi). For instance, promoter regions in plants seem to be much shorter than promoter regions in animals (depending on where you sit on the definition of promoter, of course). And, of course, fungi have their own little weirdness. :-)
Posted by: Ron | January 16, 2007 1:53 PM
Where are all these definitions and explanatory articles going to be "stored"/made accessible/given a URL?
I think you may have said when the first idea came around, but a repeat - and every time one of these definitions comes up - would be a good idea.
Or are Seed, or someone going to set up a Wiki, with strict editorial controls?
Posted by: G. Tingey | January 16, 2007 1:56 PM
If we think the Wikipedia definition is wrong we should fix it.
Posted by: Rosie Redfield | January 16, 2007 2:06 PM
Cool beans. Nice job, PZ.
Anyone know where this dude can fill out a W-4?
:)
Posted by: Bob | January 16, 2007 2:11 PM
Nice post
Regarding the "one gene -> one character" issue. This brings up an interesting feature of genetics in the modern world. My wife (who sees biology much more from the perspective of the INSIDE of a cell than I do) and I were reviewing the Minn. Science Museum race DVD last night. (Written review coming to a blog near you in a few days). The narration mentioned that the vast majority of genetic differences do not show up as trait differences. I sensed Amanda wince, but actually, we never got to talk about it because we got onto other things (the meaning of genetic variation a la Lewontin and this "85% of the variation happens in your own village" thing...)
If one third of mutations are in silent bases (same amino acid) then that means two thirds are visible. But only if you are looking at the protein.
In other words, from the inside-the-cell perspective, one gene = one trait (but wait, I have a caveat below). We just need to modify our understanding of "trait" to mean, most of the time, "protein".
Caveat: I'm talking about the gene as an expressed entity. Zany things like transcribing a gene backwards or MHC genes that have dynamic genomes and stuff aside....
Posted by: Greg Laden | January 16, 2007 2:39 PM
At the risk of being derided again, PZ, could you please explain to a political scientist who never studied any science, the following :
Can you tell me in descending or ascending order of size the relationship between: atom, nucleus, cell, nucleus, molecule, gene, chromosome, amino acids, proteins and all the others I have forgotten.
I mean, I can read a post like the one on the Hox gene and it makes sense to me as I read but I need to know its connection to the other discrete things I read about. I know most people learn this in elementary school but for me that was 50 years ago in a Convent where Ladies did not Learn that Science stuff. Help.
Posted by: Suezboo | January 16, 2007 2:53 PM
This is fascinating stuff. True enough, the Wikipedia article does describe the gene in terms of "one gene = one feature". From my understanding of PZ's article, this could not be further from the truth. Unfortunately the issue has come up before on the article and a poster named Opabinia regalis removed the portions that explain how a gene can be removed from a phenotypic effect. This occurred on Dec 10th, and went unchallenged by the Wikipedia community...evidence that many people don't understand genes very well.
By all means more informed posters than myself should get in there and rebuild that article. If a person types "gene" into Google, that page is the #1 result. It must be made to be correct...I'll help if I can but I don't think I am the expert here.
Posted by: JohnA | January 16, 2007 2:56 PM
Not even inside the cell, Greg. Mutations in a lot of genes have extensively pleiotropic effects; it rarely makes sense in any way to try to associate one and only one "function" with each protein. (Not to mention that with alternate splicing even the definition of "one expressed entity" gets fuzzy around the edges.) No matter how you slice it, "one gene - one trait" is a hopelessly outdated and inaccurate slogan.
Posted by: Steve LaBonne | January 16, 2007 3:02 PM
The ENSEMBL genome browser gives a much nicer overview (http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000001626). Click one of the links under genomic locations for a zoomable map. I strongly encourage you to play around with all the features and links to other species. Have fun.
Posted by: sparc | January 16, 2007 3:04 PM
One thing that is never made clear in characterizations of genes is: can they be discontinuous? i.e. does the biochemistry permit the following
gene 1 ------ gene 2 ------ gene 1
where the "two" gene 1s contribute to the same RNA strands and gene 2 to others?
Posted by: Keith Douglas | January 16, 2007 3:04 PM
Tend to hang around the Wikipedia articles strictly on Evolution and Gilbert and Sullivan-related subjects, but I've flagged up the problem with the Gene article on its talk page.
Posted by: Adam Cuerden | January 16, 2007 3:07 PM
"each of those colored blobs is a protein, which is in turn the product of expression of a gene elsewhere"
so it's turtles all the way down...
Posted by: Bruce | January 16, 2007 3:10 PM
Keith- I'm not quite clear what your example is trying to say, but do a Google search on trans-splicing. It is possible for a messenger RNA to be stitched together from two separate transcripts. AFAIK most of the known examples are in "weird" organisms like trypanosomes. PZ probably is much more up to date on this than I am and hopefully will comment.
Posted by: Steve LaBonne | January 16, 2007 3:11 PM
But this *is* the functional definition of the gene as portrayed by Dawkins in the Selfish Gene. He essentially argues that biologists derive genes backwards from different phenotypes, and, elsewhere, that a gene is best understood as the potential difference between two otherwise identical organisms:
The metaphor matters. It must map to the more scientific definition; otherwise, popular science books like Selfish Gene are inherently deceptive. I have to assume that Dawkins "believes" in "one gene, one character," even if he "knows" better.
Posted by: Griststone | January 16, 2007 3:14 PM
Not even inside the cell, Greg. Mutations in a lot of genes have extensively pleiotropic effects; it rarely makes sense in any way to try to associate one and only one "function" with each protein.
You are absolutely correct in what you say, but I was not clear in what I said, so your disagreement is about something else.
A gene codes for a protein (let's ignore talk of exceptions to this). If you only look at proteins, you will see that two identical genes code for identical proteins. Two different genes will code for two different proteins.
Most of the time.
As I mentioned, there are cases where a single gene codes for more than one protein, but that does not obviate the incorrectness of "most genetic differences are invisible" from the protein perspective.
And the two proteins may not be different in how they function. But at the organism level, we do not say that heritable variation in a trait is not there if it does not have a function. It is still there.
Is this more clear?
I'm not actually advocating the use of the equation. Instead, I'm pointing out how it is interesting that this "antiquated" formulation is in some ways more true rather than less true, now that we have the technology to examine the proteins directly.
Posted by: Greg Laden | January 16, 2007 3:16 PM
gene 1 ------ gene 2 ------ gene 1
where the "two" gene 1s contribute to the same RNA strands and gene 2 to others?
Yes, sort of. Two genes can be transcribed into two different RNAs that are then joined together into a single molecule that translates into a protein. This is called trans-splicing.
That is not really two genes combining but it is two genes coding for what one would think by looking at it would be coded for by one gene.
In addition to that, many many proteins are made up of the primary products of more than one gene, stuck together to make a higher-order protein. That is very common.
Posted by: Greg Laden | January 16, 2007 3:20 PM
I wrote the original Wikipedia definition. I was hoping to produce a simplified definition that could be understood by the general public, since we had much criticism of the original article being far too technical. I've gone back to a transcript-based definition, since that will also cover functional RNA-encoding genes. This is indeed more accurate, but I worry that many people who are trying to find out what this "gene" thing their newspaper keeps talking about will stop reading immediately.
If anybody wants to help, please feel free to contribute.
Posted by: Tim Vickers | January 16, 2007 3:23 PM
Is it common to conflate a gene and its protein? You wrote ...this gene that transports ions across cell membranes... Isn't it the protein that does the transport, not the gene? Since, as other comments explain, genes and proteins are not one-to-one, why does this usage continue?
Posted by: jeff | January 16, 2007 3:36 PM
Greg- I still get hives at the thought of equating a protein with a "trait". So while I think I follow what you're trying to say, I still think the formulation in terms of "traits" is highly misleading to the uninitiated and is best left alone altogether.
Posted by: Steve LaBonne | January 16, 2007 3:38 PM
No, that bit of the article isn't mine and is quite wrong. This page needs a great deal of work and any contributions from people reading this blog would be most welcome.
Posted by: Tim Vickers | January 16, 2007 3:40 PM
Griststone- this is why I don't even like the word "gene" and wish we could somehow get rid of it (I know, fat chance). Dawkins is simply talking like a classical geneticist in that passage, and as PZ warns in his post the referent of "gene" in that world maps very imperfectly indeed onto what molecular biologists think of as a gene.
Posted by: Steve LaBonne | January 16, 2007 3:41 PM
Can you tell me in descending or ascending order of size the relationship between: atom, nucleus, cell, nucleus, molecule, gene, chromosome, amino acids, proteins and all the others I have forgotten...
Knowing full well I'm inviting a million quibbles for the exceptions and nuances I'm not going to cover, a (very) quick primer to set off your necessary Google searches...
First, descending order of size is difficult. There's some overlap, as some of those are classes of things. Proteins and genes are variable in size. And both are specific types of molecules. But roughly:
cell, nucleus (1), chromosome, gene is one easy sequence (large to small), and
protein, amino acid is another, and finally
molecule, atom, nucleus (2) is a last one.
... note that nucleus (1) and nucleus (2) are entirely different things. See the last thing below.
Quick descriptions:
Atom: one or more protons in combination with zero or more neutrons (the nucleus of an atom), surrounded by a cloud of electrons. The hydrogen atom has one proton, one electron, and (usually) zero neutrons. The oxygen atom has eight protons, eight electrons, and (usually) eight neutrons. Neutron counts vary within a given element; atoms with the same proton count but different neutron counts are called isotopes. I won't get into this further here.
Molecule: A combination of two or more atoms, chemically bonded to one another. Water is a chemical compound composed of two hydrogen atoms, each bonded to the same oxygen atom.
Organic molecule: Technically, just a molecule containing carbon (an element with six protons in its nucleus, and which can easily form interesting chains and branches and rings in a molecule). Organic molecules are interesting in biology because of that branching/chaining thing, and because most of biology concerns their interaction, probably again because of that branching/chaining thing. Inorganic molecules are also involved in biology, however. Water is technically an inorganic molecule (contains no carbon), but it's very important to biochemistry.
Amino acid: A class of organic molecule. There are about 20 or so different types in a typical organism. Pictures should be available anywhere. They're mostly pretty small. But a chain of these is called a peptide chain, or a polypeptide. Typically, if the polypeptide is big enough (and they can get quite big), and biologically interesting, we call it a protein.
Protein: See above. A longish, usually biologically interesting chain of amino acids. Note that while they start, structurally, as chains, due to the different amino acids involved, which have various electronegativities and sizes, they tend to fold/clump/twist into interesting shapes as they're assembled. Biological organisms do a lot with them. A very large number of the enzymes that carry out metabolism are proteins. There are also structural proteins. Cells make them by reading mRNAs (see below) and translating each codon of three nucleotides (also see below) into a corresponding amino acid, building the protein next to the mRNA. Google 'genetic code' for more.
mRNA. Messenger RNA. A single stranded RNA copied off a DNA template. Google 'nucleic acid complementary' to get an idea of what's going on there. Generally, an organism makes an mRNA from a DNA template, then a protein from the mRNA.
RNA. Ribonucleic acid. A big, chained molecule made up of a bunch of nucleotides. RNA and DNA differ a bit in their 'backbone'... DNA is missing one of the oxygens at each link that you find in RNA.
DNA. Deoxyribonucleic acid. Like RNA, a big, chained molecule made up of a bunch of nucleotides. Chromosomes are made up of DNA. Genes are physically DNA sequences.
Nucleotide. Another class of organic molecule. The small things out of which RNA and DNA are made. There are five bases that appear between RNA and DNA: Adenine, Thymine, Guanine, Cytosine, Uracil.
Nucleus: one of two things. Either a bunch of protons and neutrons in the centre of an atom, or the central, membrane-bounded organelle in a eukaryotic cell in which you find all the DNA. The latter is vastly bigger than the former, as it's made up of a whole lot of molecules, themselves each containing the former. Both are pretty small, compared to, say, a fruit fly, tho'.
Posted by: AJ Milne | January 16, 2007 3:48 PM
Greg- I still get hives at the thought of equating a protein with a "trait". So while I think I follow what you're trying to say, I still think the formulation in terms of "traits" is highly misleading to the uninitiated and is best left alone altogether.
Right. I don't think we are disagreeing here. But I'll take another go at it anyway.
Let's say I want to study the evolution of the holes made by woodpeckers. That could have applications, but I really should also look at the beaks and the bodies of the woodpeckers.
But then, of course, someone is going to discover that what the beaks are made out of .. how woodpecker beaks develop ... is special and related to their hole-making behavior. And so on. At some point you get all the way back to the genes themselves. My original study of the holes assumed an underlying genetic pattern, of course, but I was happy with different hole shapes and position to be the trait. And those things are traits. Nonetheless, at some point you get to the genes.
I quickly add I am not being reductionist here. I'm sure you lose all kinds of stuff when you head down to the genes, not the least of which in this case is the tree the holes are in and the grubs the woodpecker is going after and the woodpecker's competitors, etc. etc.
Anyway, all these other, underlying things ... beaks, talons, keratin microstructures in the beak, etc. are all traits as well.
All the way down to the protein, which is the last stop on this journey through traitland before we get to the gene (which is not a trait ... it's a gene).
So if you are looking at proteins, you see proteins. Not the other stuff. Not even beaks. In protein land, the statement that the film made "the vast majority of genetic variation is not observed in traits" is silly. That's all I meant.
So, I'm not equating a protein with a trait. I'm saying that at one very primary level a protein is a trait. But the protein is not the hole in the tree any more than the beak is.
Indeed, the VERY MOMENT a gene is expressed, the very FIRST thing that you get ... the most basic trait ... is always already a complex interaction of different elements including other genes. After all, the very expression of a gene in a multicelled organism depends on what cell it is in, and that was determined already by a process that involved numerous genes.
I quickly add that even at that level, the gene-> trait link is too simple (as I've been noting all along).
As far as being misleading to the uninitiated, so far I don't think we've come even close to dealing with that! PZ's post is too much. Tim's definition in Wikipedia is too little.
And this brings us to the crux of the problem. You can't do it. Explaining "gene" simply is like explaining "the details of the antilock braking system" simply during a TV commercial for a new car. Simply can't be done.
What is needed is a recognition of the complexity when the whole gene thing is invoked in the media.
Posted by: Greg Laden | January 16, 2007 3:55 PM
Knowing full well I'm inviting a million quibbles for the exceptions and nuances I'm not going to cover ...
No kidding!!! But you did a great job. Dividing the problem into different lists was key. Brilliant.
Posted by: Greg Laden | January 16, 2007 3:57 PM
OK, the new Wikipedia definition, comments?
"A gene is the unit of heredity and genes determine inherited features. In an organism, the set of genes in the genome interact to direct physical development and behavior. Genes are nucleic acid molecules such as DNA or RNA, and carry information. This information is contained in the base sequence of the nucleic acid molecule."
Posted by: Tim Vickers | January 16, 2007 4:02 PM
AJ Milne, a million thanks. At least I am starting to get these basic structures straight in my thinking. Thank you for taking the time to explain all this to a total rookie.
I think part of my problem is the specialisation. Like PZ is into development at a molecule level (right??) - molecular biology - and I have read books about what happens inside a cell - nuclear physics (??). But what I need is an elementary book that shows me the whole picture from the proton to the environment (ecology??). Is there such a book anyone can recommend? Otherwise I feel like I am floundering around.
Posted by: Suezboo | January 16, 2007 4:05 PM
Suezboo--
AJ Milne's response was right on, but a "purely verbal" description can be difficult to visualize...
The following approach may not assist you with integrating the emergent properties found at the different levels of nature, from quarks up to quasars, but it is a fun and visual way to start.
Google "Powers of Ten." The first results page will display several different versions--slide shows, web pages, java animations--to assist you with visualizing nature at various scales based on multiplying "up" or dividing "down" by ten.
This may help you begin to weave together or overlap a visual AND a verbal sense of what "fits" within what: how sub-atomic particles make up atoms of the different elements, which combine into the molecules of inorganic chemistry (which combine into the solids, liquids, gases and plasmas at all scales of the physical world, from individual crystals and motes to rocks, oceans, atmospheres, planets, stars, and galaxies); the atomic elements also make up simple "organic" molecules, which can then hook up into the lengthy twisted chains and side-chains of organic molecules (amino acids, nucleic acids, proteins), which are the constituents of organelles and cells (and the intracellular medium) in living critters, which form into the tissues and organs of multicellular lifeforms, who can congregate into couples, herds, sects, and societies...
From there, if it helps, feel free to google whatever topics this approach suggests.
It's so refreshing to come across people who actually want to LEARN about this amazing universe, rather than flaunt the false security of their ignorance.
Posted by: Steviepinhead | January 16, 2007 5:21 PM
There probably are such books, but I don't know any...
No, cell biology/molecular biology/...
BTW, keep in mind that proteins and DNA strands are humongous molecules made from the combination of smaller molecules (with water as a byproduct in both cases).
Posted by: David Marjanović | January 16, 2007 5:40 PM
"Powers of Ten" is good, but quite brief.
Posted by: David Marjanović | January 16, 2007 5:42 PM
I think this was an exceptional posting.
I note with interest that, as I write this, I find a total absence of the normal Pharyngulean ideological or trollish comments. Ahem.
Another point of interest is that, (intentionally?) phrasing such as:
the proteins have to bind to specific regions in the DNA to initiate the protein machinery to do the work of copying, and that machinery also recognizes certain regions of the DNA as places to stop copying
is ideal for naive "quote mining".
Posted by: John | January 16, 2007 6:10 PM
Keith Douglas asked:
Yes.
Refer to PZ's first figure above. See where a gene can be divided into coding regions (exons), and intervening regions (introns)? When the gene in that figure gets copied into RNA, the introns get cut out of the RNA copy, and the exons get connected back together to form one continuous piece of RNA. That 'spliced' RNA is the actual template for making a protein. So, the individual exons in the gene correspond to your "gene 1s."
Interestingly, it's quite possible for a different gene to be embedded within one of the introns. That would correspond to your "gene 2."
Search for terms like "overlapping genes" or "nested genes" if you want examples.
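The splicing step described above can be sketched in a few lines of Python. This is only a toy illustration, not how anyone annotates a real gene: the sequence is made up, the exon coordinates are assumed to be known in advance, and the `splice` function name is hypothetical.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices in order, discarding everything in between.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Made-up transcript: uppercase = exon, lowercase = intron.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUGAA" + "guaccccag" + "CGCUAA"
exons = [(0, 6), (18, 24), (33, 39)]

# The introns are cut out and the exons are joined into one continuous RNA.
mature = splice(pre_mrna, exons)
print(mature)
```

A nested gene would simply be a second set of coordinates that falls entirely inside one of the lowercase stretches.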
Posted by: qetzal | January 16, 2007 6:11 PM
There is also a simulation with "Cells Alive" called "how big" that starts with a hand holding a pin. Then you zoom in and on the pin are a mite (I guess) and a dot. You zoom in and the dot turns out to be some pollen and other stuff, and as you zoom in the other stuff turns out to be a lymphocyte, and a bacterium next to a walking stick.
Then you zoom in and the walking stick is an Ebola virus with some crud next to it.
Then ... you guessed it ... a rhinovirus.
So this is good for the larger end of the spectrum (compared to molecules).
Posted by: Greg Laden | January 16, 2007 6:12 PM
David - I can't believe I made a mistake right there ! I actually meant "inside the atom" but my mind is buzzing with all these terms.
Stevie - Thank You - that is all clear to me ! Amazing ! I glanced at Powers of Ten and it may be just what I need. I can google on from there.
Thank you. Retired politicos (like me and Al Gore) need to learn some science. I do enjoy learning this wholly new stuff so much.
Posted by: Suezboo | January 16, 2007 6:17 PM
Genes? Nonsense. It's turtles all the way down!
Posted by: Tukla in Iowa | January 16, 2007 6:48 PM
The NCBI bookshelf has free, searchable biology textbooks at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books.
It has the previous version of Modern Genetic Analysis,
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mga.section.132
Posted by: Jim Lund | January 16, 2007 6:52 PM
You're welcome Suezboo, and thanks, Greg.
Posted by: AJ Milne | January 16, 2007 7:12 PM
So, does anyone know the rationale behind calling the excised bits introns and the included bits exons? I just realized this past semester that I'd been teaching them backwards, and I have no idea how long ago I snapped to the logical connections from the actual ones.
Good news is it probably doesn't matter - no one's remembering details like this from their first semester anyway...
Posted by: Darby | January 16, 2007 8:31 PM
The original post is excellent. Combined with the comments from non-experts, experts, and the tie-in to improving the Wikipedia article, it is truly the best use of blogging for science education. Thanks to everybody. I learned a lot.
This -- if the comments are included -- is certainly a candidate for the next ScienceBlog compilation.
Posted by: AndyS | January 16, 2007 9:01 PM
The terms were invented by Walter Gilbert. I think the introns are called introns because they are confined in the loop that is spliced out of the RNA.
It does sound backwards though. Try "The INtrons are thrown IN the trash leaving only the EXellent EXons."
Posted by: Greg Laden | January 16, 2007 10:07 PM
This doesn't tell me anything about poutine and sushi!!
Posted by: Kevin Bryant | January 16, 2007 10:40 PM
Dear all,
I don't understand the fuss about defining genes. IMO all definitions are operational, i.e., they are formulated in a way that one can work with them. Indeed, genes were first identified as phenotypic traits, and for many questions (e.g. description of the inheritance of human genetic diseases) this approach is still valid. Of course there have been many additions to the original definition, and on the molecular level genes are much more complicated. However, all the different gene definitions can only be understood in the light of the biological knowledge at the time they were coined.
If one wants to be picky one will find several points in PZ's post for which exceptions are known. Some examples:
What about intron retention in alternative splicing, when a non-spliced intron becomes part of the transcript? Does this deny the definition of introns and exons? What about overlapping genes? What about TATA-less promoters? What about the differences between Pol I, II and III genes? What about promoter definitions? Where do they start and where do they end? Have transcription termination points for mammalian genes been identified finally (I did not follow this)?
Come on, this is biology.
So you will find an exception for any seemingly well defined issue. And for me this is what makes biology fun. One just has to know which definitions are appropriate for the issues one is working on.
Posted by: sparc | January 16, 2007 10:52 PM
BTW, the sequence above is not the CFTR cds but part of its mRNA sequence that contains the cds and parts of the 5' and 3' UTRs. You may search the correct ORF by looking for initiation and stop codons. Or click on the link PZ has given and look for cds on the page you are directed to.
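Searching for an ORF by looking for initiation and stop codons can be sketched as follows. This is a minimal illustration under stated assumptions: it scans only the three forward reading frames, treats the first ATG in a frame as the start, and uses the three standard stop codons. The `find_orfs` name, the `min_codons` parameter, and the toy sequence are all hypothetical, and real start-site selection is messier than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 0-based and end-exclusive, longest first.

    Each pair spans an ATG through the next in-frame stop codon.
    """
    seq = seq.upper().replace("U", "T")  # accept mRNA or cDNA letters
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first initiation codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume looking for the next ATG
    return sorted(orfs, key=lambda o: o[1] - o[0], reverse=True)


# Made-up mRNA-like sequence: short 5' UTR, a tiny ORF, short 3' UTR.
toy = "GGCATGGCCTTTTAAGG"
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```

On a real mRNA record, the longest ORF found this way is only a candidate; the annotated cds on the database page remains the authority.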
Posted by: sparc | January 16, 2007 10:56 PM
My definition of a gene is simply "a DNA sequence that is transcribed." I don't include regulatory regions in my definition of a gene. Those regions control the expression of the gene but they aren't part of the gene itself.
It's like a car and the keys to a car. The keys aren't part of the definition of the car.
I've got a little essay on the definition of a gene that includes all of the problems and exceptions. Maybe I'll post it to give a different perspective. My definition includes prokaryotic genes. :-)
Posted by: Larry Moran | January 16, 2007 11:08 PM
My definition of a gene is simply "a DNA sequence that is transcribed." I don't include regulatory regions in my definition of a gene. Those regions control the expression of the gene but they aren't part of the gene itself.
This is good because it is in line with the gene-phenotype thing so it is a useful way to characterize a gene. However, why stop there? Why not say:
"A sequence of DNA that is transcribed and translated"
This way we also get rid of the RNA templates from the definition, which is kind of handy.
Posted by: Greg Laden | January 16, 2007 11:19 PM
Well, sure...this is MY definition, though, and since I'm most interested in a) regulation and b) metazoans, that tends to bias my perspective. In fact, you read some development papers, and the part of the gene you emphasize in your definition isn't so important -- you can replace it with β-gal and it's still interesting. I think my second paragraph admits that there are lots of other good ways to look at a gene.
Translation shouldn't be part of the definition. There are too many interesting bits of RNA floating around that don't need to be translated to have a function -- we don't want to exclude them.
Posted by: PZ Myers | January 16, 2007 11:34 PM
Translation shouldn't be part of the definition. There are too many interesting bits of RNA floating around that don't need to be translated to have a function -- we don't want to exclude them.
I'm not interested in EXCLUDING them. They will be allowed to remain. But how they get there and what they do is very different than what proteins do. They are probably linked to fitness in a different way than protein coding genes.
By including translation, you get to say "A gene codes for a protein" without anybody getting in a tizzy. That is a powerful tool in pedagogy. (you could still say it but it would be yet one more white lie among so many that already exist).
But, as you say, that's your definition. I will simply have to post my own if that's the way it's going to be...
Posted by: Greg Laden | January 16, 2007 11:38 PM
Posted by: sparc | January 16, 2007 11:54 PM
For a completely operational definition of 'gene' I've always been fond of George Williams' (1966):
... this is a conceptually challenging encapsulation even when one knows that the endogenous changes he's referring to are those like crossover and mutation. But it's a useful and generalized definition, and so was cited by Dawkins in his Extended Phenotype book, and at least implicit throughout the Selfish Gene - which nowhere details any specific gene. It didn't have to - this conceptual gene suffices.
Posted by: thwaite | January 17, 2007 12:17 AM
In addition Larry: What about the genes encoded by RNA viruses?
Posted by: sparc | January 17, 2007 12:19 AM
In addition Larry: What about the genes encoded by RNA viruses?
Good point, but it may be a little like saying that a definition of the digestive system has to include tapeworms.
Posted by: Greg Laden | January 17, 2007 1:29 AM
Not to be daft, but don't these operational definitions fail to account for theory? What is interesting about a gene is what it does, and how, not how much of a DNA snippet it is. Circularities might be nice for those working within specific disciplines, but they don't do much to justify ideas about how genes relate to heritable traits and natural selection. After all, I can adopt an operational definition of the solar system that says the sun goes around the earth every 24 hours, and it will get me home before dark every time, but it won't be true, and I'll have a bitch of a time landing a rocket on the moon.
Posted by: Griststone | January 17, 2007 2:06 AM
Ah, so that's what an intron is... no chance of turning Mr. Barclay into a giant spider, then? (Or was that something else?) Yes, I fear I got way, way too much of my introduction to scientific notions from Star Trek.
Yeah, that Powers of Ten thing is spiffy... it reminds me of a book I had as a little kid, called "Cosmic Scale" or "Cosmic View" or something like that. It starts at 1-1 scale with a picture of a kid sitting on a hill, then zooms out... then goes back to the kid, zooms in to the mosquito sitting on his hand, then the blood cells, and so on until it shows an atom; this was a kids' book from the eighties, so it stopped there, didn't delve into the nucleus.
Posted by: lytefoot | January 17, 2007 2:08 AM
The concept of the gene is certainly fascinating and the history of the concept has involved some of the truly great thinkers in biology. Harvey, Goethe, Bateson, Boveri, etc etc. Seems like the big problem with the concept is that everyone defines it differently depending on what branch of biology you are working in. The best book I have read on the subject is Jacob's Ladder by Henry Gee. Lots to disagree with in there but a hell of a read and a real page turner.
Posted by: Chris Surridge | January 17, 2007 5:06 AM
Posted by: sparc | January 17, 2007 5:13 AM
It took a long time for someone to mention RNA virus encoded genes, or even functional non (protein)coding 'genes'.
So a microRNA encoded by an RNA virus is not a gene ?
What is it then ? Are a lot of us wasting our time doing lab experiments on such entities ?
The central dogma (DNA - mRNA - protein) is easy for teaching purposes but is not really the best current understanding of what constitutes a gene.
In my opinion a gene is a set of instructions for the production of an expressed biological product with a functional effect.
Regulatory elements such as promoters or enhancers, which also encode information, are not expressed and are usually not included in the definition of 'gene'. To date this information has only been found in nucleic acids although some other biomolecule(s) may have predated this in evolutionary history (and may still exist if we look hard enough or in the correct places).
Posted by: MartinC | January 17, 2007 5:20 AM
Matt Ridley's book, "Nature via Nurture" -- which may have been given a different title in the States -- has seven different definitions of "gene" in the course of a very clear discussion.
But that presentation was beautiful, PZ, and an example of all that's best about pharyngula.
Posted by: Andrew Brown | January 17, 2007 6:08 AM
sparc says,
What? The prediction programs can find "genes" whether or not you include regulatory sequences in the definition. I don't understand your objection. Are you saying that if we remove regulatory sequences from the definition of a gene then we can't use them to locate the real gene?
That doesn't make sense.
Posted by: Larry Moran | January 17, 2007 6:20 AM
MartinC says,
That's because your version of the Central Dogma is wrong! :-) See Basic Concepts: The Central Dogma of Molecular Biology.
As for RNA genes, they are an exception to the rule. There are lots of exceptions. This is biology. Biology is messy.
I'm going to try and post something on the definition of a gene in order to explain the complications. John Wilkins can help since his current boss is one of the world's experts on the topic.
Posted by: Larry Moran | January 17, 2007 6:27 AM
We may not be able to agree on a common gene definition. However, the discussion proves that commenters here are able and indeed willing to question definitions rather than just accepting arguments from PZ's authority. Quite entertaining and instructive compared to the lame self-referential, reverential 'discussions' on UD.
Unfortunately though, I guess we who took part in this discussion may benefit more than those guys who may be redirected to this thread during the search for some basics in the future because we had the privilege to see the discussion develop. I don't know how effective this thread will be for people who will just read it in the future.
Posted by: sparc | January 17, 2007 6:32 AM
Larry, it's not exactly my version of the central dogma, it's essentially the 1965 Jim Watson textbook version and yes, of course it is now seen as an outdated way of seeing information flow in biological systems.
However, RNA genes may be exceptions to the rule?
I completely disagree.
They may not be DNA encoded but the question is what is a gene, not what is the currently best understood or most common form of the gene.
Just because we DNA-based individuals have taken over the establishment doesn't mean we should look down on our probable RNA-based ancestors as somehow geneless.
Posted by: MartinC | January 17, 2007 7:18 AM