Why don't we finish the human genome first?

One of the interesting things I learned today was that many people are calling for the genome sequences of the chimps and macaques to be finished.

This is especially amusing because the human genome isn't quite done. We're primates, too! Why not finish our genome?

[I blame these new-found revelations on Twitter. Despite my youngest daughter's warning that only old people use Twitter, I've joined my SciBlings and taken the plunge. (You can even follow me! @digitalbio.)

Now, I get to indulge my geeky tendencies while waiting in line at the grocery store. I just type #cshl and voila! I get the low down on the best and boringest talks at the Cold Spring Harbor genome meeting.]

What do you mean the human genome isn't done!

Yes, I know there was a press conference a few years ago so people could congratulate themselves for having finished sequencing the human genome. And, I suppose the main characters were finished with sequencing the human genome.

But finished ≠ complete, and the word "done" is rather subjective. I know I sometimes define "done" differently than my children do.

"Done" doesn't mean we know the entire sequence.

Figure 1. The human genome, all done up, from the Genome Reference Consortium

This image shows the parts that are done in blue and the parts that are not done in black. Presumably, medically unimportant genes (if any) map to those black, unfinished parts.

The page also has a funny note saying "Next Build Release Spring 2009." Is it Spring yet?

I know there were good reasons for doing the parts that got finished and leaving the other parts out of the definition of "done." But DNA sequencing has come a long way from the days when people used to joke about sentencing errant post-docs to concentrated sequencing camps.

But if there's a commitment to finishing a primate sequence, can we get our genome done too?


Hey Sandra,

Well... the non-human primate genomes could be substantially improved with a few lanes of Solexa, whereas the remaining human regions are virtually impenetrable to short-read sequencing.

So while finishing human is important, the benefit/cost ratio is probably higher for spending a little extra effort on the primate sequences right now.

Hi Daniel,

Wouldn't it be helpful though to have a reference human sequence for comparison? It seems like we should be able to do it now.

Did those regions just get ignored when Venter's genome was done?

Hey Sandra,

Another complication with finishing any genome is that there is no single reference genome that represents a complete species. Look at the MHC haplotypes, for example... Who knows, we might actually have to abandon the whole notion of a reference sequence as it is defined at the moment. Maybe we'll get >100 reference sequences or so, or maybe a virtual one that combines all possible sequence variants into one reference. That virtual human genome might then be >100 Gb, but would only serve as a scaffold to attach "real" genomes to.
It'll be interesting to see what the future brings.

And I concur with Daniel (probably because I work on the great apes :-)

They weren't ignored, they simply couldn't be assembled: a fair chunk of those regions are almost as impenetrable for BAC/capillary sequencing. Most of those black segments are the extraordinarily repetitive heterochromatic regions around centromeres, or the even nastier segments on the Y chromosome.

But should we invest time putting together a complete human reference assembly? Absolutely - and not just spanning the repetitive regions, but also including all of the novel sequence found in some individuals but not others. There was some talk at the 1000 Genomes meeting earlier this week about this, and I think it will happen sooner rather than later.

Should we also encourage a sequencing centre to devote a couple of lanes of their spare sequencing capacity to increasing coverage of chimp/macaque/other primates? Given that one lane of Solexa now gives ~1X coverage for a primate genome, it would be criminal not to.

Hey Sandra,

You should learn, and write, about that 1000 human genomes project. It is an international project. Venter, of course, is involved.

You should also learn that there are a few "finished" human genomes already.

The difficulties in truly finishing the genome, as some explained above, are, well, hard to overcome for now. So, no harm done if we insist on other primates being sequenced. Since we are so similar, I doubt those genomes will get to a better "finish" than the human ones. But what the heck.

Best,
--Gabo

Oh, I was forgetting.

There is a recent article comparing a "human cancer genome" to other human genomes: Venter's and Watson's.

Best again,
--Gabo

(maybe science or nature)

Thanks Gabo,

I am interested in the 1000 genomes project, too.

For the past few months, I've been pretty focused, work-wise, on transcriptome analysis. That's why I would like to see a reference genome, or maybe several reference genomes get completed so we could use them for aligning Next Gen data.

Plus, I think human DNA would be easier to collect.

It is interesting: the reasons that have been given here as to why the human genome is not complete are exactly the reasons why we should do it! How else will we learn to deal with these issues?

At least they got rid of the vector sequence data that was assembled into the first few drafts!

Gabo:
There are no finished human genomes. The data being produced in the 1000 genomes is largely of the short-read variety and will be used to align to the public reference assembly to identify differences, but the data will not be assembled as such. This is true of all of the genome sequences that have been reported, save the Venter genome.

Sandra:
Sorry it took a while to get the GRC pages updated, we had some technical difficulties, but they are updated now. We are trying to use data from the Venter assembly, as well as other sources, in order to improve the reference. We even have a way to collect assembly problems from users!

The seeming contradiction in wanting to "finish" the genomes of chimp and macaque versus finishing the human is simply in the definition of the word "finish". We now have random shotgun sequencing for several different humans, with many more coming, in aggregate comprising fantastically deep coverage of the human genome. No other genome of comparable size has anything close to as complete coverage and analysis as the human genome. The few remaining gaps have not been closed because: (1) It is not simply a matter of sequencing, but instead is the resolution of vast tracts of repeated sequences. (2) The number and composition of these repeats is not very interesting scientifically and probably varies a lot among humans anyway. (3) The effort at understanding the variation among humans is much more interesting and important (see the fantastic work of Evan Eichler's group, for example) than obsessing about these gaps of trivial importance. In contrast, when people are talking of "finishing" the genomes of chimp and macaque and other organisms, they are speaking of further efforts at improving the quality of these draft genomes. No one even imagines that sufficient effort will be expended any time soon to get these anywhere close to the quality of the human genome sequence.

I may be wrong here, but I was under the impression that there were several bits of the genome that, as yet, *couldn't* be finished. Very repetitive C/G sequences, or impossible splice sites and things.

Of course, teaching always lags behind research a bit, so I could be very wrong there. :)

Hi,
The black parts on the genome could not be done due to technological reasons. The blue parts are as good as it gets, so it's perfectly reasonable to consider our genome 'done': it simply couldn't have been done 'more'. The black parts are not doable (or rather: not doable with reasonable time, cost and effort using current techniques), for now.

You could always argue that the blue parts could be done better (improve error rate, assembly, annotation etc.), but this is in fact happening all the time, and subsequent builds of the genome are affected by this ongoing work.

There is no 'absolute' genome, because we still don't quite know how the genome works etc. (imagine that a few years ago we didn't know about microRNAs). But in terms of raw sequence, it's done.

It seems that now the goal is to have complete (in the current reference genome sense) diploid genomes from as many individuals as possible. BTW: the reference genome we all know is not a diploid genome, and not a genome from any one person; it's a composite from a couple of individuals ;-) - as such it's not too good as a reference, but it's the best there is.

cheers
yot

Yot-
The black parts of the genome (in the picture above) are certainly more difficult, but work is being done to obtain sequence information for these regions. See http://genomereference.org for more information.
But the idea is to make the reference even better; it currently represents some highly variant regions with 'alternate loci'. Producing these alternate loci allows us to make the reference assembly better. In fact, we show 8 different haplotypes at the MHC locus (the representation in the reference chromosome + 7 alternate loci), so GRCh37 can represent more diversity than a diploid assembly at a given locus.

Thanks noyk,

I just finished reading the paper. yeah!