African and Asian genome sequences: the last of the single human genome papers?


The latest issue of Nature is just as it should be: nearly wall-to-wall human genomics, with a special focus on personal genomics (more on that later).

The main event is a potential historical milestone: quite possibly the last two papers ever to be published in a major journal describing the sequencing of single human genomes from healthy individuals [1]. The papers, which both appear to be open access (kudos to Nature for that decision), describe the analysis of the first Asian genome by researchers at the Beijing Genomics Institute, and the sequencing of the first African genome by a cast of thousands centred around next-gen sequencing company Illumina.

Both genomes were sequenced using next-generation sequencing technology from Illumina, which generates sequence information in the form of very short (35-50 base pair) reads. Although each read is extremely short and relatively error-prone compared to reads from old-fashioned sequencing methods, the sheer number of reads generated by the Illumina technology makes whole-genome sequencing feasible: both studies stitched together in excess of 3 billion of these reads to assemble their genomes. That means that each base in the genome was covered, on average, by over 30 independent reads (as opposed to an average of around 7 reads for the Watson and Venter genomes) - more than enough to compensate for the increased error rate of the Illumina platform.
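For anyone who wants to check the coverage arithmetic, here is a back-of-envelope sketch in Python. The genome size (about 3.1 billion bases), the round read count and the 35 bp read length are my own illustrative assumptions rather than figures taken from either paper, and the "fraction uncovered" estimate assumes reads land uniformly at random (a Poisson model), which real genomes with repeats and sequencing biases certainly don't obey.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads overlapping each base of the genome."""
    return n_reads * read_length / genome_size


def fraction_uncovered(depth):
    """Expected fraction of bases hit by zero reads.

    Assumes reads fall uniformly at random, so per-base depth is
    Poisson-distributed with mean `depth` (an idealisation).
    """
    return math.exp(-depth)


GENOME_SIZE = 3_100_000_000  # assumed haploid human genome size, in bases

# Roughly 3 billion reads of 35 bp each (assumed round numbers).
illumina_depth = mean_coverage(3_000_000_000, 35, GENOME_SIZE)

print(f"Illumina-style run: ~{illumina_depth:.0f}x mean coverage")
print(f"bases with no reads at 7x:  {fraction_uncovered(7):.1e}")
print(f"bases with no reads at 30x: {fraction_uncovered(30):.1e}")
```

In practice not every read maps uniquely, so the usable depth is lower than this naive figure, but the gap between roughly 7x and 30x-plus is the point.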

These papers are both important technical achievements, the first of many publications that will emerge over the next few years taking advantage of these short-read technologies to characterise entire human genomes (Watson's genome was also sequenced using next-generation technology, but on a platform that generated much longer reads than the Illumina system at a correspondingly lower throughput). The Illumina platform allows the assembly of an individual genome sequence far more quickly and cheaply than old-school Sanger chemistry, or the 454 platform used for Watson's genome, paving the way for affordable personal genome sequences.

However, many technical challenges still remain. Short-read technology struggles to map large insertion/deletion polymorphisms (so-called structural variation), and is almost completely unable to generate accurate sequence data for the 10-15% of the genome that lies in highly repetitive regions. In addition, such platforms are largely unable to determine whether two heterozygous variants are found together on the same copy of a chromosome, or on separate copies (a problem known as phasing). Generating a complete genome sequence in the strictest definition, including accurate phasing, awaits the development of ultra-long-read single-molecule sequencing platforms.
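The phasing problem is easier to see with a toy example. The sketch below is purely illustrative: the two-site haplotypes, the positions and the helper names are all made up for the purpose. It shows that the "same copy" and "separate copies" arrangements give identical per-site genotypes, and that only a read long enough to span both sites can tell them apart.

```python
def site_genotypes(hap1, hap2):
    """Unordered genotype at each site: all that unlinked short reads reveal."""
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]


def can_span(pos_a, pos_b, read_length):
    """True if one contiguous read of this length could cover both sites."""
    return abs(pos_a - pos_b) < read_length


# Two heterozygous sites; the variant alleles are G (site 1) and T (site 2).
cis = ("GT", "AC")    # both variants on the same chromosome copy
trans = ("GC", "AT")  # one variant on each copy

# Site by site, the two arrangements look exactly the same...
print(site_genotypes(*cis))
print(site_genotypes(*trans))

# ...but the underlying haplotypes differ, which a spanning read would show.
print(set(cis) == set(trans))

# Hypothetical positions: 20 bp apart is phasable with 35 bp reads, 5 kb is not.
print(can_span(1_000, 1_020, 35), can_span(1_000, 6_000, 35))
```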

The papers also illustrate the challenges that lie ahead for personal genomics: an analysis of the Asian genome for potential disease-causing mutations revealed one heterozygous (i.e. single-copy) mutation known to cause deafness, and a possible increase in genetic risk for tobacco addiction and Alzheimer's, but little in the way of convincing, medically actionable results. As I've said before, the technology here is moving much faster than our understanding of the underlying biology - you and I will be able to afford our genome sequences long before we have much idea what they mean.

The other important message from these papers is that we can no longer learn very much in terms of biology from individual genome sequences, at least from healthy people. Each additional genome sequence does contribute a list of new genetic variants, but these returns are rapidly diminishing: in both studies only ~25% of the single-base variants are novel. This proportion is admittedly substantially higher for insertion/deletion and larger structural variants (for which detection approaches are still immature), but that too will diminish with each new genome added to the database and as sequencing technology improves. By the time the 1000 Genomes Project has dumped its last petabyte of data on the web there will be relatively few polymorphisms (variants with a frequency of greater than 1%) left to discover, at least in the European, East Asian and West African populations.
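The claim about common variants running out can be sanity-checked with a simple calculation. This is a minimal sketch under assumptions of my own: chromosomes are sampled at random from a single population, and any variant present in the sample is detected perfectly. Neither holds exactly for a real project, where samples are split across populations and rare variants are easily mistaken for sequencing errors.

```python
def discovery_probability(allele_freq, n_genomes):
    """Chance that a variant appears at least once among n diploid genomes.

    Assumes 2 * n_genomes independently sampled chromosomes and
    perfect detection (both idealisations).
    """
    return 1 - (1 - allele_freq) ** (2 * n_genomes)


for freq in (0.10, 0.01, 0.001):
    for n in (1, 100, 1000):
        p = discovery_probability(freq, n)
        print(f"frequency {freq:>5.1%}, {n:>4} genomes: {p:.3f}")
```

Under these assumptions a 1% variant is all but certain to turn up somewhere in a thousand genomes, while a meaningful share of 0.1% variants would still be missed, which is where the remaining discoveries will come from.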

So attention has already well and truly turned to converting sequence into biological meaning - and that's a job that will ultimately require many hundreds of thousands of genome sequences, each attached to information about biological traits and disease status. That means the end of the brief era of high-profile "single human genome" papers, which started in a sense with the anonymised, pooled and fragmented human reference sequences published in 2001, peaked with the celebrity genomes of Venter and Watson in 2007/2008, and now ends (I suspect) with two anonymous non-European genomes.

Of course, we will still see a number of papers describing whole genomes of diseased individuals, particularly cancer samples - indeed, there is one such paper in the same issue of Nature, which you can read about at PolITiGenomics from David Dooling, one of the authors on the paper. [Added in edit following prompting in comments: This paper has its own set of firsts: first female genome published (Leiden University's "first female genome" got a lot of media attention, but is yet to emerge in print); first disease genome sequenced; first paper publishing multiple genome sequences (one of the cancer and one of a healthy skin sample from the same patient); and probably other firsts I haven't thought of. If it were in any other issue of Nature I'd be all over it, but I've been completely distracted by the other cool stuff in this issue!]

But nonetheless, the age of the one-genome paper is fast drawing to a close. Human genetics now moves into a phase of new challenges and rewards - the era of population genomics.

[1] Update: John Hawks spoils my argument by noting that we will still see fossil and archaeological single genome sequence papers in major journals. Drat! Like any good scientist, I have revised my hypothesis in the light of opposing evidence: it now states that we will see no further single genome papers in major journals using DNA from healthy modern humans. Any other exceptions I missed?


Wang et al. (2008). The diploid genome sequence of an Asian individual. Nature, 456 (7218), 60-65. DOI: 10.1038/nature07484

Bentley et al. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456 (7218), 53-59. DOI: 10.1038/nature07517

Images of average East Asian and African faces from Face Research.


This is a really interesting post - something that is a bit mind-boggling. It is no longer an exciting result when a single genome is sequenced... unbelievable.

I just wanted to bring up a point concerning your statement that Illumina sequencing allows one to sequence a genome easily. This is true only in resequencing scenarios, where there is an established and verified reference genome against which you can align the short sequence reads. What will be really exciting is when we can get longer reads at this level of throughput, so that we can sequence anything.

We are at a point with sequencing technology where, at least for organisms with good reference genomes, we are no longer limited by the amount of genetic information we have. In some of these systems (humans, Drosophila, stickleback, zebrafish, etc.) we are getting to the point where we could resequence all of the individuals within a given population. Mapping traits, looking at variation, measuring whole-genome selection: these are now possible.

A scenario that, as an evolutionary biologist, I drool over. How great would it be to take a population of organisms and sequence them all. Impose one generation of selection on some trait of interest. Resequence the genomes for generation 1. Repeat.

This is already going on in bacterial systems. Soon we can begin to really get at how selection acts on whole genomes.

Heh - while you were writing your comment I was busily editing my post to make a similar point about the need for long-read technology, but you beat me to it. This is an excellent point that is worth repeating frequently amid the hype about "whole genome sequences" (and in fact probably warrants its own post.)

I couldn't agree more about the exciting prospects for evolutionary biology of having essentially unlimited sequence data. However, I'm also rapidly coming to terms with the horrific informatic demands of generating this much data: the sort of experiment you're describing would generate petabytes of data that would need to somehow be filtered, processed, analysed and stored. How many evolutionary biologists currently have the skills required to manage data-sets on that scale?

We're all going to have to learn pretty damn fast if we want to take full advantage of this technology...

As I understand it, isn't the cancer genome paper you linked to also remarkable for being the first female genome sequenced? Give the ladies their love too!

Hi Beth,

Fair enough - I've updated the post to reflect the blow against the genomic patriarchy struck by this paper.

It seems that in the near future short-read technologies such as Illumina's will cease to be "short", moving read lengths into the 75-100 bp range, where mapping gets a lot more accurate.
Furthermore, the short read platforms are releasing new protocols for paired-end read preparation that will soon allow an assembly analysis to mix paired-end samples with insert sizes ranging from 200-10000 bp.
Both of the above improvements will alleviate some of the problems raised here regarding how to resolve repeat regions and to some extent also phasing.

"By the time the 1000 Genomes Project has dumped its last petabyte of data on the web there will be relatively few polymorphisms (variants with a frequency of greater than 1%) left to discover, at least in the European, East Asian and West African populations."

But those SNPs with a minor allele frequency of less than 1% can be very informative in population genetics. Unfortunately, people are hesitant to call singletons polymorphisms, and they're often classified as sequencing errors. With more resequenced human genomes, we can increase our confidence that rare SNPs are real polymorphisms.