How big does the N need to be?

Estimating the number of unseen variants in the human genome:

...Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population in-between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%) the number of individuals that need to be sequenced is small (~350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (~150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (> 3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.

The details of this matter for genetic architecture, especially for complex traits such as height and IQ.
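The diminishing-returns pattern is easy to reproduce with a toy model (this is my own back-of-the-envelope sketch, not the paper's beta-binomial machinery): a variant at population frequency p shows up in a sample of n diploid individuals with probability 1 - (1-p)^(2n), and if you weight frequencies by a neutral site frequency spectrum (density proportional to 1/p) you can average that detection probability over all variants above a frequency cutoff.

```python
import math

def detection_prob(p, n):
    """Probability that a variant at population frequency p appears
    at least once among the 2n chromosomes of n diploid individuals."""
    return 1.0 - (1.0 - p) ** (2 * n)

def fraction_detected(n, pmin=0.001, pmax=0.5, steps=10_000):
    """Expected fraction of variants with frequency in [pmin, pmax]
    detected in a sample of n individuals, assuming a neutral site
    frequency spectrum (density proportional to 1/p).

    With the substitution u = log(p), the 1/p density becomes uniform
    in u, so we can just average the detection probability over a
    log-spaced grid of frequencies.
    """
    log_lo, log_hi = math.log(pmin), math.log(pmax)
    grid = [math.exp(log_lo + i * (log_hi - log_lo) / steps)
            for i in range(steps + 1)]
    return sum(detection_prob(p, n) for p in grid) / len(grid)

if __name__ == "__main__":
    for n in (50, 150, 350, 3000):
        print(f"n = {n:5d}: {fraction_detected(n):.1%} of variants detected")
```

Even this crude sketch lands in the same ballpark as the paper: roughly 150 individuals already capture on the order of 80% of variants with frequency at least 0.1%, while pushing toward completeness takes thousands, because the remaining undetected variants are concentrated at the rare end of the spectrum.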



Isn't the highlighted portion (diminishing returns of sample size) basic statistics? Seems odd to phrase it as if the diminishing returns are the notable part instead of the remarkably low number needed for that percentage.

That there are diminishing returns for increasing sample size is well known to statisticians, but some biologists need reminding. The results depend on the variants following a beta-binomial distribution, and the paper is an application of a 30-year-old result by Efron. Still, pretty impressive.

By statsquatch (not verified) on 01 Apr 2009 #permalink