Population substructure within China

The state of China has 1/5 of humanity within its borders, so its genetic structure is of interest. It is obviously important for medical reasons to clarify issues of population structure so that disease susceptibility among the Han is well characterized, in particular with the heightened medical needs of an aging population in the coming generation. And of course, there are the nationalistic concerns. About 20 years ago L. L. Cavalli-Sforza reported in The History and Geography of Human Genes that his South Chinese samples were genetically closer to Southeast Asians than to North Chinese. This result has been somewhat muddled in the past generation with the rise of uniparental markers (NRY and mtDNA, passed through the male and female lineages respectively) along with studies which utilize hundreds of thousands of SNPs. One thing that seems to be clear is that genes vary as a function of geography in China (just as they do pretty much everywhere).

Two new articles in AJHG shed some more light on this issue. First, Genomic Dissection of Population Substructure of Han Chinese and Its Implication in Association Studies:

To date, most genome-wide association studies (GWAS) and studies of fine-scale population structure have been conducted primarily on Europeans. Han Chinese, the largest ethnic group in the world, composing 20% of the entire global human population, is largely underrepresented in such studies. A well-recognized challenge is the fact that population structure can cause spurious associations in GWAS. In this study, we examined population substructures in a diverse set of over 1700 Han Chinese samples collected from 26 regions across China, each genotyped at ~160K single-nucleotide polymorphisms (SNPs). Our results showed that the Han Chinese population is intricately substructured, with the main observed clusters corresponding roughly to northern Han, central Han, and southern Han. However, simulated case-control studies showed that genetic differentiation among these clusters, although very small (FST = 0.0002 ~ 0.0009), is sufficient to lead to an inflated rate of false-positive results even when the sample size is moderate. The top two SNPs with the greatest frequency differences between the northern Han and southern Han clusters (FST > 0.06) were found in the FADS2 gene, which associates with the fatty acid composition in phospholipids, and in the HLA complex P5 gene (HCP5), which associates with HIV infection, psoriasis, and psoriatic arthritis. Ingenuity Pathway Analysis (IPA) showed that most differentiated genes among clusters are involved in cardiac arteriopathy (p < 10^-101). These signals indicating significant differences among Han Chinese subpopulations should be carefully explained in case they are also detected in association studies, especially when sample sources are diverse.

And, Genetic Structure of the Han Chinese Population Revealed by Genome-wide SNP Variation:

Population stratification is a potential problem for genome-wide association studies (GWAS), confounding results and causing spurious associations. Hence, understanding how allele frequencies vary across geographic regions or among subpopulations is an important prelude to analyzing GWAS data. Using over 350,000 genome-wide autosomal SNPs in over 6000 Han Chinese samples from ten provinces of China, our study revealed a one-dimensional "north-south" population structure and a close correlation between geography and the genetic structure of the Han Chinese. The north-south population structure is consistent with the historical migration pattern of the Han Chinese population. Metropolitan cities in China were, however, more diffused "outliers," probably because of the impact of modern migration of peoples. At a very local scale within the Guangdong province, we observed evidence of population structure among dialect groups, probably on account of endogamy within these dialects. Via simulation, we show that empirical levels of population structure observed across modern China can cause spurious associations in GWAS if not properly handled. In the Han Chinese, geographic matching is a good proxy for genetic matching, particularly in validation and candidate-gene studies in which population stratification cannot be directly accessed and accounted for because of the lack of genome-wide data, with the exception of the metropolitan cities, where geographical location is no longer a good indicator of ancestral origin. Our findings are important for designing GWAS in the Chinese population, an activity that is expected to intensify greatly in the near future.
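To make the stratification worry in these abstracts concrete, here is a minimal sketch of how a SNP with no effect on disease can still look associated when cases and controls are drawn from the north and south in different proportions. Everything in it is an assumption for illustration: the allele frequencies (0.46 and 0.54), the 70/30 sampling split, and the function names are hypothetical, and it works with expected allele counts rather than the random simulations the papers ran.

```python
def mixed_freq(p_north, p_south, north_fraction):
    """Allele frequency in a sample that is a north/south mixture."""
    return north_fraction * p_north + (1.0 - north_fraction) * p_south


def allele_chi_square(case_freq, control_freq, n_cases, n_controls):
    """Pearson chi-square (1 df) on a 2x2 table of expected allele counts.

    Rows are cases/controls, columns are the two alleles.
    Each person contributes two alleles.
    """
    a = 2.0 * n_cases * case_freq              # cases, allele A
    b = 2.0 * n_cases * (1.0 - case_freq)      # cases, allele a
    c = 2.0 * n_controls * control_freq        # controls, allele A
    d = 2.0 * n_controls * (1.0 - control_freq)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical SNP: differs modestly between north and south,
# but has NO effect on disease risk.
P_NORTH, P_SOUTH = 0.46, 0.54


def stratified_chi_square(n_per_group, north_in_cases, north_in_controls):
    """Expected chi-square when cases and controls differ only in ancestry mix."""
    case_freq = mixed_freq(P_NORTH, P_SOUTH, north_in_cases)
    control_freq = mixed_freq(P_NORTH, P_SOUTH, north_in_controls)
    return allele_chi_square(case_freq, control_freq, n_per_group, n_per_group)


if __name__ == "__main__":
    for n in (1000, 5000):
        unmatched = stratified_chi_square(n, 0.7, 0.3)
        matched = stratified_chi_square(n, 0.5, 0.5)
        print(n, "unmatched:", unmatched, "matched:", matched)
```

With the unmatched design the expected statistic already passes the usual 5% cutoff for one degree of freedom (about 3.84) at 1000 cases and 1000 controls, and it grows linearly with sample size. With geographically matched sampling it is zero. That is the logic behind using geographic matching as a proxy for genetic matching.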

Below is a PC chart which shows PC1 on the x-axis and PC2 on the y-axis. In green are South Chinese, and in blue the North Chinese. Japanese are the cluster to the top left, and red represents the HapMap Chinese sample.

i-4380ed34cb2865a1a7134e1cd1892c3c-hansup1.png

And here's a visualization of the ancestries of individuals from particular provinces and dialect groups using Structure (right) and Frappe (left) (the K's represent 2 or 3 putative ancestral populations respectively). It's ordered by ancestry within the classes. The rough geographical correlate is north-south. Note the variance in Singapore; most Singaporean Chinese derive from Fujian (with a large Hakka minority, and some Malay admixture on the part of Baba Chinese), but there were enough disparate migratory events that you don't see a bottleneck and a decrease in diversity compared to Chinese provinces. On the contrary. A minority of Singaporeans seem to be of North Chinese provenance, a result that would not be surprising in Taiwan, where such a migration is historically documented (after the fall of Nationalist China), but is more curious in Singapore, which was presumably part of the greater Fujianese Diaspora.

i-7331d71937b07188c4a9e8a71b3458eb-hansup2.png

Finally, here are pairwise Fst values. Remember that Fst captures the proportion of genetic variance that lies between populations rather than within them. Fst values between continental races are on the order of 0.15. This means 15% of the genetic variation is between races. The values below seem to show a maximum between-province/dialect difference in China of about 0.5% of the genetic variation. But despite this small value, note how easy it is in the plots above to differentiate individuals from northern and southern regions of China.

i-3001d312bf929760725b7d0536c2a581-hansup3.png
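For a feel of what an Fst of 0.005 means in terms of allele frequencies, here is a minimal sketch of the textbook heterozygosity-based definition, Fst = (H_T - H_S) / H_T, for biallelic SNPs and two equally weighted populations. This is an assumption on my part about the simplest form of the statistic, not the estimator used in the papers (published estimates typically add sample-size corrections), and the function names and example frequencies are hypothetical.

```python
def het(p):
    """Expected heterozygosity 2p(1-p) for a biallelic SNP at frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_components(p1, p2):
    """Return (H_T - H_S, H_T) for one SNP and two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = het(p_bar)                    # heterozygosity of the pooled population
    h_within = (het(p1) + het(p2)) / 2.0    # mean within-population heterozygosity
    return h_total - h_within, h_total


def fst_two_pops(p1, p2):
    """Single-SNP Fst; defined as 0 when the SNP is fixed in both populations."""
    num, den = fst_components(p1, p2)
    return num / den if den > 0 else 0.0


def fst_multilocus(freqs1, freqs2):
    """Genome-wide Fst as a ratio of sums over SNPs (not a mean of per-SNP ratios)."""
    num_sum, den_sum = 0.0, 0.0
    for p1, p2 in zip(freqs1, freqs2):
        num, den = fst_components(p1, p2)
        num_sum += num
        den_sum += den
    return num_sum / den_sum if den_sum > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical frequencies: an 8-point difference around 50%.
    print(fst_two_pops(0.46, 0.54))
    # Two hypothetical SNPs, one differentiated and one not.
    print(fst_multilocus([0.45, 0.50], [0.55, 0.50]))
```

Under this definition a SNP near 50% frequency needs only a 7 to 8 percentage point difference between two populations to give an Fst around 0.005. Small shifts like that, repeated across hundreds of thousands of SNPs, are what let individuals be separated so cleanly even though the overall Fst is tiny.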

Here are some comparable Fst values from Europe:

0.001 = Bulgaria-Austria
0.002 = Poland-Sweden
0.003 = Northern Italy-Switzerland
0.004 = Spain-Sweden
0.005 = Russia-France

I've left out the highest Fst values in Europe, which are between Finns and Southern Italians, on the order of 0.015. But from these data it looks as if Han Chinese are in the same order of magnitude of variance as Europeans in terms of their genetics, but a factor of two or so lower. But it may be that the coverage of genetic variation is just not as thick in China, so that outlier Han populations, the equivalent of Finns (perhaps Sinicized groups in Yunnan?), are out there waiting to push the mean variance higher. It is interesting, though not totally surprising, that different dialect groups in the same region exhibit large genetic differences. Language & genes often correlate because the former circumscribes the limits of marriage networks. The Teochew migrated from Fujian to Guangdong (to my knowledge they are the dominant Chinese group in Thailand), and are nearly as genetically distant from their Cantonese-speaking neighbors as they are from North Chinese. Interestingly, the Hakka, who are derived from North Chinese migrants according to their own history, seem to be closer to "indigenous" South Chinese. Nevertheless, they exhibit less genetic difference from North Chinese than do Cantonese speakers in Guangdong. This is obviously the tip of the iceberg; I suspect that the genetic topography of South China in particular will be surprising because of its geographical fragmentation, the role of powerful clan networks, and the recurrent history of migration from the North China plain by groups who manage to maintain their identities (e.g., Hakka).*

Citation: Jieming Chen, Houfeng Zheng, Jin-Xin Bei, Liangdan Sun, Wei-hua Jia, Tao Li, Furen Zhang, Mark Seielstad, Yi-Xin Zeng, Xuejun Zhang, and Jianjun Liu, Genetic Structure of the Han Chinese Population Revealed by Genome-wide SNP Variation, doi:10.1016/j.ajhg.2009.10.016

Citation: Shuhua Xu, Xianyong Yin, Shilin Li, Wenfei Jin, Haiyi Lou, Ling Yang, Xiaohong Gong, Hongyan Wang, Yiping Shen, Xuedong Pan, Yungang He, Yajun Yang, Yi Wang, Wenqing Fu, Yu An, Jiucun Wang, Jingze Tan, Ji Qian, Xiaoli Chen, Xin Zhang, Yangfei Sun, Xuejun Zhang, Bailin Wu, and Li Jin, Genomic Dissection of Population Substructure of Han Chinese and Its Implication in Association Studies, doi:10.1016/j.ajhg.2009.10.015

* It is attested that many groups emigrated from South China to North China, but it seems to me that these groups were simply absorbed. I suspect it has to do with the flat topography of the North China plain which does not allow for easy separation between groups. In South China the Hakka tended to farm the more marginal lands, in particular upland regions.

Some years ago on GNXP Classic, I related what I had been told by a distinguished Chinese medical specialist in Hong Kong, that all of the cases of naso-pharyngeal cancer that he had seen had been in Cantonese patients. He said "...in southern Chinese, in fact in Cantonese." I said "So it's genetic" and he said "Yes, it's genetic." He was clearly in no doubt that Cantonese were not just a dialect and cultural group, but that native Cantonese-speaking people were genetically differentiated from other Han, even other southern Han, based on his practical observations of disease susceptibility.

In southern China today you can still see walled villages populated by e.g. Hakka, although this physical separation/stratification has rapidly begun to disappear within the past generation or so.

By Sandgroper (not verified) on 25 Nov 2009 #permalink

So does this mean that there is no Chinese race now, at least in the European ethno-national sense? It seems the Chinese achieved what neither Caesar, Charlemagne, Bonaparte, or Brussels could. Forging one nation out of a group of disparate (but closely related) peoples.

So does this mean that there is no Chinese race now, at least in the European ethno-national sense?

well, i don't see that as surprising, do you? china's about the scale of continental europe. one important implication of this might be to be a bit cautious about extrapolating samples drawn from particular urban areas to all chinese for medical purposes. american chinese tend to be cantonese and fujianese, so american studies based on chinese americans using our citizens as proxies for chinese are going to have an automatic regional bias.

It seems the Chinese achieved what neither Caesar, Charlemagne, Bonaparte, or Brussels could. Forging one nation out of a group of disparate (but closely related) peoples.

i think to some extent this is wrong in the case of caesar. the peoples of the roman empire did view themselves as romans, and their land as romania. the latinized descendants of gauls in what became france are referred to as "romans" in the literature of the frankish period (similar in spain). what happened is that unlike china europe remained disaggregated after the collapse of rome.

There has been a remarkable continuity of Chinese history, culture and written language, but genes vary with geography at every scale. That is not to say that you cannot distinguish genetically a Chinese population from a population in a different area, just that there is stratification within the population.

http://scienceblogs.com/gnxp/2008/12/genetic_map_of_east_asia.php

http://scienceblogs.com/gnxp/2009/05/genetic_relations_of_different.php

By Sandgroper (not verified) on 25 Nov 2009 #permalink

When the papers refer to different "dialects", they are referring to differences of a type that elsewhere in the world would lead to reference to different "languages". Do I have that right?

By bioIgnoramus (not verified) on 25 Nov 2009 #permalink

yes, dialects are equivalent to languages. or in some cases, as in serbian and croatian, dialects are called languages :-)

bio, I'm not buying into that argument. I will just observe that mutually unintelligible speech is a barrier to exogamy. Except for invading armies, obviously.

By Sandgroper (not verified) on 25 Nov 2009 #permalink