How Columbus was not a seer

A week ago I pointed out that in some visualizations of worldwide population variation South Asians & mestizos seem to overlap with each other to a great extent. The reason for this is that both populations can be modeled as admixtures between two separate, but related, populations. Mestizos are the products of pairings between Europeans and indigenous American populations, while South Asians seem to be a stabilized hybrid population which emerged from the fusion of a West Eurasian population (closely related to Europeans) and an East Eurasian population (distantly related to East Asians). The East Eurasian ancestors of South Asians may be only distantly related to indigenous American populations, but on a worldwide scale the relationship is relatively close (i.e., compared to Europeans vs. indigenous Americans). So when mapped onto a plot of genetic variation incorporating worldwide populations, South Asians and mestizos naturally resemble each other. That said, a commenter observes:

Great example of how two dimensions lose information.

Given how different the two populations are genetically, guarantee that the third component separates them pretty cleanly.

Correct. A new paper illustrates this. Magnitude of Stratification in Human Populations and Impacts on Genome Wide Association Studies:

Genome-wide association studies (GWAS) may be biased by population stratification (PS). We conducted empirical quantification of the magnitude of PS among human populations and its impact on GWAS. Liver tissues were collected from 979, 59 and 49 Caucasian Americans (CA), African Americans (AA) and Hispanic Americans (HA), respectively, and genotyped using Illumina650Y (Ilmn650Y) arrays. RNA was also isolated and hybridized to Agilent whole-genome gene expression arrays. We propose a new method (i.e., hgdp-eigen) for detecting PS by projecting genotype vectors for each sample to the eigenvector space defined by the Human Genetic Diversity Panel (HGDP). Further, we conducted GWAS to map expression quantitative trait loci (eQTL) for the ~40,000 liver gene expression traits monitored by the Agilent arrays. HGDP-eigen performed similarly to the conventional self-eigen methods in capturing PS. However, leveraging the HGDP offered a significant advantage in revealing the origins, directions and magnitude of PS. Adjusting for eigenvectors had minor impacts on eQTL detection rates in CA. In contrast, for AA and HA, adjustment dramatically reduced association findings. At an FDR = 10%, we identified 65 eQTLs in AA with the unadjusted analysis, but only 18 eQTLs after the eigenvector adjustment. Strikingly, 55 out of the 65 unadjusted AA eQTLs were validated in CA, indicating that the adjustment procedure significantly reduced GWAS power. A number of the 55 AA eQTLs validated in CA overlapped with published disease associated SNPs. For example, rs646776 and rs10903129 have previously been associated with lipid levels and coronary heart disease risk, however, the rs10903129 eQTL was missed in the eigenvector adjusted analysis.

The main point of the paper is to smoke out population substructure which might generate false positives in health-related genome-wide association studies. The problem is pretty obvious. Imagine you have a medical study with a lot of blacks and whites, and you just assume they're all genetically basically the same. Then you look for associations of particular genetic variants within the population which has disease X. Of course, it could be that blacks or whites tend to have more of disease X than the other population, and it turns out that blacks and whites also tend to differ on a whole lot of genes. Any variant which is simply more common in the higher-risk group will then look associated with disease X, even if it has nothing to do with it. Modern human population genetics might have "disproved race," but it sure is very interested in "population substructure."
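To make the confounding concrete, here is a minimal simulation sketch. Every number in it is made up for illustration (the group sizes, allele frequencies and prevalences are assumptions, not values from the paper): the variant has no effect on disease in either group, yet pooling the groups produces a clear genotype-disease correlation.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_group(rng, n, allele_freq, prevalence):
    """Draw genotypes (0, 1 or 2 copies) and disease status independently."""
    genotypes = [(rng.random() < allele_freq) + (rng.random() < allele_freq)
                 for _ in range(n)]
    disease = [int(rng.random() < prevalence) for _ in range(n)]
    return genotypes, disease


rng = random.Random(0)
# Hypothetical groups: they differ in allele frequency AND in prevalence,
# but within each group genotype and disease are drawn independently.
g_a, d_a = simulate_group(rng, 5000, allele_freq=0.2, prevalence=0.10)
g_b, d_b = simulate_group(rng, 5000, allele_freq=0.8, prevalence=0.30)

r_within_a = pearson(g_a, d_a)
r_within_b = pearson(g_b, d_b)
r_pooled = pearson(g_a + g_b, d_a + d_b)

print(f"within group A: {r_within_a:+.3f}")
print(f"within group B: {r_within_b:+.3f}")
print(f"pooled:         {r_pooled:+.3f}")
```

Within each group the correlation should hover near zero, while the pooled correlation is clearly positive. That pooled signal is produced entirely by group membership, which is the kind of artifact that adjusting for eigenvectors is meant to remove.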

Patterns of between-population variation can be visualized by extracting the independent dimensions of variance and plotting them against each other. Generally the charts I post on this illustrate the two dimensions which can explain the most variance in the data set (the allele frequencies across all the SNPs in this case), principal components 1 and 2. But the comment above highlights that there are many other dimensions, though they explain less of the variance.
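For readers who want to see what "extracting the independent dimensions of variance" amounts to, here is a rough sketch using a plain SVD in NumPy. The genotype matrix is simulated toy data (three hypothetical groups, 500 SNPs), and the function name `pca` is my own; real analyses typically also normalize each SNP and use dedicated software.

```python
import numpy as np


def pca(genotypes):
    """PCA of a samples x SNPs matrix via SVD of the column-centered data.

    Returns per-sample scores, SNP loadings (one row per component), and
    the fraction of total variance explained by each component.
    """
    centered = genotypes - genotypes.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                      # sample coordinates on each PC
    explained = s**2 / np.sum(s**2)     # variance share, largest first
    return scores, vt, explained


rng = np.random.default_rng(0)
# Toy data (an assumption for illustration): three groups of 40 samples,
# each group with its own allele frequency at every SNP.
freqs = rng.uniform(0.1, 0.9, size=(3, 500))
genotypes = np.vstack(
    [rng.binomial(2, f, size=(40, 500)) for f in freqs]
).astype(float)

scores, loadings, explained = pca(genotypes)
print("variance share of the first four PCs:", np.round(explained[:4], 3))
```

With three groups, the first two components should soak up the between-group differences and the rest mostly reflect noise among individuals. The components always come out ordered by the variance they explain, which is why PC 1 and PC 2 are the default pair to plot.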

One issue that the authors of the above paper pinpoint is that the nature of these dimensions is sensitive to the populations which you include in your original data set to generate them. They distinguish here between the dimensions generated from the full HGDP data set, which includes ~50 world populations, and visualizations which rely only on one population. In this study they project their own samples of European, African and Hispanic Americans onto the dimensions extracted from the HGDP data set, and also onto dimensions generated from the populations themselves. As an example, consider Hispanic Americans projected upon the dimensions of variation constructed from Asians, Africans and Europeans, or Hispanic Americans projected upon the dimensions of variation extracted from only the variance extant within their own population. From what I could tell they actually didn't find that correcting for total genome variation using these two methods was particularly helpful in generating greater clarity as to the role of population substructure in producing false positives. So let's focus on the visualizations, which go back to the title of the post.
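The idea of projecting new samples onto axes defined by a reference panel can be sketched in a few lines. This is not the paper's hgdp-eigen code, just a generic illustration under my own assumptions: the reference panel is two simulated populations, the study samples are a simulated 50/50 admixture of them, and `fit_reference_axes` and `project` are hypothetical helper names.

```python
import numpy as np


def fit_reference_axes(reference, n_components=2):
    """Learn PCA axes from a reference panel (samples x SNPs)."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(samples, mean, axes):
    """Place samples in the reference space.

    Samples are centered with the REFERENCE mean, then dotted with the
    reference axes, so the study samples never influence the axes.
    """
    return (samples - mean) @ axes.T


rng = np.random.default_rng(1)
n_snps = 400
# Two simulated reference populations with independent allele frequencies.
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = rng.uniform(0.1, 0.9, n_snps)
ref_a = rng.binomial(2, freq_a, size=(50, n_snps))
ref_b = rng.binomial(2, freq_b, size=(50, n_snps))
reference = np.vstack([ref_a, ref_b]).astype(float)

# Simulated study samples: allele frequencies halfway between A and B.
admixed = rng.binomial(2, 0.5 * (freq_a + freq_b), size=(30, n_snps))

mean, axes = fit_reference_axes(reference)
pc1_a = project(ref_a, mean, axes)[:, 0].mean()
pc1_b = project(ref_b, mean, axes)[:, 0].mean()
pc1_admixed = project(admixed, mean, axes)[:, 0].mean()
print(f"PC 1 centroids: A={pc1_a:.2f}  B={pc1_b:.2f}  admixed={pc1_admixed:.2f}")
```

The admixed samples should land between the two reference centroids on PC 1. The "self" alternative would be to call `fit_reference_axes` on the study samples themselves, in which case the axes only describe variation within that one sample and say nothing about where it sits relative to other populations.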

The first chart has PC 1 & PC 2 from the HGDP populations, with their sample of about 50 African Americans projected onto it:

i-4ac43dd7789af8bc31fb3dd3f844cced-col1.png

Pretty much zero surprise here. I would be willing to assume that the self-identified African American who clusters with Europeans is an error of some sort (e.g., a sample mix-up), but other studies show the same tendency quite frequently. I conclude then that there are actually people who are inadvertently "passing" as black, at least culturally (on the outside they probably look whiter than G. K. Butterfield).

The second chart now has PC 1 & PC 3. So the dimension of variation which explains the second largest proportion of variance has now been replaced by the dimension which explains the third largest proportion.

i-4e2d228c070864e0b31fb7579588c99c-col2.png

Now Native Americans are distinct from East Asians in the HGDP sample. This is because of PC 3. This goes to the commenter's point that looking at more dimensions of variation gives us a better sense of real population differences.

Jumping back to PC 1 and 2, but with Hispanics projected onto the HGDP generated space:

i-72d7d2bf5667a2e788981ede88e47366-col3.png

I don't know the provenance of the Hispanics, but it looks to me that they're likely to include many Puerto Ricans, seeing as there's a large amount of African admixture here. Nevertheless, you still see the overlap between Hispanics and South Asians that you did with the Gujarati-Mexican comparison, though attenuated. So let's look at PC 1 & PC 3.

i-2f6c7a3af8a8af38638affd08b3e4ff8-col4.png

And yes, all of a sudden mestizos and South Asians do not overlap, and in fact South Asians are further from mestizos than Europeans or Middle Easterners. One could have predicted this from the previous chart.
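A toy version of what is going on here, with made-up coordinates rather than anything taken from the paper: two groups that share a center on PC 1 and PC 2 but sit apart on PC 3. The helper `centroid_gap` is a hypothetical name for a very crude separation measure.

```python
import numpy as np


def centroid_gap(group_a, group_b, dims):
    """Distance between group centroids using only the chosen PC columns."""
    dims = list(dims)
    diff = group_a[:, dims].mean(axis=0) - group_b[:, dims].mean(axis=0)
    return float(np.linalg.norm(diff))


rng = np.random.default_rng(2)
# Made-up PC coordinates (columns are PC 1, PC 2, PC 3). Both groups share
# the same center on PC 1 and PC 2 and differ only on PC 3.
group_x = rng.normal(loc=[0.0, 0.0, 1.0], scale=0.1, size=(200, 3))
group_y = rng.normal(loc=[0.0, 0.0, -1.0], scale=0.1, size=(200, 3))

gap_pc12 = centroid_gap(group_x, group_y, dims=(0, 1))
gap_pc13 = centroid_gap(group_x, group_y, dims=(0, 2))
print(f"gap on PC 1 & PC 2: {gap_pc12:.3f}")
print(f"gap on PC 1 & PC 3: {gap_pc13:.3f}")
```

On the PC 1 & PC 2 plane the two clouds sit on top of each other; swap PC 2 for PC 3 and they pull apart. Nothing about the groups changed, only which slice of the variation is being looked at.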

Finally, I want to round out the inspection by looking at two charts which project European Americans onto PC 1, PC 2 and PC 3. The European Americans are black points.

i-5f7e3fdf298582384c45cbcdbe1eeef0-col5.png

i-2734ae19ccaf12451614dfecbe381824-col6.png

Note that European American outliers seem to have a bias toward drifting in the direction of the Native Americans and African Americans. I don't discount the possibility of errors here, but it is important to note that deviations away from the HGDP European cluster in the last chart are toward the two groups which European Americans have historically been in contact with in North America.

Note: The subjects specific to this study seem to have been resident in the eastern half of the United States. This would tend to support my supposition that they are less likely to be Mexican Americans, and more likely to be Puerto Rican or Cuban Americans, if they were Hispanic.

Citation: Hao K, Chudin E, Greenawalt D, Schadt EE (2010) Magnitude of Stratification in Human Populations and Impacts on Genome Wide Association Studies. PLoS ONE 5(1): e8695. doi:10.1371/journal.pone.0008695

I believe that in some contexts this is called "churning" -- the loss of information and approach to entropy. I've talked to linguists about whether something like this could happen to a language, making its antecedents unrecoverable. Two processes that happen to language are creolization and the development of a "Sprachbund".

In the first, a language is stripped down to a minimum for contact with foreigners and made into a pidgin trade language which usually has vocabulary from two or several languages and a structure which is characteristic of most pidgins, but not necessarily of either donor language. The pidgin then becomes a creole when residents of the trade center grow up speaking mostly pidgin and the language develops beyond its rudimentary beginnings.

The Sprachbund is the trading back and forth of features between neighboring languages which are not historically related (from different language groups). One example is Romanian, which has picked up features from the surrounding Slavic languages -- the Balkan Sprachbund. Another is East Asia, where it is now thought that languages from several different unrelated or distantly related language groups (Austronesian, Tibeto-Burman, and maybe others) have picked up enough common features to make up a sort of adoptive family. The Sprachbund wikis are worth reading.

My theory was that if an unwritten language has endured, say, two cycles each of creolization and sprachbundization during (say) three thousand years (not impossible), its ancestors before that time might be unrecoverable. It would essentially be a new isolate (as opposed to a survivor isolate).

The two processes are not unrelated, either. Every creole language would be part of a sprachbund comprised by its neighboring languages.

There's an example of this in the novel "The Good Soldier Schweik". Over the period of a century or more educated Czechs had picked up a German-type pronoun usage, whereas uneducated Czechs tended to stick to the Czech form, and nationalist Czechs insisted on the old form.

These factors might throw a monkey-wrench into attempts to build superfamilies larger than the known families (most famously Nostratic). They don't in any way discredit the established families (Indo-European, Semitic, Bantu, Malayo-Polynesian) but make work going beyond them difficult or impossible. The Turkish-Mongol-Manchu family has been questioned, though, and the language relationships of SE Asia are still up in the air.

By John Emerson (not verified) on 13 Jan 2010

Why don't people also do 3D plots? With computers, it really isn't that difficult.

And if the data isn't too dense, then taking a snapshot of the 3D plot from a small set of different angles should be quite telling. (Or, you could make interactive 3D plots, that could rotate, hide and show data, etc.)

they often don't look that good on 2-D paper i think. OTOH, seems like there'd be a good place for it in the supplemental information with visualization software.

Maple can make 3d scatter plots. You can move them around with the mouse to look at them from different angles.