From population genetics to linguistics

The relationship between language families and historical population genetics has a long history. In the 19th and early 20th centuries anthropologists were wont to substitute and synthesize the connections discerned in linguistic relationships with those of presumed biological affinities. This resulted in great hilarity. Older works sometimes labeled the Finns a "Mongoloid" people because of their Uralic language. But once the physical substrate of genetic inheritance (DNA) was ascertained some correspondences did emerge.


The figure to the left is from an L. L. Cavalli-Sforza paper, Genes, peoples, and languages. The correspondence between gene families and language families is clear. From the paper:

Most patterns found in the analysis of human living populations are likely to be consequences of demographic expansions, determined by technological developments affecting food availability, transportation, or military power. During such expansions, both genes and languages are spread to potentially vast areas. In principle, this tends to create a correlation between the respective evolutionary trees. The correlation is usually positive and often remarkably high. It can be decreased or hidden by phenomena of language replacement and also of gene replacement, usually partial, due to gene flow.

Genetic variation and language are both characteristics of individuals & populations. One might imagine that gene flow between groups is modulated by their linguistic affinity, or that linguistic affinity is modulated by gene flow between them. Of late, Cavalli-Sforza's colleague Marcus Feldman has asserted more explicitly that the correlation does indeed emerge out of biases in mating patterns.

Language and genes are passed from parents to offspring. But there are clearly differences in terms of the specific constraints on inheritance. When it comes to genes we have both the Mendelian abstraction as well as DNA as a concrete substrate. Parent-offspring transmission is symmetrical (from both parents), subject to mutation, segregation, recombination, etc. Though there are attempts to model language, to my knowledge there is no comparably robust theoretical understanding of the inheritance of language from parents to offspring, in particular of the biological substrate which acquires language (I do not put the arguments about deep structure in linguistics in the same class as Mendelian and DNA models of genetics).

Of course there is the reality of great differences in transmission of language and genes. In the domain of language horizontal transmission is critical to understanding its distribution & evolution (I am aware that horizontal gene transfer is important in biological evolution, but not so much in the scope and species we're talking about). One may speak a different language from one's parents, because language acquisition and fluency are also dependent on peers in a way that genetic variation is not. Additionally, language transmission from parents need not be symmetrical: one may acquire the language of one parent but not the other. One may speak the same language as one's parents, but with a different accent (that of one's peer group). Interestingly, the exception to this rule of accents is individuals with some socialization dysfunction, such as autism.

There are also similarities between languages and genes. The molecular clock has its analogy in the lexical clock. There is also lexical admixture between languages, for example the heavy load of French-derived terms in modern English, or the influence of Slavic upon the Baltic languages. A new paper in PLoS Biology leans on these last similarities to utilize the Structure framework to flesh out the relationships of the languages of New Guinea & Australia, what was once "Sahul" during the last Ice Age. The authors' summary from Explaining the Linguistic Diversity of Sahul Using Population Models:

About one-fifth of all the world's languages are spoken in present day Australia, New Guinea, and the surrounding islands. This corresponds to the boundaries of the ancient continent of Sahul, which broke up due to rising sea levels about 9000 years before present. The distribution of languages in this region conveys information about its population history. The recent migration of the Austronesian speakers can be traced with precision, but the histories of the Papuan and Australian language speakers are considerably more difficult to reconstruct. The speakers of these languages are presumably descendants of the first migrations into Sahul, and their languages have been subject to many millennia of dispersal and contact. Due to the antiquity of these language families, there is insufficient lexical evidence to reconstruct their histories. Instead we use abstract structural features to infer population history, modeling language change as a result of both inheritance and horizontal diffusion. We use a Bayesian phylogenetic clustering method, originally developed for investigating genetic recombination to infer the contribution of different linguistic lineages to the current diversity of languages. The results show the underlying structure of the diversity of these languages, reflecting ancient dispersals, millennia of contact, and probable phylogenetic groups. The analysis identifies 10 ancestral language populations, some of which can be identified with previously known phylogenetic groups (language families or subgroups), and some of which have not previously been proposed.
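To make the idea of ancestral "populations" a bit more concrete, here is a minimal Python sketch of the kind of admixture model that Structure-style methods rest on. It is not the authors' code and not the Structure program itself. The function names (`simulate_language`, `estimate_membership`) and the toy feature frequencies are my own assumptions for illustration. The real method also infers the population frequencies by Bayesian sampling; this sketch holds them fixed and only estimates one language's membership proportions with a simple EM loop.

```python
import random


def simulate_language(q, freqs, rng):
    """Draw binary structural features for one hypothetical language.

    q      -- membership proportions over K ancestral populations (sums to 1)
    freqs  -- freqs[k][f] is the probability that feature f is present
              in ancestral population k (toy values, strictly between 0 and 1)
    """
    features = []
    for f in range(len(freqs[0])):
        # Each feature is inherited from one ancestral population, chosen
        # with probability given by the membership proportions.
        k = rng.choices(range(len(q)), weights=q)[0]
        features.append(1 if rng.random() < freqs[k][f] else 0)
    return features


def estimate_membership(features, freqs, n_iter=200):
    """Estimate membership proportions by EM, with freqs held fixed."""
    K = len(freqs)
    q = [1.0 / K] * K
    for _ in range(n_iter):
        totals = [0.0] * K
        for f, x in enumerate(features):
            # E-step: how likely is it that feature f came from population k?
            like = [q[k] * (freqs[k][f] if x else 1.0 - freqs[k][f])
                    for k in range(K)]
            norm = sum(like)
            for k in range(K):
                totals[k] += like[k] / norm
        # M-step: new proportions are the average responsibilities.
        q = [t / len(features) for t in totals]
    return q
```

With two sharply differentiated toy populations and a language simulated as a 70/30 mix, `estimate_membership` should recover proportions roughly in that neighborhood. The bars in a Structure-style plot are these per-language proportions, one color per ancestral population.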

Here's a map of Sahul during the Ice Age:


The current consensus seems to be that the modern populations of New Guinea & Australia are descended from the original "Out of Africa" migration which occurred ~50,000 years ago (in particular, the "Southern Route" which swept along the northern fringe of the Indian Ocean). I don't think this should be taken to be the last word though. We know that the dingo arrived from Southeast Asia within the last 10,000 years, so there was always contact between Australia and the islands to the north & west. That said, the dingo mtDNA seems to coalesce into one recent lineage, implying one founding event, which is curiously analogous to the dominant model of Australian settlement.

In any case, the results of this paper are where the action is, so I'll just show you the figures.

Here is a map, with colors illustrating the putative language families:


Here are the results on the map with K = 10. That is, 10 ancestral "populations":


And here is the bar chart, again K = 10 is the primary bar to look at (reedited):


This is just a baby step. Without more utilization of this method we'd probably want to hold off on any new insights. But here is something from the discussion to note:

The results of the structural feature analysis do not of course replace those derived by vocabulary methods of either the traditional or the computational cladistic kinds. Where the cognate-based methods are applicable they yield finer-grained groupings than can likely be achieved by structural data alone, for the principled reason that there is a restricted design space for structural features...But because known families are by-and-large recapitulated by clustering of structural features, it is reasonable to assume that hitherto unrelatable clusters discovered by the algorithm are plausible candidates for genealogical relationships. If further research shows up even a small number of possible cognates, this may be taken as more than just chance similarities.

We believe that the results obtained by this method have important ramifications for population genetic studies. When the data on mtDNA, Y chromosome, and autosomal markers are compared with the linguistic populations identified on the basis of structural features, as was done for example in...for Island Melanesia, we can expect significant progress in our understanding of the early colonization of Sahul.

Utilizing the same method on both genetic and linguistic data should be interesting, and perhaps give us a better fine-grained grasp on the different population-level dynamics of change of these two traits. One should expect that language should separate more sharply across ethno-linguistic boundaries than gene frequencies, so over the short-term one might expect that there should be more gene-flow across boundaries of related than non-related languages. But, deviations from expectation are important, because they might point to more complex and perhaps drastic historical-demographic processes in the distant past.
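A much cruder way to compare the two kinds of data than running Structure on both is a Mantel-style permutation test, which asks whether pairwise genetic distances and pairwise linguistic distances between the same populations are correlated. This is a different technique from the one in the paper, and the sketch below is only illustrative: the function names and any input matrices are hypothetical, and nothing here comes from the authors' data.

```python
import random
from math import sqrt


def pairwise(matrix, order):
    """Upper-triangle distances, with populations relabeled by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mantel(genetic, linguistic, n_perm=999, seed=0):
    """Return (correlation, one-sided permutation p-value).

    Both arguments are symmetric distance matrices (lists of lists)
    over the same populations, in the same order.
    """
    identity = list(range(len(genetic)))
    ling = pairwise(linguistic, identity)
    observed = pearson(pairwise(genetic, identity), ling)

    rng = random.Random(seed)
    order = identity[:]
    hits = 0
    for _ in range(n_perm):
        # Shuffle population labels on one matrix only, which breaks any
        # real association while keeping each matrix's internal structure.
        rng.shuffle(order)
        if pearson(pairwise(genetic, order), ling) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

A high correlation with a small p-value would fit the expectation of more gene flow between related languages. The deviations are the interesting part, and this test says nothing about which populations drive them.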

Citation: Reesink G, Singer R, Dunn M, 2009 Explaining the Linguistic Diversity of Sahul Using Population Models. PLoS Biol 7(11): e1000241. doi:10.1371/journal.pbio.1000241


The Kusunda of Terai valley of Nepal have some linguistic overlap with various Papuan languages, and possibly Ainu.

Proposed high-level and deep-time "clades" in linguistics like Nostratic and Eurasiatic are highly speculative I hear.

Can't they still not figure out things like if Japanese are Korean, or what Dravidian languages are next closely related to, let alone that those huge clades exist!? Only to the level of families such as Indo-European, Austronesian etc. do the phylogenies seem to be very certain; everything else seems wishy-washy.

I don't know that much, but it seems the phylogenies of languages at high levels are far less certain compared to deep-time splits such as knowing the Eukaryotes form a clade (and I really wonder if we'll ever know to that level of precision; I hope we can do one day). Plus, we don't even know if languages have one origin, like life presumably did.

deadpost, yeah. with phylogenetics based on DNA you have various regions of the genome which evolve at different rates and so can be used to inspect at different time depths. such clarity is rarer in linguistics, though i do think that certain classes of words tend to change much more slowly. isn't there something about irregular verbs in that vein?


Finding higher-level language groupings and relationships than the well-established Indo-European / Semitic / Altaic / Finno-Ugric groups is a can of worms, with lots of cranks and lots of very tentative theories by legit scholars. They take several forms: trying to find super-groups like Ural-Altaic (a flop IIRC) or Nostratic, and second, trying either to place isolates into existing groups or to join them together. The isolates of northern and Western Eurasia are few -- Basque, the Caucasian languages (two or three whole families), Burushaski, Yukagir, Gilyak, and maybe one or two others -- and attempts are frequently made to group them or to add some of them to existing groups (e.g. Athabaskan).

I think that the tripartite division of the American languages is solid (Inuit etc. + Athabaskan + everyone else) but it doesn't really say anything about the vast majority of American languages.

In recent studies "areal effects" have been noted: geographically-adjacent but historically unrelated languages swapping traits. You can also hypothesize punctuated equilibria of creolization followed by elaboration which, after 2 or 3 cycles, might make the tracing of relationships difficult. And finally, a lot of languages have been studied only very inadequately, so some of the grand theories have been worked up from defective observations.

There's a hard core of well established research that it would be foolish to doubt, but after that it tails off into conjecture very quickly. Caveat emptor.

By John Emerson (not verified) on 18 Nov 2009

I sort of agree with the previous comment. The association of language with genetics does loosely exist, but it may be just all coincidence.

Anatolian Turks speak a Turkic language, but for their genetics they owe more to the native peoples who preceded the Seljuk and Ottoman Turks than to the Turkic-speaking invaders. Similarly, the Finns and Hungarians speak non-I.E. languages but are not that different from their neighbors, if genetic drift or founder effects are taken into account.

I am European, of the Maltese ethnicity, and the Maltese speak a Semitic language, albeit rather mixed with more normal European languages. I have seen the studies on Maltese people; they are basically Southern European, and probably less "Middle Eastern" or "North African" than their near Southern European neighbors.

I have had some big arguments with bigots who think language and genetics fit together like hand in glove. The connection is loose. I.E languages are originally from Asia, and entered Europe late in its history, and totally subdued all native European languages. Languages like Etruscan have been labeled as foreign, from Asia Minor, and immigrant when the language most likely existed in Asia Minor and Southern Europe long before any Hellene or Italic speaker got off his horse and put foot on European soil. It is the foreigners making the natives the foreigners, like the Anglo-Saxons did to the Welsh, whose name means foreigner in Old English.

My contention is that language groups can only be traced back to about 10 ky. Most haplogroups used to distinguish people are older than 10 ky, usually much older. Existing ethnic groups and nations are all basically whippersnappers compared with the age of a haplogroup. English evolved from Low German with many accretions from Romance languages, even pronouns from other Germanic languages, and is totally a different language to Old English, and modern English are also totally different.

My arguments always centered on the discontinuity of genetics with languages and the discontinuity of modern peoples with their supposed ancestors, which increases with time. European haplogroup R1b was probably at the same frequency 5,000 years ago, but 10,000 years ago it may not have existed in Europe at all. It is believed the I.E speakers were R1b and R1a and entered in the Bronze Age.

Thanks for highlighting Oceanian languages. This part of the world is often ignored. The dingo may have entered only 3,000 years ago, and is tied to the extinction of the Tasmanian devil and Thylacine on the Australian mainland. Aborigines have had contact with Papuans through the Torres Strait islanders, and from various Indonesians who have come to Northern Australian waters to fish.

The Kusunda of Terai valley of Nepal have some linguistic overlap with various Papuan languages,


and possibly Ainu.

Never heard of that. Source, please. Last I checked, the only existing proposal on what Ainu might be was Austric.

Can't they still not figure out things like if Japanese are Korean, or what Dravidian languages are next closely related to, let alone that those huge clades exist!?

Well, it's very hard work (you have to deal with hundreds of languages), and there are only about 5 published cladistic analyses of language families so far.

Ural-Altaic (a flop IIRC)

Yes, and a 19th-century one at that.

Languages like Etruscan have been labeled as foreign, from Asia Minor, and immigrant when the language most likely existed in Asia Minor and Southern Europe long before any Hellene or Italic speaker got off his horse and put foot on European soil.

According to genetics, both the people and the cattle of Tuscany do come from Asia Minor. There, however, Etruscan seems to have been part of a small language family that was most likely already present before Indo-European languages spread there.

By David Marjanović (not verified) on 19 Nov 2009

Ponto wrote "I.E languages are originally from Asia, and entered Europe late in its history."

IE languages have a European origin. This has been the scientific majority opinion for decades.