The relationship between language families and historical population genetics has a long history. In the 19th and early 20th centuries anthropologists were wont to conflate the connections discerned in linguistic relationships with presumed biological affinities. This resulted in great hilarity. Older works sometimes labeled the Finns a “Mongoloid” people because of their Uralic language. But once the physical substrate of genetic inheritance (DNA) was ascertained, some correspondences did emerge.
The figure to the left is from an L. L. Cavalli-Sforza paper, Genes, peoples, and languages. The correspondence between genetic clusters of populations and language families is clear. From the paper:
Most patterns found in the analysis of human living populations are likely to be consequences of demographic expansions, determined by technological developments affecting food availability, transportation, or military power. During such expansions, both genes and languages are spread to potentially vast areas. In principle, this tends to create a correlation between the respective evolutionary trees. The correlation is usually positive and often remarkably high. It can be decreased or hidden by phenomena of language replacement and also of gene replacement, usually partial, due to gene flow.
Genetic variation and languages are both characteristics of individuals & populations. One might imagine that gene flow between groups is modulated by their linguistic affinity, or that their linguistic affinity is modulated by the gene flow between them. Cavalli-Sforza’s colleague Marcus Feldman has of late asserted more explicitly that the correlation does indeed emerge out of biases in mating patterns.
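To make the idea of a gene-language correlation concrete, here is a minimal sketch of one common way to test it: a Mantel permutation test on two distance matrices. This is not the method used in the Cavalli-Sforza paper, which compares trees; it is a swapped-in illustration. The function name `mantel_test` and the five-population toy distances are hypothetical and invented purely for the example.

```python
import numpy as np


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Correlate two symmetric distance matrices with a permutation test.

    Returns the Pearson correlation of the upper-triangle entries and a
    one-sided p-value for the correlation being at least that large.
    """
    a = np.asarray(dist_a, dtype=float)
    b = np.asarray(dist_b, dtype=float)
    n = a.shape[0]
    upper = np.triu_indices(n, k=1)  # each population pair counted once
    observed = np.corrcoef(a[upper], b[upper])[0, 1]

    rng = np.random.default_rng(seed)
    at_least_as_large = 0
    for _ in range(n_perm):
        # Shuffle population labels in one matrix (rows and columns together).
        perm = rng.permutation(n)
        shuffled = a[np.ix_(perm, perm)]
        r = np.corrcoef(shuffled[upper], b[upper])[0, 1]
        if r >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical toy data: five populations placed on a line, once by
# "genetic" position and once by "linguistic" position (made-up numbers).
genetic_pos = np.array([0.0, 1.0, 2.0, 5.0, 6.0])
linguistic_pos = np.array([0.0, 1.5, 2.0, 6.0, 6.5])
genetic_dist = np.abs(genetic_pos[:, None] - genetic_pos[None, :])
linguistic_dist = np.abs(linguistic_pos[:, None] - linguistic_pos[None, :])

r_obs, p_val = mantel_test(genetic_dist, linguistic_dist)
print(f"r = {r_obs:.3f}, p = {p_val:.3f}")
```

The permutation step matters because the entries of a distance matrix are not independent of one another, so an ordinary significance test on the correlation would overstate the evidence.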
Language and genes are passed from parents to offspring. But there are clearly differences in terms of the specific constraints on inheritance. When it comes to genes we have both the Mendelian abstraction as well as DNA as a concrete substrate. Parent-offspring transmission is symmetrical (from both parents), subject to mutation, segregation, recombination, etc. Though there are attempts to model language, to my knowledge there is no comparably robust theoretical understanding of the inheritance of language from parents to offspring, in particular of the biological substrate which acquires language (I do not put the arguments about deep structure in linguistics in the same class as Mendelian and DNA models of genetics).
Of course there is the reality of great differences in the transmission of language and genes. In the domain of language, horizontal transmission is critical to understanding its distribution & evolution (I am aware that horizontal gene transfer is important in biological evolution, but not so much in the scope and species we’re talking about). One may speak a different language from one’s parents because language acquisition and fluency also depend on peers in a way that genetic variation does not. Additionally, language transmission from parents need not be symmetrical; one may acquire the language of one parent but not the other. One may speak the same language as one’s parents, but with a different accent (that of one’s peer group). Interestingly, the exception to this rule of accents is individuals with some socialization dysfunction, such as autism.
There are also similarities between languages and genes. The molecular clock has its analogy in the lexical clock. There is also lexical admixture between languages, for example the heavy load of French-derived terms in modern English, or the influence of Slavic upon the Baltic languages. A new paper in PLoS Biology leans on these last similarities to utilize the Structure framework to flesh out the relationships of the languages of New Guinea & Australia, what was once “Sahul” during the last Ice Age. The authors’ summary from Explaining the Linguistic Diversity of Sahul Using Population Models:
About one-fifth of all the world’s languages are spoken in present day Australia, New Guinea, and the surrounding islands. This corresponds to the boundaries of the ancient continent of Sahul, which broke up due to rising sea levels about 9000 years before present. The distribution of languages in this region conveys information about its population history. The recent migration of the Austronesian speakers can be traced with precision, but the histories of the Papuan and Australian language speakers are considerably more difficult to reconstruct. The speakers of these languages are presumably descendants of the first migrations into Sahul, and their languages have been subject to many millennia of dispersal and contact. Due to the antiquity of these language families, there is insufficient lexical evidence to reconstruct their histories. Instead we use abstract structural features to infer population history, modeling language change as a result of both inheritance and horizontal diffusion. We use a Bayesian phylogenetic clustering method, originally developed for investigating genetic recombination to infer the contribution of different linguistic lineages to the current diversity of languages. The results show the underlying structure of the diversity of these languages, reflecting ancient dispersals, millennia of contact, and probable phylogenetic groups. The analysis identifies 10 ancestral language populations, some of which can be identified with previously known phylogenetic groups (language families or subgroups), and some of which have not previously been proposed.
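For readers who haven't run into Structure-style clustering, here is a toy sketch of the admixture idea behind it: each language is treated as a mixture of K ancestral "populations," each with its own probability of showing a given structural feature. This is not the authors' implementation. Structure proper estimates both the ancestral frequencies and the mixture proportions by Bayesian MCMC, whereas the sketch below assumes the ancestral frequencies are already known and estimates only the mixture proportions with a simple EM loop. The function name `estimate_admixture`, K = 2, and all the numbers are hypothetical.

```python
import numpy as np


def estimate_admixture(features, freqs, n_iter=200):
    """EM estimate of one language's admixture proportions.

    features: length-L array of 0/1 structural features for the language.
    freqs:    (K, L) array; freqs[k, l] is the ASSUMED probability that
              feature l is present in ancestral population k.
    Returns a length-K vector of proportions that sums to 1.
    """
    features = np.asarray(features, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    k_pops = freqs.shape[0]
    # Probability of each observed feature value under each ancestral population.
    lik = np.where(features == 1, freqs, 1.0 - freqs)  # shape (K, L)
    q = np.full(k_pops, 1.0 / k_pops)  # start from equal proportions
    for _ in range(n_iter):
        # E-step: posterior probability that feature l derives from population k.
        weighted = q[:, None] * lik
        resp = weighted / weighted.sum(axis=0, keepdims=True)
        # M-step: proportions are the average responsibility across features.
        q = resp.mean(axis=1)
    return q


# Hypothetical ancestral feature frequencies for K = 2 and six features.
ancestral_freqs = np.array([
    [0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.9, 0.9, 0.9],
])
language_a = np.array([1, 1, 1, 0, 0, 0])  # looks like population 0
language_b = np.array([1, 1, 1, 1, 1, 1])  # shares features with both

q_a = estimate_admixture(language_a, ancestral_freqs)
q_b = estimate_admixture(language_b, ancestral_freqs)
print("language A proportions:", np.round(q_a, 3))
print("language B proportions:", np.round(q_b, 3))
```

The vector of proportions for each language is what gets drawn as one stacked column in a Structure-style bar chart, with one color per ancestral population.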
Here’s a map of Sahul during the Ice Age:
The current consensus seems to be that the modern populations of New Guinea & Australia are descended from the original “Out of Africa” migration which occurred ~50,000 years ago (in particular, the “Southern Route” which swept along the northern fringe of the Indian Ocean). I don’t think this should be taken to be the last word though. We know that the dingo arrived from Southeast Asia within the last 10,000 years, so there was always contact between Australia and the islands to the north & west. That said, the dingo mtDNA seems to coalesce into one recent lineage, implying one founding event, which is curiously analogous to the dominant model of Australian settlement.
In any case, the results of this paper are where the action is, so I’ll just show you the figures.
Here is a map, with colors illustrating the putative language families:
Here are the results on the map with K = 10. That is, 10 ancestral “populations”:
And here is the bar chart, again K = 10 is the primary bar to look at (reedited):
This is just a baby step. Until the method has been applied more widely, we’d probably want to hold off on drawing any new insights. But here is something from the discussion to note:
The results of the structural feature analysis do not of course replace those derived by vocabulary methods of either the traditional or the computational cladistic kinds. Where the cognate-based methods are applicable they yield finer-grained groupings than can likely be achieved by structural data alone, for the principled reason that there is a restricted design space for structural features…But because known families are by-and-large recapitulated by clustering of structural features, it is reasonable to assume that hitherto unrelatable clusters discovered by the algorithm are plausible candidates for genealogical relationships. If further research shows up even a small number of possible cognates, this may be taken as more than just chance similarities.
We believe that the results obtained by this method have important ramifications for population genetic studies. When the data on mtDNA, Y chromosome, and autosomal markers are compared with the linguistic populations identified on the basis of structural features, as was done for example in…for Island Melanesia, we can expect significant progress in our understanding of the early colonization of Sahul.
Utilizing the same method on both genetic and linguistic data should be interesting, and perhaps give us a better fine-grained grasp on the different population-level dynamics of change of these two traits. One would expect language to separate more sharply across ethno-linguistic boundaries than gene frequencies do, so over the short term one might expect more gene flow across boundaries between related languages than between unrelated ones. But deviations from expectation are important, because they might point to more complex and perhaps drastic historical-demographic processes in the distant past.
Citation: Reesink G, Singer R, Dunn M, 2009 Explaining the Linguistic Diversity of Sahul Using Population Models. PLoS Biol 7(11): e1000241. doi:10.1371/journal.pbio.1000241