Due to an error at the printer, the original article page range (620–623) was incorrect. The correct pagination is 623–626 and is reflected in this updated article.
The demographic impact of the origin of agriculture in Europe has been a long-standing area of interest and controversy in human genetics. For contemporary students of human population genetics, reviewing past work on this question can serve as a wonderful case study that illuminates different approaches that one can take to study historical processes by using genetics and can highlight the inherent challenges. One chapter in that history involves Robert Sokal's contributions, including his critique of synthetic maps.
Robert Sokal became keenly interested in this question, as well as other potential factors structuring genetic diversity in Europe, and he worked with great rigor and insight. His work spanned genetics (e.g., Sokal et al. 1991), linguistics (e.g., Sokal et al. 1990), and also included the analysis of a large-scale ethnohistorical database that he created (e.g., Sokal et al. 1991). Whereas most of his work on European human genetic diversity took place in the late 1980s and early 1990s, in 1999 he published a short piece in Human Biology revisiting and critiquing the analysis of synthetic maps (maps based on principal components analysis, PCA) that Cavalli-Sforza and his colleagues started using in the late 1970s (Cavalli-Sforza et al. 1994; Menozzi et al. 1978; Piazza et al. 1991). This made for a lively exchange in the pages of Human Biology. The original piece (Sokal et al. 1999a) was published with a response from Cavalli-Sforza's group (Rendine et al. 1999), and later in the same year Sokal published a letter critiquing the Rendine et al. (1999) response (Sokal et al. 1999b).
Before the 1999 exchange, Sokal's and Cavalli-Sforza's groups had agreed, in separate papers, that population genetic data from Europe support the existence of clinal patterns in allele frequencies that align with the expansion of agriculture, and that such clinal patterns support the demic diffusion hypothesis. Cavalli-Sforza's group argued this point most strongly by using synthetic maps, showing that in their data the PC1 map of Europe has a northwest/southeast gradient that mirrors the expansion of agriculture (e.g., Menozzi et al. 1978). In contrast, Sokal's group used an analysis based on distance metrics (Sokal et al. 1991). Specifically, they assessed how genetic distances correlate with the timing of the origin of agriculture and found a correlation, even after accounting for the partial correlation due to geographic distance. So, whereas both groups ultimately agreed in their conclusions regarding clines, they differed in how they reached those conclusions.
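To make the distance-based approach concrete, a minimal sketch of a partial correlation between distance matrices, with a permutation test, is given here. It is an illustration only, not a reconstruction of the published analysis: the data are simulated, and the function names (`upper`, `partial_corr`, `partial_mantel`) and the choice of a Mantel-style permutation scheme are assumptions.

```python
import numpy as np

def upper(m):
    """Return the strict upper triangle of a square matrix as a vector."""
    i, j = np.triu_indices_from(m, k=1)
    return m[i, j]

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

def partial_mantel(gen, agr, geo, n_perm=999, seed=0):
    """Partial correlation of genetic and agricultural-timing distances,
    controlling for geographic distance. Significance is assessed by
    permuting the population labels of the genetic matrix."""
    rng = np.random.default_rng(seed)
    y, z = upper(agr), upper(geo)
    observed = partial_corr(upper(gen), y, z)
    n = gen.shape[0]
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        if partial_corr(upper(gen[np.ix_(p, p)]), y, z) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Simulated (hypothetical) data: 25 populations on a unit square.
rng = np.random.default_rng(1)
n = 25
coords = rng.uniform(0, 1, size=(n, 2))
arrival = coords[:, 0] + 0.3 * rng.standard_normal(n)  # assumed arrival times
geo = np.linalg.norm(coords[:, None] - coords[None], axis=2)
agr = np.abs(arrival[:, None] - arrival[None])
noise = 0.05 * rng.standard_normal((n, n))
noise = (noise + noise.T) / 2                           # keep matrix symmetric
gen = 0.5 * geo + 0.5 * agr + noise                     # assumed genetic distances
np.fill_diagonal(gen, 0.0)

r_partial, p_value = partial_mantel(gen, agr, geo)
```

In this simulation the genetic distances are built to depend on both geography and arrival time, so the partial correlation should remain clearly positive after geographic distance is controlled for.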
In Sokal et al.'s 1999 critique of synthetic maps, the focus is on a more general problem in population genetic studies: the incompleteness of most samples. For logistical reasons, it is challenging to get a perfectly uniform sampling over the study area. The dataset that both Sokal's and Cavalli-Sforza's groups analyzed had sampling locales that were geographically clustered and intervening areas that went unsampled. As Sokal highlighted, Cavalli-Sforza's team dealt with the incomplete data by spatially interpolating the observed allele frequencies onto a complete grid across Europe and then running the PCA on the gridded values to produce the synthetic map. This approach falls into a category sometimes referred to by statisticians as using "pseudodata," where some initial data are used to infer an intermediate variable, and then analysis is run on those intermediate values. The intermediate variables are pseudodata in the sense that they are treated as real data when they are not directly observed.
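The interpolate-then-PCA pipeline described above can be sketched in a few lines. This is only an illustration under stated assumptions: the sampling locations and allele frequencies are simulated, inverse-distance weighting stands in for whatever interpolation scheme was actually used, and the function names (`idw_interpolate`, `pca_scores`) are hypothetical.

```python
import numpy as np

def idw_interpolate(sample_xy, sample_freqs, grid_xy, power=2.0):
    """Inverse-distance-weighted interpolation (an assumed stand-in method).
    sample_xy: (n, 2), sample_freqs: (n, L), grid_xy: (g, 2) -> (g, L)."""
    d = np.linalg.norm(grid_xy[:, None, :] - sample_xy[None, :, :], axis=2)
    w = 1.0 / np.maximum(d, 1e-9) ** power   # avoid division by zero
    w /= w.sum(axis=1, keepdims=True)        # weights sum to 1 per grid point
    return w @ sample_freqs

def pca_scores(X, n_components=1):
    """PCA via SVD of the column-centered matrix; returns the scores and
    the fraction of variance explained by each retained component."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]

rng = np.random.default_rng(0)

# Geographically clustered sampling: three clusters of 10 locales each.
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
sample_xy = np.vstack([c + 0.05 * rng.standard_normal((10, 2)) for c in centers])

# Simulated allele frequencies at 20 loci, independent across locales.
sample_freqs = rng.uniform(0.1, 0.9, size=(30, 20))

# Complete 15 x 15 grid covering the study area.
gx, gy = np.meshgrid(np.linspace(0, 1, 15), np.linspace(0, 1, 15))
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])

# Step 1: interpolate to the grid (these values are the "pseudodata").
grid_freqs = idw_interpolate(sample_xy, sample_freqs, grid_xy)

# Step 2: run PCA on the gridded values; PC1 scores form the synthetic map.
scores, explained = pca_scores(grid_freqs)
pc1_map = scores[:, 0].reshape(gx.shape)
```

Note that the PCA here operates on 225 gridded rows even though only 30 locales were observed, and the simulated frequencies are independent across locales, so any spatial structure in `pc1_map` comes from the interpolation step rather than from the sampled data.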
The problem of taking such an approach in a spatial context is not simply that the underlying uncertainty in downstream inferences may be underrepresented. As Sokal and colleagues made clear in their paper, spatial interpolation may falsely create autocorrelation in the data, and, "The extra spatial autocorrelation falsely enhances or distorts true trends in such data." To quantify the effect, they conducted numerical experiments with data...