Population Structure Analyses Provide Insight into the Source Populations Underlying Rural Isolated Communities in Illinois
We have previously hypothesized that relatively small and isolated rural communities may experience founder effects, defined as the genetic ramifications of small population sizes at the time of a community's establishment. To explore this, we used an Illumina Infinium Omni2.5Exome-8 chip to collect data from 157 individuals from four Illinois communities, three rural and one urban. Genetic diversity estimates of 999,259 autosomal markers suggested that the reduction in heterozygosity due to shared ancestry was approximately 0, indicating a randomly mating population. An eigenanalysis, which is similar to a principal component analysis but run on a genetic coancestry matrix, conducted in the SNPRelate R package revealed that most of these individuals formed one cluster, with a few putative outliers obscuring population variation. An additional eigenanalysis on the same markers in a combined data set including the 2,504 individuals in the 1000 Genomes database found that most of the 157 Illinois individuals clustered into one group in close proximity to individuals of European descent. A final eigenanalysis of the Illinois individuals with the 503 individuals of European descent (within the 1000 Genomes Project) revealed two clusters of individuals and likely two source populations; one British and one consisting of multiple European subpopulations. We therefore demonstrate the feasibility of examining genetic relatedness across Illinois populations and assessing the number of source populations using publicly available databases. When assessed, population structure information can contribute to the understanding of genetic history in rural populations.
Population Structure, Genome-Wide Markers, Source Populations, Founder Effects, Rural Communities
Two key characteristics of many rural communities in the US Midwest are that they were founded several hundred years ago and that little migration has occurred compared with similar communities in Africa, Asia, and Europe (described in Jenkins et al. 2016). While many non-genetic factors may explain a substantial amount of increased incidence of certain diseases in these rural communities (for specific examples, see Befort et al. 2012; Hines and Markossian 2012; Henry et al. 2014), quantification of a possible genetic predisposition to diseases in such communities could assist efforts to account for and minimize disease risk. It is therefore critical to compare and contrast genetic characteristics of rural populations to those from urban populations. This will particularly enable the testing of our hypothesis that small and isolated rural communities may experience genetic founder effects to a greater extent than their more urban peers (Jenkins et al. 2016). Such founder effects may influence disease susceptibility and have long-lasting impacts (Rudan 1999). We hypothesize that a small town, founded by a small number of individuals and relatively geographically isolated, can remain affected by the initial founder effect over hundreds of years. Similar examples have been observed previously (e.g., the island of Sardinia), where geography presents a physical barrier to travel (Portas et al. 2010).
Researchers can use genetic data to estimate how closely related individuals in a population are to each other, as well as to determine if members of a rural community have a single or multiple source population(s) (the location of the population's origin; Falush et al. 2003; Wang et al. 2007). Determining if there is more than one source population is an important step for examining population structure; multiple source populations would suggest higher initial genetic diversity than a single source population and minimize any impacts of a founder effect. The ability to use genetic data to quantify subpopulation structure is an important factor in population studies (Wacholder et al. 2000; Thomas and Witte 2002; Campbell et al. 2005). Population structure analyses can be performed with large numbers of single nucleotide polymorphisms (SNPs) using small amounts of DNA and commercially available SNP chips. Given the use of genetic data from such chips in previous research (Vaags et al. 2012; Terao et al. 2013; De Vivo et al. 2014; Mayba et al. 2014; Machiela et al. 2016), it appears that they are well suited for quantifying subpopulation structure in rural isolated populations and hence provide insight into the impact of founder effects and isolation on current community genetic diversity.
Genome-wide markers obtained from SNP chips can also be used to obtain measures of genetic diversity. Such measures include average gene diversity over loci, which estimates overall population diversity (Nei 1987). Population similarity can be measured using Wright's indices, including Wright's inbreeding coefficient, FIS, which examines the reduction in heterozygosity in a population due to shared ancestry (Wright 1950). This measure can help estimate the relatedness of individuals within a population. Typical values of FIS in European populations have been reported in German populations (–0.0010 to 0.0108) (Steffens et al. 2006) and several Iberian populations: Basques (0.0000), Navarre (0.015), and Pass Valley (0.0144) (Cardoso et al. 2017). Thus, measures of average gene diversity over loci and FIS could indicate if rural populations have less diversity and/or appear to exhibit genetic drift, including a genetic bottleneck or founder effect, compared to other world populations.
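To make the quantity concrete, the following is a minimal sketch of FIS at a single biallelic locus, computed as one minus the ratio of observed to expected heterozygosity. The function name and the genotype counts are hypothetical illustrations, and this simple per-locus form is an assumption for exposition; it is not the estimator implemented in any particular software package.

```python
def f_is(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Per-locus inbreeding coefficient, F_IS = 1 - Ho / He.

    n_aa, n_ab, n_bb are counts of the three genotypes at one
    biallelic locus (hypothetical input format).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Frequency of allele A among the 2n sampled gene copies.
    p = (2 * n_aa + n_ab) / (2 * n)
    # Expected heterozygosity under random mating.
    h_exp = 2 * p * (1 - p)
    if h_exp == 0:
        raise ValueError("locus is monomorphic; F_IS is undefined")
    # Observed heterozygosity.
    h_obs = n_ab / n
    return 1 - h_obs / h_exp


# Genotypes in Hardy-Weinberg proportions give a value of zero.
print(f_is(25, 50, 25))
# A deficit of heterozygotes gives a positive value.
print(f_is(35, 30, 35))
# An excess of heterozygotes gives a negative value.
print(f_is(15, 70, 15))
```

Under this definition, values near 0 are what random mating predicts, positive values reflect fewer heterozygotes than expected, and negative values reflect more.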
Beyond measuring average gene diversity and FIS, we speculate that another critical analysis leading to accurate quantification of subpopulation structure in rural populations would be to compare their genetic relationships with various populations throughout the world. Such an analysis could facilitate the identification of source populations and provide insight into the presence of founder effects. The undertaking of such an endeavor is now possible given the availability of whole-genome sequenced data sets such as those from the 1000 Genomes Project (1KGP; 1000 Genomes Project Consortium 2015), POPRES (Nelson et al. 2008), AncestryDNA.com, the Human Genome Diversity Project (Cann et al. 2002), and HapMap projects (International HapMap Consortium 2003). Genome-wide markers segregating in both the rural populations and these whole-genome sequenced data sets could then be analyzed to quantify genetic relationships and identify source populations. Approaches such as STRUCTURE (Pritchard et al. 2000), principal component analysis (Price et al. 2006), and ADMIXTURE (Alexander et al. 2009) are adequate for using genome-wide markers to infer which subpopulations are present in the resulting combined data sets. However, advances in methodologies, including the eigenanalysis approach of Zheng and Weir (2016), now make it possible to characterize which ancestral populations underlie individuals living in rural communities by directly incorporating into the calculations the probability of markers being identical by descent.
The purpose of this study was to examine whole-genome SNP data from individuals from three rural and one urban population in Illinois, USA, and characterize their genetic properties, including genetic diversity and relatedness. To achieve this, we characterized the genetic properties of these individuals and then compared them to the 1KGP database. We hypothesized that such an assessment could shed light on potential founder effects and suggest genetic differentiation from more urban populations.
Materials and Methods
Illinois IsoPop Data Set
The individuals comprising the Isolated Populations Project (IsoPop) data set, as well as the methods used to recruit them, have been described elsewhere (Dean et al. 2017). Briefly, 176 individuals were recruited from three rural communities (70 individuals from community 1, 30 from community 2, and 41 from community 3) and one urban community (35 individuals; community 4) in Illinois. The three rural communities were thought to have been settled in the past 300 years and are relatively isolated (Jenkins et al. 2016); they are between 100 and 400 miles from one another, and the nearest urban center to each community is located between 30 and 60 miles away (Wiley Jenkins, pers. comm., September 14, 2019). In addition to providing genealogical information and saliva samples, the participants took surveys and engaged in community forums. The genealogy information was used to remove individuals that were first-degree relatives with an already-recruited participant so as to not artificially inflate the degree of relatedness within the groups. This project was approved by the Southern Illinois University School of Medicine institutional review board (Springfield Committee for Research Involving Human Subjects; no. 15-328), and all participants provided informed consent.
DNA Extraction and Marker Identification of IsoPop Individuals
Extraction of DNA was carried out using an Oragene prepIT-L2P kit (DNA Genotek) following the standard protocol with a few modifications. Incubation occurred in a heat block for 2–24 hours (protocol suggested 2 hours of incubation). Rehydration of the DNA pellets occurred by incubating at 50°C for 1 hour or more as needed. Sample concentration was assessed using the Qubit assay (ThermoFisher). The average DNA concentration obtained was 89.43 µg/ml, with a range of 0.281–500 µg/ml. All samples, their population, and DNA concentration are listed in Supplementary Table S1.
Samples were aliquoted into separate tubes and taken to the Keck Biotechnology Sequencing Center at the University of Illinois at Urbana-Champaign. A water sample was included in the run to assess contamination, had a call rate of 0.4522, and was removed from analyses. Next, DNA samples were run on Illumina Infinium Omni2.5Exome-8 BeadChips (Illumina Inc., San Diego, CA, USA) according to the Illumina LCG Assay Protocol (part no. 15023139, rev. D). The BeadChips were scanned on the Illumina iScan to genotype 2,612,357 markers from the human genome. Sample results were viewed in Genome Studio, and the "positive/negative" column was exported using a Dell PC with 64 GB RAM. We removed a total of 19 individuals that either had a call rate < 0.90, as suggested by other studies (Verdu et al. 2014), or were first-degree relatives to another individual (as reported by genealogical data), resulting in a total of 157 IsoPop individuals that were analyzed.
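The sample-level call-rate filter can be sketched as follows. This is an illustration only: the data layout (a dictionary of per-sample genotype lists with None marking a no-call), the sample names, and the function names are all hypothetical, and the actual filtering in this study was done on Genome Studio output.

```python
from typing import Dict, List, Optional


def call_rate(genotypes: List[Optional[str]]) -> float:
    """Fraction of markers with a non-missing call (None = no-call)."""
    if not genotypes:
        raise ValueError("sample has no markers")
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def passing_samples(samples: Dict[str, List[Optional[str]]],
                    min_call_rate: float = 0.90) -> List[str]:
    """Return IDs of samples whose call rate is at least min_call_rate."""
    return [sid for sid, calls in samples.items()
            if call_rate(calls) >= min_call_rate]


# Toy data: 100 markers per sample, with differing numbers of no-calls.
samples = {
    "person_a": ["AA"] * 95 + [None] * 5,       # call rate 0.95
    "person_b": ["AA"] * 80 + [None] * 20,      # call rate 0.80
    "water_control": ["AA"] * 45 + [None] * 55,  # call rate 0.45
}
print(passing_samples(samples))
```

In this toy example only the first sample is retained; a low call rate in a water control is expected, since it contains no DNA to genotype.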
Genomic data from the 1KGP consists of 2,504 individuals from 26 subpopulations across five continents and has been previously described (Birney and Soranzo 2015; 1000 Genomes Project Consortium 2015). In brief, the 1KGP investigators sampled adult, "legally competent" individuals who are not from vulnerable or identifiable populations, using protocols that were in accordance with standard ethical guidelines (https://www.internationalgenome.org). Individuals in the database were self-reported to be healthy and gave their gender and ethnicity. The entirety of genomic data from the 1KGP contains 88 million variant sites (1000 Genomes Project Consortium 2015) and was collected using whole-genome sequencing.
The computational workflow is depicted in Figure 1. To quantify trends of population structure between and within the IsoPop population and the individuals in the 1KGP, we first obtained a subset of informative SNPs. The raw IsoPop data generated from Genome Studio were exported as tables. These tables were loaded into RStudio using the data.table package, where insertions and deletions were removed, as well as genotypes with a call rate < 90% (RStudio Team 2015). The 1KGP data were downloaded at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/. To match the IsoPop data set with the 1KGP data set, only SNPs on the forward strand were kept. Additionally, support files provided by Illumina (https://support.illumina.com/downloads.html) were used to convert SNP IDs into the reference SNP ID number. None of the individuals exceeded a threshold of 10% missing data. This data set was then converted into HapMap format, and TASSEL (Bradbury et al. 2007) was used to convert these data to VCF format. Next, PLINK (Purcell et al. 2007) was used to remove SNPs with more than two alleles or more than 5% missing data. The reference allele was converted to the reference genome GRCh37 using PLINK 2.0. The resulting IsoPop data set used for subsequent analysis was composed of 157 individuals and 999,259 autosomal SNPs.
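The marker-level filters described above (dropping insertions and deletions, markers with more than two alleles, and markers with more than 5% missing data) can be sketched in a few lines. This is a simplified stand-in for the R, TASSEL, and PLINK steps, not a reproduction of them; the Marker class, its field names, and the toy markers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Marker:
    marker_id: str
    ref: str                     # reference allele
    alts: Tuple[str, ...]        # alternate allele(s)
    calls: List[Optional[str]]   # one call per individual; None = missing


def is_biallelic_snp(m: Marker) -> bool:
    """True for single-base variants with exactly one alternate allele."""
    single_base = len(m.ref) == 1 and all(len(a) == 1 for a in m.alts)
    return single_base and len(m.alts) == 1


def missing_rate(m: Marker) -> float:
    """Fraction of individuals with no call at this marker."""
    if not m.calls:
        raise ValueError(f"marker {m.marker_id} has no calls")
    return sum(c is None for c in m.calls) / len(m.calls)


def filter_markers(markers: List[Marker],
                   max_missing: float = 0.05) -> List[Marker]:
    """Keep biallelic SNPs whose missing rate does not exceed max_missing."""
    return [m for m in markers
            if is_biallelic_snp(m) and missing_rate(m) <= max_missing]


markers = [
    Marker("snp_ok", "A", ("G",), ["AG"] * 20),
    Marker("indel", "AT", ("A",), ["AT/A"] * 20),             # deletion
    Marker("triallelic", "C", ("G", "T"), ["CG"] * 20),       # three alleles
    Marker("sparse", "T", ("C",), ["TC"] * 18 + [None] * 2),  # 10% missing
]
kept = [m.marker_id for m in filter_markers(markers)]
print(kept)
```

Only the first toy marker passes all three filters. Strand matching and conversion to reference SNP IDs are not covered by this sketch.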
Genetic Diversity Estimates
PGDSpider (Lischer and Excoffier 2012) was used to convert VCF files to Arlequin project format (Excoffier and Lischer 2010). Arlequin (version 184.108.40.206) was used to calculate FIS and average gene diversity using the approach of Nei (1987) across each marker and averaged for each chromosome. This was done to assess how genetically related these populations are to each other and potentially parse out founder effects. These FIS values were calculated and graphed along the chromosomes for both the IsoPop and 1KGP individuals using VCFtools (Danecek et al. 2011).
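As an illustration of the per-marker and per-chromosome averaging described here, the sketch below computes a sample-size-corrected gene diversity, n / (n - 1) * (1 - sum of squared allele frequencies), where n is the number of gene copies sampled, and then averages it within each chromosome. The function names, input layout, and allele counts are hypothetical, and the sketch does not reproduce Arlequin's internals.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def gene_diversity(allele_counts: Sequence[int]) -> float:
    """Unbiased gene diversity at one locus from allele copy counts."""
    n = sum(allele_counts)  # number of gene copies (2 per diploid individual)
    if n < 2:
        raise ValueError("need at least two gene copies")
    sum_sq = sum((c / n) ** 2 for c in allele_counts)
    return n / (n - 1) * (1 - sum_sq)


def mean_by_chromosome(
    loci: Iterable[Tuple[str, Sequence[int]]]
) -> Dict[str, float]:
    """Average per-locus gene diversity within each chromosome."""
    per_chrom: Dict[str, List[float]] = defaultdict(list)
    for chrom, counts in loci:
        per_chrom[chrom].append(gene_diversity(counts))
    return {chrom: sum(vals) / len(vals) for chrom, vals in per_chrom.items()}


# Toy input: (chromosome, [copies of allele 1, copies of allele 2]).
loci = [
    ("1", [100, 100]),  # both alleles equally common
    ("1", [200, 0]),    # monomorphic locus, diversity 0
    ("2", [150, 50]),
]
print(mean_by_chromosome(loci))
```

For a biallelic SNP this statistic cannot exceed roughly 0.5, which is reached when both alleles are equally common.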
Eigenanalysis Using EIGMIX
The procedure described in Zheng and Weir (2016) was used to assess the presence of source populations in the IsoPop data set. In summary, this eigenanalysis differs from a traditional principal component analysis in that a coancestry matrix from the SNP data is used. This analysis was conducted on three different subsets of the data: (a) on data comprising only the 157 IsoPop individuals, (b) on the combined IsoPop data set and the 2,504 individuals from the 1KGP data set, and (c) using the IsoPop data set and the subset of 503 individuals in the 1KGP data set from five European subpopulations. This analysis was conducted using the SNPRelate package in R. All scripts used for these analyses are publicly available at https://github.com/AmandaO8, and the coancestry matrix of all individuals used in this analysis is presented as Supplementary Table S1 and visualized in Supplementary Figure S1.
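The general idea of an eigenanalysis on a coancestry matrix can be illustrated with a small pure-Python sketch. It assumes a ratio-of-sums coancestry estimate (the sum over SNPs of the product of two individuals' centered genotypes, divided by the sum over SNPs of 4p(1 - p)) and extracts only the leading eigenvector by power iteration. Both choices are simplifying assumptions made for illustration; this is not the SNPRelate implementation, and the function names and toy genotypes are hypothetical.

```python
import math
from typing import List


def coancestry_matrix(genotypes: List[List[int]]) -> List[List[float]]:
    """genotypes[i][l] is the allele count (0, 1, or 2) for individual i at SNP l."""
    n = len(genotypes)
    n_snps = len(genotypes[0])
    # Sample allele frequency at each SNP.
    freqs = [sum(g[l] for g in genotypes) / (2 * n) for l in range(n_snps)]
    # Shared denominator: sum over SNPs of 4 p (1 - p).
    denom = sum(4 * p * (1 - p) for p in freqs)
    if denom == 0:
        raise ValueError("all SNPs are monomorphic")
    centered = [[g[l] - 2 * freqs[l] for l in range(n_snps)] for g in genotypes]
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / denom
             for j in range(n)] for i in range(n)]


def leading_eigenvector(matrix: List[List[float]],
                        iterations: int = 500) -> List[float]:
    """Power iteration for the eigenvector with the largest eigenvalue."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        w = [sum(row[k] * v[k] for k in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("start vector lies in the null space")
        v = [x / norm for x in w]
    return v


# Toy data: two groups of three individuals with opposite genotypes.
genotypes = [[2, 2, 0, 0]] * 3 + [[0, 0, 2, 2]] * 3
theta = coancestry_matrix(genotypes)
axis = leading_eigenvector(theta)
print([round(x, 3) for x in axis])
```

In this toy example the two groups receive coordinates of opposite sign on the leading eigenvector, which is the kind of separation that a plot of the first two eigenvectors is used to reveal. A full analysis would extract several eigenvectors with a standard linear algebra routine rather than power iteration.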
Genetic Diversity of IsoPop Individuals Is Comparable to Other European Populations
Estimates of genetic diversity in the IsoPop individuals for each chromosome are shown in Supplementary Table S2. The observed FIS values were all near 0, with only chromosomes 8 and 9 having positive values, indicating that there has been random mating in these populations. By population, values for average gene diversity (using the approach described in Nei 1987) over loci are all close to 0.3 for each chromosome. These values are similar to those of other European populations and suggest that the IsoPop individuals are as genetically diverse as a typical population of European descent.
Using the same 999,259 SNPs considered in the IsoPop data set, FIS values were also calculated for the 1KGP individuals, and the results are graphed in Figure 2. This enabled the direct comparison of heterozygosity between the IsoPop individuals and the 1KGP individuals. The IsoPop populations had smaller FIS values than did the full set of 2,504 1KGP individuals, suggesting lower levels of heterozygosity. However, the results also show that the distribution of FIS values among the 503 1KGP individuals from five European subpopulations was similar to those of the four IsoPop communities. This suggests that the IsoPop individuals are less genetically diverse than individuals in the 1KGP as a whole, but similar to the 1KGP subset of individuals from European-descended populations.
Comparison with 1KGP Data Suggests Multiple Source Populations from Europe
To test for the presence of observable founder effects among the IsoPop populations, we conducted an eigenanalysis of 999,259 autosomal genome-wide markers that segregated among these individuals (Figure 3; Supplementary Figures S2–S5). Most IsoPop individuals were in close proximity to one another on the plot of the first two eigenvectors, with four individuals far removed from the main cluster of individuals. Thus, all but four of the individuals (two from community 1, and two from community 3; both of these communities are rural) in the IsoPop data set cluster together, suggesting that most individuals are descended from a single source population and that the remaining four individuals are likely from two other source populations. Even with the removal of these four observations, most individuals still cluster with one another (Figure 3). The two sets of individuals outside of the main cluster that group with one another are respectively from the same communities, suggesting the possibility that some individuals are related and did not report it or were unaware. Unexpectedly, the urban population does not appear to be any more diverse than the rural populations.
To further assess the genetic relatedness between the IsoPop individuals, we next conducted an eigenanalysis on the same set of 999,259 markers using the IsoPop data set combined with the 2,504 individuals from the 1KGP database. The resulting plot of the first two eigenvectors (Figure 4) revealed that most of the IsoPop individuals formed one cluster. Additional plots from this analysis are included as Supplementary Figures S6–S8. This cluster overlaps with the 1KGP European individuals and is farthest from the 1KGP individuals with Asian and African ancestry. This result suggests (a) that IsoPop individuals are more closely related to each other than to other world populations, and (b) that their source population is most likely Europe.
A final eigenanalysis was conducted with the IsoPop individuals and the 503 individuals of European descent from the 1KGP. The corresponding plot summarizing results from the first two eigenvectors (Figure 5) had three main groups and three outlier individuals. Additional plots from this analysis are included as Supplementary Figures S9–S11. Many IsoPop individuals from each population cluster with those of Great Britain, including all individuals of the urban population (community 4). Additionally, many individuals from the rural populations (communities 1, 2, and 3) group with Utah residents of northern and western European ancestry (CEU) and with individuals from Finland (FIN), Spain (IBS), and Tuscany (TSI). These more refined results supersede the immediately previous findings from the original eigenanalysis by suggesting that the rural populations have multiple European source populations and likely had several founding groups. This also indicates that the urban population (community 4) only has Great Britain as a source population and might be less diverse than the rural populations.
The use of genomic markers from high-throughput genotyping data to compare the relatedness between individuals in rural communities and those in publicly available databases could help identify founder effects and source populations. To assess the capability of such an approach, we analyzed genetic data from 157 individuals living in four communities in Illinois (three of which were rural) and used state-of-the-art statistical approaches to compare their genetic similarity to the 2,504 individuals comprising the 1KGP database. Given the novelty of these IsoPop data, these results provided an initial glance into the genetic diversity underlying these individuals. In particular, our first finding was not only that the three rural communities were indistinct from each other but also that they were indistinct from the urban "control" population. This indicates that genetic founder effects may not be present in these isolated rural communities and that community endogamy is not so reduced in rural areas as to influence observable genetic differences compared to a more urban area.
We next examined the IsoPop data in relation to the globally representative 1KGP data set. Our first finding was that the eigenanalysis primarily grouped the IsoPop individuals into one cluster (Figure 4), which was closest to the subset of 1KGP European individuals, suggesting the IsoPop individuals are more closely related to Europeans than to other groups. Our results are also consistent with our theoretical expectations based on the genealogical data suggesting that most IsoPop individuals are descended from people of European ancestry. Further evidence of the presence of a single European source population is provided by the respective plots of the first two eigenvectors clustering most of the IsoPop individuals into their own group, suggesting that the vast majority of them are closely related (Figure 3, Supplementary Figures S3–S5). However, the individuals situated distantly from the cluster could be obscuring some of the variation in these populations.
The plot of the first two eigenvectors from the eigenanalysis of the IsoPop and the 1KGP European subpopulation suggests genetic similarity with British, Finnish, Spanish, Tuscan, and people of northern and western European ancestry living in Utah (Figure 5). While Figure 4 (IsoPop + total 1KGP) suggests one European source population, Figure 5 (IsoPop + 1KGP European subset) suggests multiple European source populations underlying most IsoPop individuals. Thus, the tight clustering of the IsoPop individuals with these populations potentially rules out the possibility of a single source population. The genetic diversity estimates of the IsoPop population are similar to those found in other studies, in that the ranges of the estimates overlap but the average values were different. For example, the FIS values for the IsoPop range from –0.00652 to 0.00177 and have mostly negative values, whereas those of German populations range from –0.0022 to 0.0108 and have mostly positive values (Steffens et al. 2006). These FIS values indicate that the IsoPop individuals are no more or less closely related to one another than expected under the null model of random mating.
Using the combined marker data from the IsoPop data set and the 1KGP database, we were able to infer that most of the IsoPop individuals are descended from at least two source populations originally from Europe. This result could aid researchers studying the prevalence of diseases in the three rural Illinois communities included in the IsoPop data set by suggesting that individuals who cluster with one of the source populations could have levels of genetic predisposition similar to those of that population. More broadly, our study serves as a proof of concept, demonstrating that it is possible to use an approach like an eigenanalysis to compare the genetic characteristics between a set of individuals and those from a public database and, moreover, to show that it is possible to obtain biologically meaningful results.
In general, research into the risk of disease attributable to specific gene variants and combinations is often hindered by a low carrier frequency of specific mutations among the general population (Sherry et al. 2001). While this study showed insignificant differences across the rural and urban communities, we did not examine specific loci known/thought to be associated with increased disease risk. Additional work would specifically examine and characterize such loci, as the identification of specific populations with naturally increased carrier frequencies of specific gene variants of interest would greatly justify the utility of ecological and historical studies of diseases (Peltonen et al. 2000). This in turn could result in multiple studies of how individual genetic makeup may impact such important topics as drug efficacy (Arbitrio et al. 2019) and variable outcomes to environmental exposure (Ryu et al. 2018).
There are several limitations to this work. First, the rural communities were chosen as a matter of feasibility and convenience. While the community size was based on the work of Portas et al. (2010), true isolation is more difficult to ascertain objectively. Rigor in assessing isolation and randomization of selection would be needed for future work. Second, the choice of the urban "control" is also based on convenience. While the urban community has a population exceeding 110,000, it is by no means a major metropolitan center, as reflected in the fact that its population appeared to be related to just one European subpopulation (i.e., British) and is therefore potentially problematic to use as an urban control population. This could be because the sample urban population was not fully representative of the whole population, or perhaps this particular urban population is not as genetically diverse as others. Future studies could use a larger (or multiple) urban community to circumvent this potential problem. Third, follow-up studies that trace the history of settlement of these communities could complement and potentially substantiate the findings of the work presented here.
Another important limitation of this study lies in the genotyping technologies used to obtain markers in the IsoPop and 1KGP data sets. In addition to the potential for ascertainment bias inherent in using arrays such as Illumina (described in Lipka et al. 2015), additional bias could arise from the fact that an Illumina chip was used to call markers in the IsoPop data set whereas whole genome sequencing was used in the 1KGP data set. However, our results suggest that such an ascertainment bias could be minimal. For example, there is close proximity between the IsoPop and 1KGP individuals in Figures 4 and 5. We also observed a similar distribution of rare and common SNPs in the IsoPop individuals and the 503 1KGP individuals of European descent (Supplementary Figure S12, Supplementary Table S3), as well as similar linkage disequilibrium patterns (Supplementary Figure S13). Nevertheless, future studies should use the same sequencing platforms to obtain markers in all data sets that are evaluated. Finally, we encourage future studies to compare data from rural isolated communities from the US Midwest with marker data from other publicly available data sets besides the 1KGP data set that include more than just the five subpopulations of European descent, such as POPRES (Nelson et al. 2008). Such a comparison could shed further light on the number of source populations underlying these isolated communities.
This study utilized nearly 1 million high-quality SNPs and, to the best of our knowledge, is the first to use SNP data to examine both population structure and founder effects in nonreligious rural isolated populations in the US Midwest. The potential impact of founder effects on the genetic diversity of rural communities over hundreds of years could be the subject of future studies. For example, these studies could consider advanced statistical approaches for quantifying such effects and, moreover, parse out these effects on the population over multiple generations, from the founding of the population to the present day. Lastly, other SNP chips or whole-genome sequencing could be used to obtain a larger marker set (and thus capture an even greater amount of genomic diversity) for a combined analysis of these IsoPop individuals and other publicly available data sets.
1. Program in Ecology, Evolution and Conservation Biology, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.
2. Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.
3. Department of Crop Sciences, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.
4. Department of Population Science and Policy, Southern Illinois University School of Medicine, Springfield, Illinois, USA.
5. Rural and Minority Health Research Center, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA.
6. Epidemiology and Biostatistics, Southern Illinois University School of Medicine, Springfield, Illinois, USA.
7. Department of Anthropology, College of Liberal Arts and Sciences, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.
Supplementary Table S1. IsoPop Individuals by Population, and Quality Assessments Made by Qubit Concentration and the Illumina Call Rate
Go to the following link to view the table: https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?filename=0&article=1151&context=humbiol_preprints&type=additional