Abstract

We have previously hypothesized that relatively small and isolated rural communities may experience founder effects, defined as the genetic ramifications of small population sizes at the time of a community's establishment. To explore this, we used an Illumina Infinium Omni2.5Exome-8 chip to collect data from 157 individuals from four Illinois communities, three rural and one urban. Genetic diversity estimates of 999,259 autosomal markers suggested that the reduction in heterozygosity due to shared ancestry was approximately 0, indicating a randomly mating population. An eigenanalysis, which is similar to a principal component analysis but run on a genetic coancestry matrix, conducted in the SNPRelate R package revealed that most of these individuals formed one cluster, with a few putative outliers obscuring population variation. An additional eigenanalysis on the same markers in a combined data set including the 2,504 individuals in the 1000 Genomes database found that most of the 157 Illinois individuals clustered into one group in close proximity to individuals of European descent. A final eigenanalysis of the Illinois individuals with the 503 individuals of European descent (within the 1000 Genomes Project) revealed two clusters of individuals and likely two source populations: one British and one consisting of multiple European subpopulations. We therefore demonstrate the feasibility of examining genetic relatedness across Illinois populations and assessing the number of source populations using publicly available databases. When assessed, population structure information can contribute to the understanding of genetic history in rural populations.

Keywords

Population Structure, Genome-Wide Markers, Source Populations, Founder Effects, Rural Communities

Two key characteristics of many rural communities in the US Midwest are that they were founded several hundred years ago and that little migration has occurred compared with similar communities in Africa, Asia, and Europe (described in Jenkins et al. 2016). While many non-genetic factors may explain a substantial share of the increased incidence of certain diseases in these rural communities (for specific examples, see Befort et al. 2012; Hines and Markossian 2012; Henry et al. 2014), quantification of a possible genetic predisposition to diseases in such communities could assist efforts to account for and minimize disease risk. It is therefore critical to compare and contrast genetic characteristics of rural populations to those from urban populations. This will particularly enable the testing of our hypothesis that small and isolated rural communities may experience genetic founder effects to a greater extent than their more urban peers (Jenkins et al. 2016). Such founder effects may influence disease susceptibility and have long-lasting impacts (Rudan 1999). We hypothesize that a small town, founded by a small number of individuals and relatively geographically isolated, can remain affected by the initial founder effect over hundreds of years. Similar examples have been observed previously (e.g., the island of Sardinia), where geography presents a physical barrier to travel (Portas et al. 2010).

Researchers can use genetic data to estimate how closely related individuals in a population are to each other, as well as to determine if members of a rural community have a single or multiple source population(s) (the location of the population's origin; Falush et al. 2003; Wang et al. 2007). Determining if there is more than one source population is an important step for examining population structure; multiple source populations would suggest higher initial genetic diversity than a single source population and minimize any impacts of a founder effect. The ability to use genetic data to quantify subpopulation structure is an important factor in population studies (Wacholder et al. 2000; Thomas and Witte 2002; Campbell et al. 2005). Population structure analyses can be performed with large numbers of single nucleotide polymorphisms (SNPs) using small amounts of DNA and commercially available SNP chips. Given the use of genetic data from such chips in previous research (Vaags et al. 2012; Terao et al. 2013; De Vivo et al. 2014; Mayba et al. 2014; Machiela et al. 2016), it appears that they are well suited for quantifying subpopulation structure in rural isolated populations and hence provide insight into the impact of founder effects and isolation on current community genetic diversity.

Genome-wide markers obtained from SNP chips can also be used to obtain measures of genetic diversity. Such measures include average gene diversity over loci, which estimates overall population diversity (Nei 1987). Population similarity can be measured using Wright's indices, including Wright's inbreeding coefficient, FIS, which examines the reduction in heterozygosity in a population due to shared ancestry (Wright 1950). This measure can help estimate the relatedness of individuals within a population. Typical FIS values for European populations have been reported for German populations (–0.0010 to 0.0108) (Steffens et al. 2006) and for several Iberian populations: Basques (0.0000), Navarre (0.015), and Pass Valley (0.0144) (Cardoso et al. 2017). Thus, measures of average gene diversity over loci and FIS could indicate if rural populations have less diversity and/or appear to exhibit genetic drift, including a genetic bottleneck or founder effect, compared to other world populations.

Beyond measuring average gene diversity and FIS, we speculate that another critical analysis leading to accurate quantification of subpopulation structure in rural populations would be to compare their genetic relationships with various populations throughout the world. Such an analysis could facilitate the identification of source populations and provide insight into the presence of founder effects. The undertaking of such an endeavor is now possible given the availability of whole-genome sequenced data sets such as those from the 1000 Genomes Project (1KGP; 1000 Genomes Project Consortium 2015), POPRES (Nelson et al. 2008), AncestryDNA.com, the Human Genome Diversity Project (Cann et al. 2002), and HapMap projects (International HapMap Consortium 2003). Genome-wide markers segregating in both the rural populations and these whole-genome sequenced data sets could then be analyzed to quantify genetic relationships and identify source populations. Approaches such as STRUCTURE (Pritchard et al. 2000), principal component analysis (Price et al. 2006), and ADMIXTURE (Alexander et al. 2009) are adequate for using genome-wide markers to infer which subpopulations are present in the resulting combined data sets. However, advances in methodologies, including the eigenanalysis approach of Zheng and Weir (2016), now make it possible to characterize which ancestral populations underlie individuals living in rural communities by directly incorporating into the calculations the probability of markers being identical by descent.

The purpose of this study was to examine whole-genome SNP data from individuals from three rural populations and one urban population in Illinois, USA, and to characterize their genetic properties, including genetic diversity and relatedness. To achieve this, we characterized the genetic properties of these individuals and then compared them to those of the individuals in the 1KGP database. We hypothesized that such an assessment could shed light on potential founder effects and suggest genetic differentiation from more urban populations.

Materials and Methods

Illinois IsoPop Data Set

The individuals comprising the Isolated Populations Project (IsoPop) data set, as well as the methods used to recruit them, have been described elsewhere (Dean et al. 2017). Briefly, 176 individuals were recruited from three rural communities (70 individuals from community 1, 30 from community 2, and 41 from community 3) and one urban community (35 individuals; community 4) in Illinois. The three rural communities were thought to have been settled in the past 300 years and are relatively isolated (Jenkins et al. 2016); they are between 100 and 400 miles from one another, and the nearest urban center to each community is located between 30 and 60 miles away (Wiley Jenkins, pers. comm., September 14, 2019). In addition to providing genealogical information and saliva samples, the participants took surveys and engaged in community forums. The genealogy information was used to remove individuals that were first-degree relatives with an already-recruited participant so as to not artificially inflate the degree of relatedness within the groups. This project was approved by the Southern Illinois University School of Medicine institutional review board (Springfield Committee for Research Involving Human Subjects; no. 15-328), and all participants provided informed consent.

DNA Extraction and Marker Identification of IsoPop Individuals

Extraction of DNA was carried out using an Oragene prepIT-L2P kit (DNA Genotek) following the standard protocol with a few modifications. Incubation occurred in a heat block for 2–24 hours (protocol suggested 2 hours of incubation). Rehydration of the DNA pellets occurred by incubating at 50°C for 1 hour or more as needed. Sample concentration was assessed using the Qubit assay (ThermoFisher). The average DNA concentration obtained was 89.43 µg/ml, with a range of 0.281–500 µg/ml. All samples, their population, and DNA concentration are listed in Supplementary Table S1.

Samples were aliquoted into separate tubes and taken to the Keck Biotechnology Sequencing Center at the University of Illinois at Urbana-Champaign. A water sample was included in the run to assess contamination, had a call rate of 0.4522, and was removed from analyses. Next, DNA samples were run on Illumina Infinium Omni2.5Exome-8 BeadChips (Illumina Inc., San Diego, CA, USA) according to the Illumina LCG Assay Protocol (part no. 15023139, rev. D). The BeadChips were scanned on the Illumina iScan to genotype 2,612,357 markers from the human genome. Sample results were viewed in Genome Studio, and the "positive/negative" column was exported using a Dell PC with 64 GB RAM. We removed a total of 19 individuals that either had a call rate < 0.90, as suggested by other studies (Verdu et al. 2014), or were first-degree relatives to another individual (as reported by genealogical data), resulting in a total of 157 IsoPop individuals that were analyzed.

1KGP Database

Genomic data from the 1KGP consists of 2,504 individuals from 26 subpopulations across five continents and has been previously described (Birney and Soranzo 2015; 1000 Genomes Project Consortium 2015). In brief, the 1KGP investigators sampled adult, "legally competent" individuals who are not from vulnerable or identifiable populations, using protocols that were in accordance with standard ethical guidelines (https://www.internationalgenome.org). Individuals in the database were self-reported to be healthy and gave their gender and ethnicity. The entirety of genomic data from the 1KGP contains 88 million variant sites (1000 Genomes Project Consortium 2015) and was collected using whole-genome sequencing.

Computational Methods

The computational workflow is depicted in Figure 1. To quantify trends of population structure between and within the IsoPop population and the individuals in the 1KGP, we first obtained a subset of informative SNPs. The raw IsoPop data generated from Genome Studio were exported as tables. These tables were loaded into RStudio using the data.table package, where insertions and deletions were removed, as well as genotypes with a call rate < 90% (RStudio Team 2015). The 1KGP data were downloaded at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/. To match the IsoPop data with the 1KGP data set, only SNPs on the forward strand were kept. Additionally, support files provided by Illumina (https://support.illumina.com/downloads.html) were used to convert SNP IDs into the reference SNP ID number. None of the individuals exceeded a threshold of 10% missing data. This data set was then converted into HapMap format, and TASSEL (Bradbury et al. 2007) was used to convert these data to VCF format. Next, PLINK (Purcell et al. 2007) was used to remove SNPs with more than two alleles or more than 5% missing data. The reference allele was converted to the reference genome GRCh37 using PLINK 2.0. The resulting IsoPop data set used for subsequent analysis was composed of 157 individuals and 999,259 autosomal SNPs.

Figure 1. Computational methods workflow. Programs used are in dark gray and, unless otherwise noted, were performed in RStudio. 1KGP, 1000 Genomes Project database.
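The call-rate and missing-data filters described in this section can be illustrated with a minimal sketch. This is not the authors' pipeline (which used data.table, TASSEL, and PLINK); the function names, the dictionary layout, and the coding of missing calls as None are assumptions made for illustration. The thresholds (sample call rate of at least 0.90, at most 5% missing data per SNP) are taken from the text.

```python
# Minimal sketch of the quality-control filters, assuming genotypes are coded
# as 0, 1, or 2 copies of an allele and None marks a missing call.
# All names here are hypothetical and are not part of PLINK or TASSEL.

def call_rate(calls):
    """Fraction of non-missing genotype calls in a list."""
    return sum(c is not None for c in calls) / len(calls)

def filter_genotypes(genotypes, min_sample_call_rate=0.90, max_snp_missing=0.05):
    """Apply a sample filter, then a SNP filter.

    genotypes: dict mapping sample ID -> list of calls (same SNP order for all).
    Returns (kept_sample_ids, kept_snp_indices).
    """
    # Step 1: drop samples whose call rate is below the threshold.
    kept_samples = [s for s, calls in genotypes.items()
                    if call_rate(calls) >= min_sample_call_rate]

    # Step 2: among the remaining samples, drop SNPs with too much missing data.
    n_snps = len(next(iter(genotypes.values())))
    kept_snps = []
    for j in range(n_snps):
        column = [genotypes[s][j] for s in kept_samples]
        if column and 1 - call_rate(column) <= max_snp_missing:
            kept_snps.append(j)
    return kept_samples, kept_snps
```

Filtering samples before SNPs is a design choice in this sketch: a poorly genotyped sample would otherwise inflate the missing-data rate of many SNPs and cause them to be discarded unnecessarily.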

Genetic Diversity Estimates

PGDSpider (Lischer and Excoffier 2012) was used to convert VCF files to Arlequin project format (Excoffier and Lischer 2010). Arlequin (version 3.5.2.2) was used to calculate FIS and average gene diversity using the approach of Nei (1987) across each marker and averaged for each chromosome. This was done to assess how genetically related these populations are to each other and potentially parse out founder effects. These FIS values were calculated and graphed along the chromosomes for both the IsoPop and 1KGP individuals using VCFtools (Danecek et al. 2011).
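For readers unfamiliar with these quantities, the following sketch shows one simple way to compute average gene diversity and a multilocus FIS from genotype counts. It is an illustration only: Arlequin's internal computation may differ, and the function names, the input layout (counts of the three genotypes per SNP), and the ratio-of-sums form of the multilocus FIS are assumptions. The small-sample correction factor 2n/(2n − 1) follows Nei's unbiased gene diversity.

```python
# Minimal sketch, assuming biallelic SNPs summarized as genotype counts
# (n_AA, n_Aa, n_aa). Function names are hypothetical.

def locus_stats(n_AA, n_Aa, n_aa):
    """Return (observed heterozygosity, expected heterozygosity) for one SNP."""
    n = n_AA + n_Aa + n_aa                  # number of genotyped individuals
    p = (2 * n_AA + n_Aa) / (2 * n)         # frequency of allele A
    q = 1 - p
    h_obs = n_Aa / n
    # Nei's unbiased gene diversity with the 2n / (2n - 1) correction.
    h_exp = (2 * n / (2 * n - 1)) * (1 - p * p - q * q)
    return h_obs, h_exp

def summarize(loci):
    """Return (average gene diversity over loci, multilocus F_IS).

    loci: list of (n_AA, n_Aa, n_aa) tuples, at least one of them polymorphic.
    """
    stats = [locus_stats(*counts) for counts in loci]
    sum_obs = sum(h_obs for h_obs, _ in stats)
    sum_exp = sum(h_exp for _, h_exp in stats)
    gene_diversity = sum_exp / len(stats)
    # F_IS = 1 - H_obs / H_exp; values near 0 are expected under random mating.
    f_is = 1 - sum_obs / sum_exp
    return gene_diversity, f_is
```

Under this definition, an excess of heterozygotes relative to expectation gives a negative FIS and a deficit gives a positive FIS, which is how the near-zero values reported in the Results are read as consistent with random mating.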

Eigenanalysis Using EIGMIX

The procedure described in Zheng and Weir (2016) was used to assess the presence of source populations in the IsoPop data set. In summary, this eigenanalysis differs from a traditional principal component analysis in that a coancestry matrix from the SNP data is used. This analysis was conducted on three different subsets of the data: (a) on data comprising only the 157 IsoPop individuals, (b) on the combined IsoPop data set and the 2,504 individuals from the 1KGP data set, and (c) using the IsoPop data set and the subset of 503 individuals in the 1KGP data set from five European subpopulations. This analysis was conducted using the SNPRelate package in R. All scripts used for these analyses are publicly available at https://github.com/AmandaO8, and the coancestry matrix of all individuals used in this analysis is presented as Supplementary Table S1 and visualized in Supplementary Figure S1.
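The general idea of an eigenanalysis on a coancestry matrix can be sketched as follows. This is not the SNPRelate implementation and not the authors' scripts: the ratio-of-sums scaling of the coancestry estimate is written here as an assumption in the spirit of Zheng and Weir (2016), only the leading eigenvector is extracted (by power iteration), and all function names are hypothetical.

```python
import math

# Minimal sketch, assuming genotypes are complete allele dosages (0, 1, 2)
# arranged as one list per individual, with at least one polymorphic SNP.

def coancestry_matrix(genotypes):
    """Estimate pairwise coancestry from centered dosages (ratio of sums)."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Sample allele frequency at each SNP.
    freqs = [sum(row[l] for row in genotypes) / (2 * n) for l in range(m)]
    # Center each dosage by its expected value, 2p.
    centered = [[row[l] - 2 * freqs[l] for l in range(m)] for row in genotypes]
    # Assumed scaling: sum over SNPs of 4p(1 - p).
    denom = sum(4 * p * (1 - p) for p in freqs)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / denom
             for j in range(n)] for i in range(n)]

def leading_eigenpair(matrix, iterations=200):
    """Power iteration for the largest eigenvalue and its eigenvector."""
    n = len(matrix)
    # A non-constant start vector is used because the constant vector lies in
    # the null space of a matrix built from column-centered dosages.
    vec = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vec = [x / norm for x in product]
        # Rayleigh quotient of the normalized vector.
        eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n))
                         for i in range(n))
    return eigenvalue, vec
```

In such a sketch, individuals drawn from the same source population tend to receive similar coordinates on the leading eigenvector, which is the property that the eigen plots in the Results rely on. A full analysis would extract several eigenvectors and would typically use an optimized library rather than power iteration.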

Results

Genetic Diversity of IsoPop Individuals Is Comparable to Other European Populations

Estimates of genetic diversity in the IsoPop individuals for each chromosome are shown in Supplementary Table S2. The observed FIS values were all near 0, with only chromosomes 8 and 9 having positive values, indicating that there has been random mating in these populations. By population, values for average gene diversity (using the approach described in Nei 1987) over loci are all close to 0.3 for each chromosome. These values are similar to those of other European populations and suggest that the IsoPop individuals are as genetically diverse as a typical population of European descent.

Using the same 999,259 SNPs considered in the IsoPop data set, FIS values were also calculated for the 1KGP individuals, and the results are graphed in Figure 2. This enabled the direct comparison of heterozygosity between the IsoPop individuals and the 1KGP individuals. The IsoPop populations had smaller FIS values than did the full set of 2,504 1KGP individuals, suggesting lower levels of heterozygosity. However, the results also show that the distribution of FIS values among the 503 1KGP individuals from five European subpopulations was similar to those of the four IsoPop communities. This suggests that the IsoPop individuals are less genetically diverse than individuals in the 1KGP as a whole, but similar to the 1KGP subset of individuals from European-descended populations.

Comparison with 1KGP Data Suggests Multiple Source Populations from Europe

To test for the presence of observable founder effects among the IsoPop populations, we conducted an eigenanalysis of 999,259 autosomal genome-wide markers that segregated among these individuals (Figure 3; Supplementary Figures S2–S5). Most IsoPop individuals were in close proximity to one another on the plot of the first two eigenvectors, with four individuals far removed from the main cluster of individuals. Thus, all but four of the individuals (two from community 1, and two from community 3; both of these communities are rural) in the IsoPop data set cluster together, suggesting that most individuals are descended from a single source population and that the remaining four individuals are likely from two other source populations. Even with the removal of these four observations, most individuals still cluster with one another (Figure 3). The two sets of individuals outside of the main cluster that group with one another are respectively from the same communities, suggesting the possibility that some individuals are related and did not report it or were unaware. Unexpectedly, the urban population does not appear to be any more diverse than the rural populations.

Figure 2. Plot of observed FIS values (y-axis) of 999,259 SNPs. The x-axis shows the populations: the 1000 Genomes Project individuals (1KGP), individuals of European descent from the 1000 Genomes Project (EUR), and all the Illinois individuals combined (ISOPOP), followed by each population separately (ISOPOP1–4). The urban population is ISOPOP4.

To further assess the genetic relatedness between the IsoPop individuals, we next conducted an eigenanalysis on the same set of 999,259 markers using the IsoPop data set combined with the 2,504 individuals from the 1KGP database. The resulting plot of the first two eigenvectors (Figure 4) revealed that most of the IsoPop individuals formed one cluster. Additional plots from this analysis are included as Supplementary Figures S6–S8. This cluster overlaps with the 1KGP European individuals and is farthest from the 1KGP individuals with Asian and African ancestry. This result suggests (a) that IsoPop individuals are more closely related to each other than to other world populations, and (b) that their source population is most likely Europe.

A final eigenanalysis was conducted with the IsoPop individuals and the 503 individuals of European descent from the 1KGP. The corresponding plot summarizing results from the first two eigenvectors (Figure 5) had three main groups and three outlier individuals. Additional plots from this analysis are included as Supplementary Figures S9–S11. Many IsoPop individuals from each population cluster with those of Great Britain, including all individuals of the urban population (community 4). Additionally, many individuals from the rural populations (communities 1, 2, and 3) group with people of northern and western European ancestry living in Utah and with people from Finland, Spain, and Tuscany (CEU, FIN, IBS, and TSI, respectively). These more refined results supersede the immediately previous findings from the original eigenanalysis by suggesting that the rural populations have multiple European source populations and likely had several founding groups. This also indicates that the urban population (community 4) only has Great Britain as a source population and might be less diverse than the rural populations.

Figure 3. Eigen plot of 153 IsoPop individuals, with the four distantly grouped individuals removed, using 999,259 SNPs. The x- and y-axes are the value of the first and second eigenvector, respectively. Individuals from the four IsoPop communities, labeled 01, 02, 03, and 04, are indicated with different symbols. Most IsoPop individuals cluster into one group.

Figure 4. Eigen plot of 2,504 individuals from 1000 Genomes database and 157 IsoPop individuals using 999,259 SNPs. The x- and y-axes are the value of the first and second eigenvector, respectively. Individuals from the different world subpopulations of Africa (AFR), Americas (AMR), East Asia (EAS), South Asia (SAS), and Europe (EUR) and from the IsoPop (IL) populations are colored differently. The IsoPop individuals cluster into the EUR individuals from the 1000 Genomes database.

Figure 5. Eigen plot of 503 individuals of European descent (EUR) from the 1000 Genomes database and 157 IsoPop individuals using 999,259 SNPs. The x- and y-axes are the value of the first and second eigenvector, respectively. Individuals from the different EUR subpopulations are plotted in gray and are represented by different symbols: Utah residents in CEPH (CEU), Finland (FIN), British in England and Scotland (GBR), Iberian populations in Spain (IBS), and Toscani in Italia (TSI). The IsoPop population (IL) and the four IsoPop populations, labeled 01, 02, 03, and 04, are plotted in blue and are represented by different symbols. The IsoPop individuals cluster into two groups.

Discussion

The use of genomic markers from high-throughput genotyping data to compare the relatedness between individuals in rural communities and those in publicly available databases could help identify founder effects and source populations. To assess the capability of such an approach, we analyzed genetic data from 157 individuals living in four communities in Illinois (three of which were rural) and used state-of-the-art statistical approaches to compare their genetic similarity to the 2,504 individuals comprising the 1KGP database. Given the novelty of these IsoPop data, these results provided an initial glance into the genetic diversity underlying these individuals. In particular, our first finding was not only that the three rural communities were indistinct from each other but also that they were indistinct from the urban "control" population. This indicates that genetic founder effects may not be present in these isolated rural communities and that community endogamy is not so reduced in rural areas as to influence observable genetic differences compared to a more urban area.

We next examined the IsoPop data in relation to the globally representative 1KGP data set. Our first finding was that the eigenanalysis primarily grouped the IsoPop individuals into one cluster (Figure 4), which was closest to the subset of 1KGP European individuals, suggesting the IsoPop individuals are more closely related to Europeans than to other groups. Our results are also consistent with our theoretical expectations based on the genealogical data suggesting that most IsoPop individuals are descended from people of European ancestry. Further evidence of the presence of a single European source population is provided by the respective plots of the first two eigenvectors clustering most of the IsoPop individuals into their own group, suggesting that the vast majority of them are closely related (Figure 3, Supplementary Figures S3–S5). However, the individuals situated distantly from the cluster could be obscuring some of the variation in these populations.

The plot of the first two eigenvectors from the eigenanalysis of the IsoPop and the 1KGP European subpopulation suggests genetic similarity with British, Finnish, Spanish, Tuscan, and people of northern and western European ancestry living in Utah (Figure 5). While Figure 4 (IsoPop + total 1KGP) suggests one European source population, Figure 5 (IsoPop + 1KGP European subset) suggests multiple European source populations underlying most IsoPop individuals. Thus, the tight clustering of the IsoPop individuals with these populations potentially rules out the possibility of a single source population. The genetic diversity estimates of the IsoPop population are similar to those found in other studies, in that the ranges of the estimates overlap but the average values were different. For example, the FIS values for the IsoPop range from –0.00652 to 0.00177 and have mostly negative values, whereas those of German populations range from –0.0022 to 0.0108 and have mostly positive values (Steffens et al. 2006). These FIS values indicate that the IsoPop individuals are no more or less closely related to one another than expected under the null model of random mating.

Using the combined marker data from the IsoPop data set and the 1KGP database, we were able to infer that most of the IsoPop individuals are descended from at least two source populations originally from Europe. This result could aid researchers studying the prevalence of diseases in the three rural Illinois communities included in the IsoPop data set by suggesting that any alleles among these individuals that cluster with one of the source populations could have similar levels of genetic predisposition. More broadly, our study serves as a proof of concept, demonstrating that it is possible to use an approach like an eigenanalysis to compare the genetic characteristics between a set of individuals and those from a public database and, moreover, to show that it is possible to obtain biologically meaningful results.

In general, research into the risk of disease attributable to specific gene variants and combinations is often hindered by a low carrier frequency of specific mutations among the general population (Sherry et al. 2001). While this study showed insignificant differences across the rural and urban communities, we did not examine specific loci known/thought to be associated with increased disease risk. Additional work would specifically examine and characterize such loci, as the identification of specific populations with naturally increased carrier frequencies of specific gene variants of interest would greatly justify the utility of ecological and historical studies of diseases (Peltonen et al. 2000). This in turn could result in multiple studies of how individual genetic makeup may impact such important topics as drug efficacy (Arbitrio et al. 2019) and variable outcomes to environmental exposure (Ryu et al. 2018).

There are several limitations to this work. First, the rural communities were chosen as a matter of feasibility and convenience. While the community size was based on the work of Portas et al. (2010), true isolation is more difficult to ascertain objectively. Rigor in assessing isolation and randomization of selection would be needed for future work. Second, the choice of the urban "control" is also based on convenience. While the urban community has a population exceeding 110,000, it is by no means a major metropolitan center, as reflected in the fact that its population appeared to be related to just one European subpopulation (i.e., British) and is therefore potentially problematic to use as an urban control population. This could be because the sample urban population was not fully representative of the whole population, or perhaps this particular urban population is not as genetically diverse as others. Future studies could use a larger (or multiple) urban community to circumvent this potential problem. Third, follow-up studies that trace the history of settlement of these communities could complement and potentially substantiate the findings of the work presented here.

Another important limitation of this study lies in the genotyping technologies used to obtain markers in the IsoPop and 1KGP data sets. In addition to the potential for ascertainment bias inherent in using arrays such as Illumina (described in Lipka et al. 2015), additional bias could arise from the fact that an Illumina chip was used to call markers in the IsoPop data set whereas whole genome sequencing was used in the 1KGP data set. However, our results suggest that such an ascertainment bias could be minimal. For example, there is close proximity between the IsoPop and 1KGP individuals in Figures 4 and 5. We also observed a similar distribution of rare and common SNPs in the IsoPop individuals and the 503 1KGP individuals of European descent (Supplementary Figure S12, Supplementary Table S3), as well as similar linkage disequilibrium patterns (Supplementary Figure S13). Nevertheless, future studies should use the same sequencing platforms to obtain markers in all data sets that are evaluated. Finally, we encourage future studies to compare data from rural isolated communities from the US Midwest with marker data from other publicly available data sets besides the 1KGP data set that include more than just the five subpopulations of European descent, such as POPRES (Nelson et al. 2008). Such a comparison could shed further light on the number of source populations underlying these isolated communities.

Conclusions

This study utilized nearly 1 million high-quality SNPs and, to the best of our knowledge, is the first to use SNP data to examine both population structure and founder effects in nonreligious rural isolated populations in the US Midwest. The potential impact of founder effects on the genetic diversity of rural communities over hundreds of years could be the subject of future studies. For example, these studies could consider advanced statistical approaches for quantifying such effects and, moreover, parse out these effects on the population over multiple generations, from the founding of the population to the present day. Lastly, other SNP chips or whole-genome sequencing could be used to obtain a larger marker set (and thus capture an even greater amount of genomic diversity) for use in a combined analysis of these IsoPop individuals and other publicly available data sets.

Correspondence to: Alexander E. Lipka, Department of Crop Sciences, University of Illinois Urbana–Champaign, W-210A Turner Hall, 1102 S Goodwin Ave., Urbana, IL 61801 USA. E-mail: alipka@illinois.edu.
Received 3 April 2019; accepted for publication 12 November 2019.

Literature Cited

Alexander, D. H., J. Novembre, and K. Lange. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19:1,655–1,664.
Arbitrio, M., F. Scionti, E. Altomare et al. 2019. Polymorphic variants in NR1I3 and UGT2B7 predict taxane neurotoxicity and have prognostic relevance in breast cancer patients: A case-control study. Clin. Pharmacol. Ther. 106:422–431.
Befort, C. A., N. Nazir, and M. G. Perri. 2012. Prevalence of obesity among adults from rural and urban areas of the United States: Findings from NHANES (2005–2008). J. Rural Health 28:392–397.
Birney, E., and N. Soranzo. 2015. Human genomics: The end of the start for population sequencing. Nature 526:52–53.
Bradbury, P. J., Z. Zhang, D. E. Kroon et al. 2007. TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2,633–2,635.
Campbell, C. D., E. L. Ogburn, K. L. Lunetta et al. 2005. Demonstrating stratification in a European American population. Nat. Genet. 37:868–872.
Cann, H. M., C. de Toma, L. Cazes et al. 2002. A human genome diversity cell line panel. Science 296:261–262.
Cardoso, S., R. Sevillano, D. Gamarra et al. 2017. Population genetic data of 38 insertion-deletion markers in six populations of the northern fringe of the Iberian Peninsula. Forensic Sci. Int. Genet. 27:175–179.
Danecek, P., A. Auton, G. Abecasis et al. 2011. The variant call format and VCFtools. Bioinformatics 27:2,156–2,158.
Dean, C., A. J. Fogleman, W. E. Zahnd et al. 2017. Engaging rural communities in genetic research: Challenges and opportunities. J. Community Genet. 8:209–219.
De Vivo, I., J. Prescott, V. W. Setiawan et al. 2014. Genome-wide association study of endometrial cancer in E2C2. Hum. Genet. 133:211–224.
Excoffier, L., and H. E. Lischer. 2010. Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. 10:564–567.
Falush, D., M. Stephens, and J. K. Pritchard. 2003. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164:1,567–1,587.
Henry, K. A., K. McDonald, R. Sherman et al. 2014. Association between individual and geographic factors and nonadherence to mammography screening guidelines. J. Womens Health 23:664–674.
Hines, R. B., and T. W. Markossian. 2012. Differences in late-stage diagnosis, treatment, and colorectal cancer-related death between rural and urban African Americans and whites in Georgia. J. Rural Health 28:296–305.
International HapMap Consortium. 2003. The International HapMap Project. Nature 426:789–796.
Jenkins, W. D., A. E. Lipka, A. J. Fogleman et al. 2016. Variance in disease risk: Rural populations and genetic diversity. Genome 59:519–525.
Lipka, A. E., C. B. Kandianis, M. E. Hudson et al. 2015. From association to prediction: Statistical methods for the dissection and selection of complex traits in plants. Curr. Opin. Plant Biol. 24:110–118.
Lischer, H. E., and L. Excoffier. 2012. PGDSpider: An automated data conversion tool for connecting population genetics and genomics programs. Bioinformatics 28:298–299.
Machiela, M. J., W. Zhou, E. Karlins et al. 2016. Female chromosome X mosaicism is age-related and preferentially affects the inactivated X chromosome. Nat. Commun. 7:1–9.
Mayba, O., F. Gnad, M. Peyton et al. 2014. Integrative analysis of two cell lines derived from a non-small-lung cancer patient—a panomics approach. Pac. Symp. Biocomput. 75–86.
Nei, M. 1987. Molecular Evolutionary Genetics. New York: Columbia University Press.
Nelson, M. R., K. Bryc, K. S. King et al. 2008. The Population Reference Sample, POPRES: A resource for population, disease, and pharmacological genetics research. Am. J. Hum. Genet. 83:347–358.
Peltonen, L., A. Palotie, and K. Lange. 2000. Use of population isolates for mapping complex traits. Nat. Rev. Genet. 1:182–190.
Portas, L., F. Murgia, G. Biino et al. 2010. History, geography and population structure influence the distribution and heritability of blood and anthropometric quantitative traits in nine Sardinian genetic isolates. Genet. Res. (Camb.) 92:199–208.
Price, A. L., N. J. Patterson, R. M. Plenge et al. 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38:904–909.
Pritchard, J. K., M. Stephens, N. A. Rosenberg et al. 2000. Association mapping in structured populations. Am. J. Hum. Genet. 67:170–181.
Purcell, S., B. Neale, K. Todd-Brown et al. 2007. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81:559–575.
RStudio Team. 2015. RStudio: Integrated Development for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.
Rudan, I. 1999. Inbreeding and cancer incidence in human isolates. Hum. Biol. 71:173–187.
Ryu, D. H., H. T. Yu, S. A. Kim et al. 2018. Is chronic exposure to low-dose organochlorine pesticides a new risk factor of T-cell immunosenescence? Cancer Epidemiol. Biomarkers Prev. 27:1,159–1,167.
Sherry, S. T., M. H. Ward, M. Kholodov et al. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29:308–311.
Steffens, M., C. Lamina, T. Illig et al. 2006. SNP-based analysis of genetic substructure in the German population. Hum. Hered. 62:20–29.
Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460.
Terao, C., H. Yoshifuji, A. Kimura et al. 2013. Two susceptibility loci to Takayasu arteritis reveal a synergistic role of the IL12B and HLA-B regions in a Japanese population. Am. J. Hum. Genet. 93:289–297.
Thomas, D. C., and J. S. Witte. 2002. Point: Population stratification: A problem for case-control studies of candidate-gene associations? Cancer Epidemiol. Biomarkers Prev. 11:505–512.
1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526:68–74.
Vaags, A. K., A. C. Lionel, D. Sato et al. 2012. Rare deletions at the neurexin 3 locus in autism spectrum disorder. Am. J. Hum. Genet. 90:133–141.
Verdu, P., T. J. Pemberton, R. Laurent et al. 2014. Patterns of admixture and population structure in native populations of Northwest North America. PLoS Genet. 10:1–17.
Wacholder, S., N. Rothman, and N. Caporaso. 2000. Population stratification in epidemiologic studies of common genetic variants and cancer: Quantification of bias. J. Natl. Cancer Inst. 92:1,151–1,158.
Wang, S., C. M. Lewis Jr., M. Jakobsson et al. 2007. Genetic variation and population structure in Native Americans. PLoS Genet. 3:2,049–2,067.
Wright, S. 1950. Genetical structure of populations. Nature 166:247–249.
Zheng, X., and B. S. Weir. 2016. Eigenanalysis of SNP data with an identity by descent interpretation. Theor. Popul. Biol. 107:65–76.

Footnotes

1. Program in Ecology, Evolution and Conservation Biology, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.

2. Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.

3. Department of Crop Sciences, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.

4. Department of Population Science and Policy, Southern Illinois University School of Medicine, Springfield, Illinois, USA.

5. Rural and Minority Health Research Center, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA.

6. Epidemiology and Biostatistics, Southern Illinois University School of Medicine, Springfield, Illinois, USA.

7. Department of Anthropology, College of Liberal Arts and Sciences, University of Illinois Urbana–Champaign, Urbana, Illinois, USA.

________

Supplementary Table S1. IsoPop Individuals by Population, and Quality Assessments Made by Qubit Concentration and the Illumina Call Rate

Go to the following link to view the table: https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?filename=0&article=1151&context=humbiol_preprints&type=additional

Supplementary Table S2. Genetic Diversity Statistics for Four IsoPop Populations, Individually and Combined

Supplementary Table S3. Proportions of Common and Rare Markers among IsoPop Individuals and Individuals of European Descent (EUR) from the 1000 Genomes Database

Supplementary Figure S1. Heat map depicting values of the coancestry matrix for all 2,504 individuals from 1000 Genomes database and 157 IsoPop individuals. The actual numerical coancestry values between each pair of individuals are provided in Supplementary Table S1.

Supplementary Figure S2. Scree plot for the eigenanalysis of all 157 IsoPop individuals. The x-axis indicates the index of eigenvalues, and the y-axis indicates the numerical value of each eigenvalue.

Supplementary Figure S3. EIGMIX plot of all 157 IsoPop individuals. The four IsoPop populations, labeled 01, 02, 03, and 04, are represented by different symbols. The x- and y-axes are the value of the first and second eigenvector, respectively.

Supplementary Figure S4. EIGMIX plot of all 157 IsoPop individuals. The four IsoPop populations, labeled 01, 02, 03, and 04, are represented by different colors. The x- and y-axes are the value of the first and third eigenvector, respectively.

Supplementary Figure S5. EIGMIX plot of all 157 IsoPop individuals. The four IsoPop populations, labeled 01, 02, 03, and 04, are represented by different colors. The x- and y-axes are the value of the second and third eigenvector, respectively.

Supplementary Figure S6. Scree plot for the eigenanalysis of all 2,504 individuals from 1000 Genomes database and 157 IsoPop individuals. The x-axis indicates the index of eigenvalues, and the y-axis indicates the numerical value of each eigenvalue.

Supplementary Figure S7. EIGMIX plot of all 2,504 individuals from 1000 Genomes database and 157 IsoPop individuals. The world populations of Africa (AFR), Americas (AMR), East Asia (EAS), South Asia (SAS), and Europe (EUR), as well as the four IsoPop populations, labeled 01, 02, 03, and 04, are represented by different colors. The x- and y-axes are the value of the third and first eigenvector, respectively.

Supplementary Figure S8. EIGMIX plot of all 2,504 individuals from 1000 Genomes database and 157 IsoPop individuals. The world populations of Africa (AFR), Americas (AMR), East Asia (EAS), South Asia (SAS), and Europe (EUR), as well as the four IsoPop populations, labeled 01, 02, 03, and 04, are represented by different colors. The x- and y-axes are the value of the third and second eigenvector, respectively.

Supplementary Figure S9. Scree plot for the eigenanalysis of 503 individuals of European descent from the 1000 Genomes database and 157 IsoPop individuals. The x-axis indicates the index of eigenvalues, and the y-axis indicates the numerical value of each eigenvalue.

Supplementary Figure S10. Eigen plot of 503 individuals of European descent from the 1000 Genomes database and 157 IsoPop individuals using EIGMIX. The x- and y-axes are the value of the first and third eigenvector, respectively. Individuals from each European subpopulation represented in the 1000 Genomes Project, as well as each of the four IsoPop communities, labeled 01, 02, 03, and 04, are colored differently. The European (EUR) populations are plotted with the following abbreviations: Utah residents in CEPH (CEU), Finland (FIN), British in England and Scotland (GBR), Iberian populations in Spain (IBS), and Toscani in Italia (TSI).

Supplementary Figure S11. Eigen plot of 503 individuals of European descent from the 1000 Genomes database and 157 IsoPop individuals using EIGMIX. The x- and y-axes are the value of the second and third eigenvector, respectively. Individuals from each European subpopulation represented in the 1000 Genomes Project, as well as each of the four IsoPop communities, labeled 01, 02, 03, and 04, are colored differently. The European (EUR) populations are plotted with the following abbreviations: Utah residents in CEPH (CEU), Finland (FIN), British in England and Scotland (GBR), Iberian populations in Spain (IBS), and Toscani in Italia (TSI).

Supplementary Figure S12. Empirical density (y-axis) of the differences in minor allele frequencies (MAFs) of 999,259 markers among the 157 IsoPop individuals and the 503 individuals of European descent in the 1000 Genomes database. The mode of this density is centered at 0, suggesting that the overwhelming majority of these markers have similar MAFs in both of these data sets.

Supplementary Figure S13. Linkage disequilibrium decay plots among the 2,504 individuals in the 1000 Genomes database (1KGP), the 503 individuals of European descent (EUR) in the 1000 Genomes database, all 157 individuals of the IsoPop population (ISOPOP), and the four individual IsoPop communities (ISOPOP1–4). For each graph, the y-axis is the squared Pearson correlation coefficient (r²) between marker pairs, and the x-axis depicts the physical distance between markers, from 0 to 1,000 kb (A) and from 0 to 300 kb (B). Note that the linkage disequilibrium decay is higher for the urban population (ISOPOP4) compared to the three rural communities (ISOPOP1–3), the opposite of what would be expected should a founder effect exist in these rural communities.

Additional Information

ISSN: 1534-6617
Print ISSN: 0018-7143
Pages: 31-47
Launched on MUSE: 2020-02-20
Open Access: No