Human genetic structure in Northwest France provides new insights into West European historical demography

France administrative division overviewThis study relies on a dense sampling of the geographical region of western France, which crosses multiple administrative subdivisions of the territory. Among these subdivisions are the (1) regions – the largest administrative units – which are subdivided into (2) départements. A département is subdivided into (3) arrondissements, which at the smallest scale are divided into (4) town centres. Given the specificity of the administrative system we kept the French name for the département and arrondissement along the manuscript.Cohort descriptionThe PREGO study (“Population de référence du Grand Ouest”, https://umr1087.univ-nantes.fr/prego-biobank) collected DNA from 5707 healthy persons originating from western France (Regions Pays de la Loire and Bretagne). Individuals were recruited during 295 blood drives organised by the French Blood Service (EFS in French) carried out between February 2014 and March 2017, with a mean of 19 donors per blood drive. Blood drives were spatially and temporally sampled in order to obtain a coverage as homogeneous as possible of the nine départements included in the study. Priority was given to blood drives taking place in rural areas. Participants should be native of western France. Individuals’ birthplace was assessed by the place of birth of the four grandparents. Only individuals whose four grandparents were born in western France and preferably within a radius of 30 km were included in this study. From the 3234 individuals included in the present study, 25%, 50% and 75% have their grandparents born 3.25, 6.38 and 12.33 kms from each other, respectively. Venous blood samples (6 ml) were collected from recruited individuals by venipuncture into Vacutainer tubes. Participants filled out a questionnaire providing grandparents’, parents’ and their own birthplaces, residence, age, sex and information about previous participation in the study (of the individual itself or another member of the family). Neither phenotypic nor clinical data was collected in the present study. The PREGO study received approvals from the local ethical committee of Nantes (Comité de Protection des Personnes), the Advisory Committee on Information Processing for Health Research (CCTIRS) and the National Commission on Informatics and Liberty (CNIL). Participants signed a written informed consent for participation in the study, inclusion in bio-resource and personal data processing.The FranceGenRef study aims to describe patterns of population diversity across metropolitan France at the beginning of the 20th century. Thus, individuals were sampled based on the birthplace of their grandparents, whose distance should not exceed 30 kms. FranceGenRef includes 862 individuals satisfying the aforementioned criteria: 354 blood donors from the PREGO cohort (described above) with origin in the départements of Côtes d’Armor (COT), Ille-et-Vilaine (ILL), Morbihan (MOR), Loire-Atlantique (LOI), Maine-et-Loire (MAI), Mayenne (MAY), Sarthe (SAR) and Vendée (VEN); 50 blood donors from Finistère (FIN); 458 individuals from the GAZEL cohort (www.gazel.inserm.fr/en)66,67 originating from five other regions of France: Normandie, Hauts-de-France, Grand East, Centre-Val de Loire and Nouvelle-Aquitaine. The GAZEL study received approvals from the National Commission for Data Processing and Freedoms (CNIL), the National Medical Council and the National Consultative Committee of Ethics. All individuals signed informed consent for genetic studies at the time of enrolment. DNA samples from GAZEL samples were extracted at the CEPH Biobank on an automated system either Autopure (Qiagen) or Chemagic Prime (PerkinElmer) using respectively the salting out method or magnetic beads and were quantified using fluorimetry (Quant-iT DNA Assay kit, Broad Range, Thermo Fisher Scientific).Genotyping, whole-genome sequencing and quality control (QC)SNP array genotype processingUnder the scope of PREGO, genomic DNA was extracted from peripheral blood lymphocytes by standard protocols. Out of 5707 collected samples, 3385 were genotyped on the Axiom TM Precision Medicine Research Array (~920,000 markers, ThermoFisher). Standard QC was performed using SNPolisher software (http://tools.thermofisher.com) and SNPs not passing the QC report were removed according to the manufacturer’s instructions. SNPs with a missing rate >5%, minor allele frequency <10% and not in Hardy-Weinberg Equilibrium (p < 10−6) were excluded from the dataset resulting in a total of 210,171 SNPs. Relatedness was assessed via the PI_HAT statistics, which provides an estimation of the proportion of the genome shared by two pairs of individuals (i.e., identity-by-descent, IBD). PI_HAT statistics was computed using PLINK (vs.1.9). After removing related individuals with a PI_HAT > = 0.08, there were 3234 samples left for analysis. Genotyping was conducted in three batches of 971, 1266, and 997 individuals, respectively. To investigate potential batch effects, we employed linear regression using a batch indicator variable on the first five principal components. We found no significant association at the significance level of 0.05.Whole-genome sequencing and variant callingWhole-genome sequencing of 856 individuals was performed at CNRGH (Evry, France) using their standard workflow (www.cnrgh.fr). All the samples were sequenced at high coverage (average coverage 30x). Read processing was performed with GATK 3.8 following the “best practices” NGS pipeline recommendations of the Broad Institute (https://software.broadinstitute.org/gatk/best-practices). Reads were mapped on the GRCh37 reference genome using bwa-mem, duplicates were removed and reads were then realigned and recalibrated according to GATK best practices. Variant calling was performed with GenotypeGVCF. GVCF files were then merged using GATK CombineGVCF function, recalibrated and annotated with SnpEff68 and the gnomAD database (https://gnomad.broadinstitute.org/). The resulting GVCF files were filtered out in order to only keep SNPs with high mapping quality (MQ > 30). Furthermore, only SNPs in Hardy-Weinberg equilibrium (HWE, p = 10−5) and less than 10% of missing data were kept for analysis. Related samples were identified using PI_HAT statistics computed with PLINK (vs.1.9) and excluded when values were >0.10. Individuals whose grand-parents were not born in the same département were also excluded leaving us with 843 samples. The sizes of the samples used in each analysis are specified below and in every corresponding figure.aDNA Mediaeval samplesWe generated whole-genome data for six ancient individuals from three different archaeological sites in western France, specifically from the region of Pays-de-la-Loire. Samples fra001, fra004, fra008 and fra009 were dated using radiocarbon methods and estimates vary from 375 to 1024 cal. AD (Table S2, Figs. S27–S29). This interval corresponds to the early and High Mediaeval periods. The dates of the other two samples – fra016 and fra017 – are based on the archaeological context69. Out of the six ancient individuals, four (fra001, fra004, fra016, fra017) were sampled south of the river Loire while two were sampled north of the Loire (fra008, fra009, Fig. 5). The archaeological excavations in Saint-Lupien, Rezé (south shore of the Loire) took place between 2005 and 2016 and were led by the team of Mikaël Rouzic under the request of the city of Rezé. The site located in Chaussé Saint-Pierre, Angers (north shore of the Loire), was excavated by the team of Martin Pithon, from the French Institute for Preventive Archaeological Research (INRAP) and the excavations occurred between July and August 2009. Based on the archaeological remains and radiocarbon dates, the site shows evidence of occupation from the beginning of the Roman Empire to modern times. The project was initialised under the request of the city of Angers given the plan for public construction affecting the archaeological site. The archaeological study of the site was authorised by the Regional Division for Cultural Affairs (Délégation Régional des affaires culturelles) and the INRAP. Finally, the excavations in the archaeological site in Chéméré (south shore of the Loire) started in the 60 s but the two individuals sequenced in this study belong to a group of 181 individuals found in the last excavations in 2007. These excavations were led by an INRAP archaeological team under a project of preventive archaeology before construction of a pavilion. For more details on the description of the sites see Supplementary Material Online (Supplementary Archaeological Details).aDNA library preparation and bioinformatic processingOne DNA extraction (following a modified version of the extraction method from70,71,72) and one to two single indexed blunt end libraries73 were prepared for each of the six ancient samples (fra001, fra004, fra008, fra009, fra016 and fra017). All DNA libraries were double-stranded. DNA extractions and library preparations were performed in the dedicated clean aDNA-laboratory at the Uppsala University, Sweden. The DNA libraries were sequenced in two batches and over multiple lanes, first as a pilot run at the SciLife Sequencing Centre in Uppsala, Sweden, using Illumina HiSeq 2500 with paired-end 125 bp chemistry, and later in more depth at the CNRGH (Evry, France) using Illumina HiSeq X and with a paired-end 150 bp chemistry. Raw reads were trimmed with CutAdapt version 2.374 using the parameters: –quality-base 33, –quality-cutoff 15, -e 0.2, –trim-n and –minimum-length 15. Overlapping read pairs were then merged using FLASH version 1.2.1175 and with parameters: –min-overlap 11, –max-overlap 150 and –allow-outies. Merged fastq files were mapped to the human reference genome hs37d5 as single end reads using bwa-aln version 0.7.1776 and parameters -l 16500, -n 0.01 and -o 2, as suggested for ancient DNA77,78.BAM files were merged to a per sample library level using the merge command in Samtools version 1.579 followed by PCR duplicates removal (reads with identical start and end positions were identified and collapsed) using a modified version of FilterUniqSAMPCons_cc.py, which ensures random assignment of bases in a 50/50 case, as described in ref. 80. All reads longer than 35 base pairs and with less than 10% mismatches to the reference were kept, and a final merging step was performed for those samples with two sequenced libraries by merging the processed sample library BAMs to a final sample BAM.Processed sample BAM files were then used to call pseudo-haploid genotypes using Samtools and the option mpileup -R -B -q30 -Q30. In order to merge with available datasets, the pseudo-haploid calling was carried out on the 593,124 autosomal sites on the Human Origins Array (Affymetrix). Contamination was estimated based on the X-chromosome and mitochondrial DNA using ANGSD81 and schmutzi82, respectively. Sample quality, inferred sex and contamination estimates are shown in Table S2.Merging with publicly available datasetsPublicly available datasets of Western EuropeansTo investigate the relationship of modern individuals from the north part of France and other European populations we merged the WGS dataset with three available genome-wide datasets encompassing a large number of Western European samples: (1) the EGAD00000000120 from the The International Multiple Sclerosis Genetics Consortium and the Wellcome Trust Case Control Consortium 2 (referred to hereafter as the MS dataset)43, (2) the EGAD00010000124 from the Genetic Analysis of Psoriasis Consortium & the Wellcome Trust Case Control Consortium 2 (referred to hereafter as PS dataset)42, and (3) the EGAD00010000632 from the Peopling of the British Isles (referred to hereafter as POBI5). In the MS and PS datasets samples were genotyped on the Human670-QuadCustom SNPchip encompassing 580,030 autosomal sites and the datasets include 11,376 and 2622 individuals, respectively. MS dataset includes samples from: Australia, Belgium, Denmark, Germany, Finland, France, Italy, New Zealand, Northern Ireland, Norway, Poland, Spain, Sweden, US and UK, whereas the PS dataset includes samples from UK and Ireland. In the POBI dataset, 2912 individuals from the UK were genotyped on the Human1-2M-DuoCustom SNPchip encompassing 1,115,428 autosomal sites. For the three datasets, original genotype likelihood files (.gen) were converted into plink format files with gtool (vs. 0.7.5). Only individuals and sites passing quality criteria thresholds (provided with the datasets) in the original studies were kept. Genotypes were called using a probability cut-off of 0.90. Sites containing alleles in the negative strand were flipped according to the corresponding strand file https://www.well.ox.ac.uk/~wrayner/strand/ using PLINK vs.1.9. First, we checked whether the alleles were in the illumina TOP configuration, as required to flip based on the strand file. Sites not found in TOP configuration were removed from the datasets. We used the liftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to convert the physical coordinates to hg19 as they were originally in hg18 in the MS, PS and POBI datasets. We merged the samples from Belgium, Denmark, Germany, Finland, France, Italy, Northern Ireland, Norway, Poland, Spain, Sweden and UK with the Irish samples from the PS dataset. In a second step, we merged this dataset with a subset (to keep the sample size computationally tractable) of the POBI dataset comprising two samples from Wales (Dyfed and Gwynedd), the sample from Cornwall and two samples from Norfolk and Kent (eastern UK). Finally, we merged the dataset above with seventy samples from the available Human Genome Diversity Project dataset (https://www.hagsc.org/hgdp/files.html) genotyped on the Illumina 650Y SNPchip array and originated from Sardinia, the Basque Country, Orkney Islands and the Mbuti (to use as an outgroup). Chromosomal coordinates for the HGDP dataset were lifted over as described above. From this merged dataset we filtered out sites not in Hardy-Weinberg Equilibrium (HWE, p = 10−5), removed those falling into the HLA region (chr6:29,691,116–33,054,976) and removed all the multiallelic sites. The final dataset, hereafter referred to as “merged modern dataset”, encompassed 9704 samples and 433,940 SNPs.Human origins array and Viking datasetsWe merged the French WGS dataset with the Human Origins Array (HOA) dataset V42.4 (https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data) encompassing 3589 ancient and 6472 present-day individuals genotyped for 593,124 autosomal SNPs. From the original dataset we extracted European samples (Austria, Belgium, Czech_Republic, Denmark, France, Germany, Great_Britain, Greece, Hungary, Iceland, Ireland, Italy, Luxembourg, Netherlands, Norway, Poland, Portugal, Russia, Spain, Sweden, Switzerland, Turkey) with PASS tag under the “Assessment” field in the annotation file [see references in the Supplementary Discussion]. From this subset we removed samples whose individuals’ ID contain Ignore and with less than 50,000 genotypes called to avoid potential bias. This dataset was then merged with 405 ancient DNA samples belonging to Vikings from the study: “Population Genomics of the Viking world”50 and previously called for HOA set of SNPs. This dataset is referred to hereafter as “merged ancient dataset”. Mbuti samples (HGDP-CEPH, Human Genome Diversity Project-Centre d’Etude du polymorphisme Humain – https://cephb.fr/en/hgdp_panel.php) from the HOA were also extracted to compute f-statistics (see below).ChromoPainter and fineSTRUCTUREPREGO’s SNP-array dataset (post-QC) was phased with SHAPEIT v2.r79083 using the genetic map build 37 provided with the software and no reference panel. Phased genotype files were converted into CHROMOPAINTER46 format using the impute2 chromopainter2.pl script. Switch (global Ne) and emission rates (μ) were estimated with CHROMOPAINTER version 2 using data on chromosomes 1, 4, 10, and 15 from 330 individuals (around 10% of the total sample). The estimated parameters were then used to run CHROMOPAINTER on the full data. A Principal Component Analysis (PCA) on the coancestry matrix (chunkcounts output) was performed in R. Coancestry matrix estimates the proportion of the genome of each individual that is most closely related to every other individual in the matrix. In particular, chunkcounts matrix is based on the number of copied haplotype chunks. We assigned individuals to clusters using the model-based approach implemented in fineSTRUCTURE version 2.1.3. We ran fineSTRUCTURE version 2.1.3 on the coancestry matrix with 10,000,000 burn-in iterations and 1,000,000 MCMC iterations, from which every 10,000th iteration was recorded. We kept the default values for the other options. The tree was built using 100,000 tree comparisons and 10,000,000 additional optimisation steps. MCMC convergence was assessed by comparing the individual assignment to clusters in a second independent chain. Confidence measures of cluster assignment and visualisation of coancestry matrix were performed with the help of the R library FinestructureLibrary provided with the software. The tree provided by fineSTRUCTURE software is based on posterior probabilities of the population configuration (ie. individual partition across clusters), which is highly dependent on the sample sizes and does not reflect differences between pairs of clusters31,84. Therefore, as an alternative we computed a tree based on the total variation distance (TVD) as firstly described by Kerminen and colleagues31. TVD measures the distance between the clusters and it is not affected by differences in sample sizes. TVD values are computed from the copying profiles of all individuals, estimated with CHROMOPAINTER, and the cluster assignment inferred with fineSTRUCTURE as described in ref. 31. It results in a final symmetrical matrix with as many rows and columns as the number of clusters inferred (K), where each entry represents the average closest ancestry contribution in cluster k coming from other clusters. We computed the TVD matrix on the final fineSTRUCTURE partition (k = <154, finest level of FS-tree) and the TVD-tree was obtained using complete linkage hierarchical clustering.We assessed the performance of the two tree building approaches on the PREGO dataset. The performance was measured by comparing cluster assignment confidence as computed in Leslie et al.5. As expected, cluster assignment confidence decreases with increasing number of clusters (Fig. S6). Moreover, we found that for the same k the FS-tree shows lower cluster assignment confidence than the TVD-based tree, confirming the better performance of the latter. To visualise the relationship between the clusters (Fig. 1d), we arbitrarily chose k = 39 because it provides a large enough number of clusters to access fine-scale structure while keeping cluster assignment confidence >90% (Fig. S6). At this level of the tree, clusters containing 1–5 individuals were merged into the closest cluster with >=31 individuals or removed if the closest cluster included itself <4 individuals. This approach resulted in a TVD-based tree of 18 clusters performing better than FS tree, which achieves similar values of cluster assignment confidence for only 12 clusters. Finally, we tested whether inferred clusters capture significant differences in ancestry by following the approach of ref. 5. To do so, we randomly reassigned the individuals to clusters, maintaining the cluster sizes, and computed the p-value of the population configurations produced by the k = 18. The TVD-based tree was less likely than a random distribution of individuals across pairs of clusters. We performed 1000 permutations to obtain p values, which were computed as the number of permutations where random assignment resulted in a higher value of TVD between any pair of clusters and the total number of permutations. For the level of k = 18, p values were <0.001 for all pairs of clusters.Assessing sampling effect on fineSTRUCTURE clusteringWe investigated whether the sampling scheme carried out in this study, which densely selects individuals representing every location over four generations, increases the probability of having recently related individuals. This recent ancestry may affect clustering patterns. To check the effect of recent ancestry on the fineSTRUCTURE clusters we computed a relatedness measure based on identity by state (IBS) within clusters, for k = 3, 18 and 78 (only for those clusters with n > 10 at the finest scale level k = 154), and compared it with the overall distribution. With the exception of k = 154, where six clusters have a median IBS-statistics larger than the 75% of overall IBD distribution, across the other values of k, most of the clusters do not show increased relatedness (Figs. S2–S4) and therefore, we conclude that recent ancestry is not driving the results.Surname analysisThe list of surnames associated with birth registration data and occurring in two periods (period one: 1891–1915 and period two: 1816–1940) were retrieved from the French Institute for Statistics and Economic Studies, the INSEE. Surname lists are available at the town level. To analyse the distribution of surnames we proceed as follows. (1) For each town, we chose the surnames present in the two periods and registered at least four times, ie. given to four newborns. Such an approach removes very rare names often associated with spelling errors in the registration and rare emigrants from other regions of France or elsewhere. (2) summed the number of surname occurrences over the two periods at the arrondissement level. There are 35 arrondissements within the ten selected départments. (3) We computed the Arccos distance (68) between the 35 arrondissements and (4) constructed a consensus tree based on 1000 bootstrapped distance matrices obtained by the Neighbour Joining method. Finally, we built a map linking the arrondissements in a nested way according to the bootstrap values of the NJ tree. The threshold for grouping arrondissements was 90% bootstrap robustness while the threshold for higher order grouping is 85%.Correlation between surname based Arccos distance and Fst was performed using Mantel distance matrix correlation test using Spearman rank correlation. The physical distance between arrondissements was done using partial correlation85. Surname distribution diversity index were entropy and the Barrai index. Let N be the population size in a geographical area and S the number of different surnames and \({p}_{i}\) the probability of Surname i..entropy = \(\left[-{\sum }_{i=1}^{S}{pi} \, * \,{Ln}({pi})\right]\)Effective migration surface (EEMS) analysisWe estimated effective migration surfaces using the software EEMS32. The matrix of average pairwise genetic dissimilarities was generated for 82,362 SNPs (after prunning: –indep-pairwise 50 5 0.2) and 1414 individuals using the bed2diffs software included in the EEMS package. Samples were assigned to the nearest of 300 demes. We run ten independent MCMC chains, each with a random seed, for 10,000,000 iterations, including 9,900,000 burn-in iterations, thinning every 200 iterations. From the chain with the highest final log-likelihood, and we started a second round of ten EEMS chains using this chain as a starting point for 1,000,000 additional sampling iterations, thinning every 9999 iterations. Log- posterior trace of ten replicate MCMC chains from the last round shows mixing and convergence of the independent EEMS runs. Plots were generated in R statistical software using the rEEMSplots package.PCA analysis and Fst computationPCAs was carried out using the smartpca software from the EIGENSOFT package version 6.1.486. FST values were obtained by setting up the option fsthiprecision: YES. Both analyses were performed after pruning for linkage disequilibrium using a sliding window of 50 SNPs in steps of 5 SNPs and keeping those SNPs with r2 < 0.50 (Figs. 3a and S20). The regions previously reported as exhibiting long-range linkage-disequilibrium87 were excluded before performing the PCAs.Runs of homozygosity (ROH), Identity by descent (IBD) and IBDNe estimationsIndividuals’ runs of homozygosity (ROH) and identity by descent (IBD) segments were calculated using RefinedIBD (version from 23rd December 2017, with default settings) based on the PREGO dataset. Individuals’ ROH were computed by summing the total length in ROH segments per individual and averaged across individuals within départments and arrondissements. We computed IBD sharing by counting the number of IBD segments shared between pairs of individuals assigned to the 18 clusters inferred by fineSTRUCTURE. This was done independently for IBD segments of length: 1–2 cM, 2–7 cM and longer than 7 cM.We estimated the trajectories of effective population size with IBDNe33 (version released on 7th May 2018). IBD segments were identified with IBDseq (version r1206, with default settings). As suggested by Browning and Browning 201888, we removed regions with excess of IBD to avoid potential bias, such as the Major Histocompatibility Complex (MHC) on chromosome 6 (chr6:26291527-33464061). We thus split chromosome 6 into two continuous parts. To evaluate the robustness of the effective population size trajectories across different IBD segment sizes, we varied the mincm parameter, which sets up the minimum length of IBD segments used by IBDNe. Value of mincm should be chosen considering SNP density on the SNP array.Rare variation analysisWe investigated allele sharing patterns using doubletons (alleles that are present in only two chromosomes or minor allele count MAC = 2) and variants with MACs between 3 and 10 computed from the WGS dataset encompassing 620 individuals from Northwestern France. We first randomly selected 1 million sites from each dataset and allele sharing matrices were computed by summing all the variants shared between pairs of chromosomes of different individuals from Brittany and Pays-de-la-Loire. We plotted the results as a heatmap and performed hierarchical clustering (hclust function in R with method = ”complete”) to identify clusters of highly correlated allele sharing. The resulting tree was cut at multiple levels (k = 2–10) and the proportion of individuals, within each département, assigned to the alternative clusters was plotted onto the map of Northwestern France. These analyses were performed using the R statistical package89 together with the following libraries: ComplexHeatmap, rgdal, sp, broom, ggplot2 and scatterpie.Supervised admixture analysisThe supervised clustering analysis was performed using ADMIXTURE vs1.3 software90 on the WGS dataset including the 843 French samples used in the PCA. We assumed that modern French originate from three modern source populations: Spain, Germany and Ireland, which lie in the extremes of the distribution of modern French individuals in the PCA (Fig. 3a). Given sample size differences (see section Publicly available datasets of western Europeans), we down-sampled the source populations from Germany and Ireland to 350 individuals. SNPs in linkage disequilibrium were removed by following the recommended practises in the software manual, ie. PLINK option –indep-pairwise 50 10 0.1. Only sites with minimum genotype rates <10% were analysed.Ancestry profiles with GLOBETROTTERTo further investigate the contribution of external populations to the genetic makeup of France we estimated ancestry contributions from their European neighbours using the statistical approach implemented in GLOBETROTTER software46. This method identifies and describes recent (<4.5 ky old) admixture events, by providing admixture times and proportions, explaining the haplotypic ancestry of a target population. Given that GLOBETROTTER infers admixture events from patterns of haplotype sharing it requires two input files: the “copy vectors” and “painting samples” generated by the CHROMOPAINTER. To obtain these files we ran CHROMOPAINTER following the author’s recommendations (section 6.2 “GLOBETROTTER”) on the phased “merged modern dataset”. The “copy vectors” files (*.chunklengths.out) were combined using the scripts provided with the software package. To detect signals of admixture we first ran GLOBETROTTER with the option null.ind: 1 and then computed the number of bootstraps providing a date of admixture >= 1 and <= 400 generations. This step provides a p value that can be interpreted as the evidence of “detectable admixture”. P values were computed using 100 bootstraps. Confidence intervals for the inferred admixture times were estimated by running 100 additional bootstraps and with the option: null.ind: 0. We phased the “merged modern dataset” using SHAPEIT and no reference panel, in a similar manner to what was done with the PREGO dataset. To keep the analyses computationally tractable and due to imbalanced sample sizes, we randomly sampled 350 individuals from countries with large sample sizes such as Italy, Germany, Norway and Belgium. To represent the haplotypic diversity of the UK and the Scandinavian Peninsula we only kept the samples from PoBI and Norway, respectively. A total of 3,598 samples were used to estimate ancestry contributions. Proportions of admixture were plotted with the R statistical package89.Admixture analysis: f
3
-, f
4-statistics and qpAdm estimatesWe computed the outgroup f3-statistics of the form f3(Outgroup; Pop1, Pop2), as implemented in Admixtools47, using the Mbuti population as an outgroup. Pop1 and Pop2 were either a pairwise combination of a modern French and any other European population or a Bell Beaker sample and a present-day western European. The outgroup f3- statistics quantifies the amount of shared history between pairs of populations in relation to an outgroup under a three-population tree. Provided that the outgroup is equally distant from Pop1 and Pop2, the outgroup f3-statistics represents the length of the branch from the outgroup to the internal node if no gene flow occurred between Pop1 and Pop2. To statistically assess the presence of admixture we computed f4-statistics using the same software package. Admixture was also computed in order to test whether modern and ancient Mediaeval French show signals of admixture f3-statistics by setting these populations as the target. Admixture f3-statistics was computed using the option inbreed: YES in the case of the ancient Mediaeval as recommended by the authors of the programme.qpAdm, implemented in the Admixtools, is a computer programme that assesses the goodness of fit of admixture models, involving a user-defined set of commonly known as “left” and “right” populations, and estimates the proportion of admixture within a regression framework. The set of “left” populations encompasses the target population and the possible sources of admixture, whereas the “right” populations include a set of populations distantly related to the “left”. Importantly, the model assumes that the “right” populations are differentially related to the “left” and exchange of genes between “left” and “right” must not have occurred after the event of admixture, i.e., genetic drift shared between “left” and “right” derives from deep shared evolutionary history. The method implemented in qpAdm is built on a matrix of f4-statistics computed across all possible combinations of “left” and “right” populations (f4(target, source; R1, R2), where R1 and R2 is any possible pair of “right” populations). P > 0.05 together with admixture proportions varying between 0 and 1 are indicative of a good fitting model.We performed the qpAdm analyses to estimate the contribution of the three major European migrations52,91 using western hunter-gatherers (WHG), Neolithic early-farmers (EF) and Bronze Age steppe pastoralists (SP) as “left” populations together with the Mediaeval and modern French populations. We grouped ancient individuals into WHG, EF and SP based on the ancestry proportions inferred in Mathieson et al.92. Only individuals with ancestry proportions >0.90 were included into the three groups giving origin to sample sizes of 73, 42 and 17 for the EF, WHG and SP respectively. Admixture proportion estimates were computed using 25 randomly selected modern individuals for the three genetic clusters inferred by fineSTRUCTURE (WBR, EBP, and SLO) and the remaining five regions of France (NOR, HAU, GRA, CEN and NOU). With respect to the French Mediaeval samples we inferred admixture proportions at the population (five Medieval French, the outlier was removed from this computation) and individual level. The set of right populations used in these analyses was the following: Mbuti, Han, AfontovaGora3, ElMiron, Karitiana, Kontenki14, MA1, Mota, Papuan, Ust_Ishim, Vestonice16, Villabruna, GoyetQ116-1, Morocco_Iberomaurusian.In order to test for continuity of the genetic ancestry of Mediaeval and present-day French populations we tested one-way models. In the case of Mediaeval French, the single source is a neighbouring population (no post-Neolithic sample from France is available in the merged dataset) of the same period or of the preceding one. In the case of present-day French, the single source included the Mediaeval French and all the possible sources tested for the Mediaeval French. Furthermore, we also tested two-way models where the genetic ancestry of modern French is assumed to result from a mixture between the Mediaeval French and any other ancient population from the neighbouring regions from Mediaeval to present-day times. Given the larger availability of Iberian samples and the close geographical proximity with France we included in the aforementioned analyses all the Iberian samples from the 5th century BCE to the Mediaeval period (up to the 10th century CE). Only targets and source populations with a sample size larger than two were used in these analyses.The early Neolithic period is characterised by the arrival of the first farmers in Western Europe and the Early Bronze Age is characterised by the arrival of populations related to or carrying a large proportion of their ancestry from pastoralist populations from the Eurasian steppes52,91. Therefore, we added Anatolia_N and Russia_EBA_Yamnaya_Samara to the set of right populations: Mbuti, Han, AfontovaGora3, ElMiron, Karitiana, Kontenki14, MA1, Mota, Papuan, Ust_Ishim, Vestonice16, Villabruna, GoyetQ116-1, Morocco_Iberomaurusian, Anatolia_N and Russia_EBA_Yamnaya_Samara.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles