Principal component analysis of two populations.
<p>(A) Consider a sample of individuals from population A (indicated by the red circle) and from population B (indicated by the blue circle), where the two populations have the same effective population size of and are both derived from a single ancestral population, also of size , with the split happening a time in the past. (B) The expected locations of these two sets of samples on the first PC are defined by the time since divergence (the Euclidean distance between the samples is ) (see text for definitions) and the relative sample size from the populations, with the larger sample lying closer to the origin. Defining , the relative locations of the two populations on the first PC are for samples from population A and for samples from population B (note that the sign is arbitrary). (C) To investigate the effect of finite genome size, simulations were carried out for the model shown in part A with 80 genomes sampled from population A, 20 from population B and a split time of 0.02 generations () and between and SNPs. Lines indicate the analytical expectation. A jitter has been added to the x-axis for clarity. Note that the separation of samples with 10 SNPs does not correlate with population and simply reflects random clustering arising from the small number of SNPs.</p>
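The claim in part (B) that the larger sample lies closer to the origin follows from mean-centring: the sample-size-weighted positions of the two groups on any PC must sum to zero. The sketch below is a minimal illustration, not the simulation used for the figure. The sample sizes (80 and 20) come from the caption; the allele-frequency model, the seed, and the name pc1_positions are assumptions made only for this example.

```python
import numpy as np

def pc1_positions(genotypes):
    """Project mean-centred samples (rows) onto the first principal component."""
    centred = genotypes - genotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[0]

rng = np.random.default_rng(0)
n_a, n_b, n_snps = 80, 20, 10_000  # sample sizes as in the caption

# Assumed toy model: population B's allele frequencies drift away from A's.
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, n_snps), 0.01, 0.99)

# Haploid genotypes coded 0/1, one row per sampled genome.
haps_a = (rng.random((n_a, n_snps)) < freq_a).astype(float)
haps_b = (rng.random((n_b, n_snps)) < freq_b).astype(float)

pos = pc1_positions(np.vstack([haps_a, haps_b]))
mean_a, mean_b = pos[:n_a].mean(), pos[n_a:].mean()

# The sign of a PC is arbitrary, so compare magnitudes only.
print(f"|mean A| = {abs(mean_a):.3f}, |mean B| = {abs(mean_b):.3f}")
```

Because the weighted positions cancel, the ratio of the two magnitudes equals the inverse ratio of the sample sizes, so the 80-genome sample sits four times closer to the origin than the 20-genome sample.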
The effect of uneven sampling on PCA projection.
<p>PCA projection of samples taken from a set of nine populations arranged in a lattice, each of which exchanges migrants at rate per generations with each adjoining neighbour, leads to a recovery of the migration-space if samples are of equal size (A), or a distortion of migration-space if populations are not equally represented (B,C). In each part the left-hand panel shows the analytical solution (the area of each point represents the relative sample size) with migration routes illustrated while the right-hand panel shows the result of a simulation with a total sample size of 180 and 10,000 independent SNP loci. All examples are for .</p>
Admixture proportions inferred from PCA projections.
<p>(A) For each of the autosomes (chromosome 1 is the lowest) the points indicate the locations of sampled haplotypes (the transmitted and untransmitted haplotypes inferred from trios) on the first principal component (each chromosome is analysed separately; blue = CEU, orange = YRI, green = ASW). Importantly, PCA is carried out only on the haplotypes from CEU and YRI and all samples are subsequently projected onto the first PC identified from this analysis. Lines connect the transmitted (or untransmitted) haplotypes for each individual across chromosomes. Note the uniformity of the locations of samples on the first PC for CEU and YRI. Individual chromosomes within the ASW, however, show a great range of locations on the first PC. (B) The genome-wide admixture proportions (separately for transmitted and untransmitted chromosomes) can be inferred directly from the location of admixed samples on the first PC between the two source populations. Colours are as for (A). The vertical spacing of points is arbitrary.</p>
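The projection idea in part (B) can be sketched as follows: run PCA on the two source panels only, project the admixed haplotypes onto that first PC, and read the ancestry fraction as the relative position between the two source means. This is a hedged illustration rather than the authors' pipeline. The function name admixture_from_pc1 and the fixed-difference toy data are assumptions and do not represent the CEU, YRI or ASW haplotypes.

```python
import numpy as np

def admixture_from_pc1(source_a, source_b, admixed):
    """Fraction of ancestry from source B, read off the first PC.

    PCA is fitted to the two source panels only; admixed haplotypes
    are projected afterwards, as described in the caption.
    """
    sources = np.vstack([source_a, source_b]).astype(float)
    mean = sources.mean(axis=0)
    _, _, vt = np.linalg.svd(sources - mean, full_matrices=False)
    pc1 = vt[0]
    pos_a = ((source_a - mean) @ pc1).mean()
    pos_b = ((source_b - mean) @ pc1).mean()
    pos_x = (admixed - mean) @ pc1
    # The arbitrary sign of pc1 cancels in this ratio.
    return (pos_x - pos_a) / (pos_b - pos_a)

# Assumed toy data: sources fixed for different alleles at 10 loci,
# and one admixed haplotype carrying the B allele at 4 of them.
source_a = np.zeros((5, 10))
source_b = np.ones((5, 10))
admixed = np.array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])

estimate = admixture_from_pc1(source_a, source_b, admixed)
print(estimate)
```

With fixed differences the estimate is simply the share of loci carrying the source B allele. With real, partially differentiated sources the same ratio gives a noisier estimate.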
Identification of admixture proportions without source populations.
<p>Initially an admixed population is formed by random mating from two populations, each fixed for a different allele at each locus, with a 40% contribution from one population. In the simulated population there are 1000 individuals, each of which has 20 chromosomes with 50 markers each, a genetic map length of 1 per chromosome and a uniform recombination rate. Subsequent generations are formed by random mating of the ancestral population. (A) Projections of 100 randomly chosen samples on the first PC over time show a decay in the fraction of variance explained by the first PC (note that the total variance in the population decays little over the time-scale of the simulation). (B) Admixture proportions for the same individuals as in part A (blue points) as well as the average heterozygosity (red line) and the fraction of the variance in PC1 explained by admixture proportions (black line). While there is a strong association between admixture proportion and location on PC1 for the first few generations, after 15 generations recombination has eliminated any signal, even though there is still strong admixture LD between nearby markers (data not shown).</p>
The effect of SNP ascertainment on PCA projection.
<p>(A) In the joint genealogy of the ascertainment (black circles) and genotyped samples (grey circles), only mutations occurring on the intersection of the two genealogies (shown in black) will be detected in both samples. For small discovery panels and large experimental samples, this may be considerably less than half the total genealogy length. (B) Model used to simulate data from three populations linked by two vicariance events, each of which is associated with a bottleneck; the model is an approximation to the demographic history of the HapMap populations <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1000686#pgen.1000686-Schaffner1" target="_blank">[17]</a>,<a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1000686#pgen.1000686-The1" target="_blank">[18]</a>. In the simulations 100 haploid genomes with 10,000 unlinked loci were sampled from each population and the parameters are , , , , where is the bottleneck strength measured as the probability that two lineages entering the bottleneck have coalesced by its end (the bottleneck is instantaneous in real time). All populations have the same effective population size. (C) PCA of the simulated data (small open circles) shows strong agreement with results obtained from analytical consideration of the expected coalescence times (large circles). When only those SNPs that have been discovered in a small panel are considered (here modelled as 4, 8, and 4 additional samples from populations I, II, and III respectively) the principal effect is to scale the locations of the samples on the first two PCs (small filled circles) by a factor of approximately (large diamonds).</p>
Genealogical statistics.
<p>The chart shows a genealogical tree describing the history of a sample of size five. Two samples, and , will share a derived mutation (indicated by the circle) if it occurs on the branch between their most recent common ancestor and the common ancestor of the whole sample. The length of this branch is .</p>
Comparison with MSMC, and the effect of estimating haplotypes with sequence data.
<p><b>A</b>: The age distribution of haplotypes shared between CHB and CEU estimated with array, sequence and “clean” sequence (with indels and low complexity regions removed; <b><a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004528#s4" target="_blank">Methods</a></b>). Coloured dashed lines show the medians of each distribution. The grey stepped line shows relative cross-population coalescence rates estimated by MSMC (S. Schiffels, personal communication), and the grey dashed line shows the earliest date in the oldest time interval where this rate is less than 0.5. In both cases, we assume 30 years per generation and . <b>B</b>: As in <b>A</b> but for haplotypes shared between CHB and MXL, restricted to haplotypes where the MXL individual is inferred to be homozygous for Native American ancestry. <b>C–D</b>: Age distributions inferred using “clean” sequence data, comparable to <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004528#pgen-1004528-g003" target="_blank">Figure 3A–B</a> (note the extended x-axis).</p>
The ages of haplotypes around variants with different functional annotations.
<p>Density is indicated by the width of the shape, and horizontal bars show the median. We show separately the densities for variants shared within a population (left, blue), and variants shared between populations (right, red). Numbers in brackets show the number of variants in each class. Bars show the pairwise differences in means, and test p-values for a difference in log means between groups.</p>
Application of GEVA to 3 variants of phenotypic and selective importance.
<p>(A) Estimated TMRCAs for concordant (left) and discordant (right) pairs of chromosomes for the derived T allele at rs182549, which lies within an intron of MCM6 and affects regulation of LCT [33], which encodes lactase. Each bar reflects the approximate 95% credible interval (ETPI) for a pair, ordered by posterior mean (black dots). Data from the TGP (green) [2] and the SGDP (orange) [36] were used. The frequency of the variant in the SGDP, the TGP, and the different population groups in the TGP is shown (top left). The inferred allele age in generations from each data source and the combined estimate are shown (bottom right) and converted to an approximate age in years, assuming 20–30 years per generation. See https://human.genome.dating/snp/rs182549 for additional results. (B) As for panel A for the derived G allele of rs3827760, which encodes the Val370Ala variant in EDAR and is associated with sweat and facial and body morphology [41, 42]; also see https://human.genome.dating/snp/rs3827760. Our filtering approach is to remove the smallest number of concordant and discordant pairs necessary (shown in pink) to obtain concordant and discordant sets with nonoverlapping mean posterior TMRCAs. (C) As for panel A for the derived C allele of rs80194531, which encodes the Asn78Thr substitution in ZEB1, reported as pathogenic for corneal dystrophy [43]; also see https://human.genome.dating/snp/rs80194531. Abbreviations refer to ancestry groups. AFR, African; AMR, American; EAS, East Asian; EDAR, Ectodysplasin A Receptor gene; ETPI, equal-tailed probability interval; EUR, European; GEVA, Genealogical Estimation of Variant Age; LCT, Lactase gene; MCM6, Minichromosome Maintenance Complex Component 6 gene; SAS, South Asian; SGDP, Simons Genome Diversity Project; TGP, 1000 Genomes Project; TMRCA, time to the most recent common ancestor; ZEB1, Zinc finger E-box–binding homeobox 1 gene.</p>
Age-stratified connections between ancestry groups in the publicly available SGDP sample.
<p>The CCF was inferred for all 556 haploid target genomes with all other comparator genomes in the SGDP sample and then aggregated by ancestry group (mean of CCFs from individuals within a population) and across chromosomes, with populations as defined in the SGDP (see legend on the right). (A–D) The ancestry shared between populations is indicated by the CIF over a given time interval (epoch), shown as a matrix with populations sorted from north to south within continental regions. Intensities were computed from aggregated CCFs to summarize relationships between populations; colors indicate intensity scaled per target population (rows) by the maximum over comparator populations. Ancestral connections are shown at different epochs back in time: around 200 generations ago (A), 800 generations (B), 4,000 generations (C), and 20,000 generations (D). The conversion (top right) assumes 20–30 years per generation. A more detailed summary, showing the ancestry shared between individuals over a sliding time window (epoch), is shown in S3 Movie. (E) The maximum CIF for individuals from different ancestry groups (continental regions) expressed as effective population size (Ne) equivalents over time, estimated from CCFs aggregated per diploid individual and summarized by the median and interquartile range per group. Triangles indicate the epochs shown in panels A–D. A further breakdown of Ne equivalents estimated from nonaggregated CCFs per chromosome is shown in S8 Fig. CCF, cumulative coalescent function; CIF, coalescent intensity function; SGDP, Simons Genome Diversity Project.</p>
- …