The effect of uneven sampling on PCA projection.
<p>PCA projection of samples taken from a set of nine populations arranged in a lattice, each of which exchanges migrants at rate per generations with each adjoining neighbour, leads to a recovery of the migration-space if samples are of equal size (A), or a distortion of migration-space if populations are not equally represented (B,C). In each part the left-hand panel shows the analytical solution (the area of each point represents the relative sample size) with migration routes illustrated while the right-hand panel shows the result of a simulation with a total sample size of 180 and 10,000 independent SNP loci. All examples are for .</p>
Admixture proportions inferred from PCA projections.
<p>(A) For each of the autosomes (chromosome 1 is the lowest) the points indicate the locations of sampled haplotypes (the transmitted and untransmitted haplotypes inferred from trios) on the first principal component (each chromosome is analysed separately; blue = CEU, orange = YRI, green = ASW). Importantly, PCA is carried out only on the haplotypes from CEU and YRI and all samples are subsequently projected onto the first PC identified from this analysis. Lines connect the transmitted (or untransmitted) haplotypes for each individual across chromosomes. Note the uniformity of the locations of samples on the first PC for CEU and YRI. Individual chromosomes within the ASW, however, show a great range of locations on the first PC. (B) The genome-wide admixture proportions (separately for transmitted and untransmitted chromosomes) can be inferred directly from the location of admixed samples on the first PC between the two source populations. Colours are as for (A). The vertical spacing of points is arbitrary.</p>
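<p>The inference described in (B) can be sketched as a linear interpolation along the first PC. This is a minimal illustration under stated assumptions, not the authors' code: the function name and its arguments are hypothetical, the inputs are assumed to be PC1 coordinates that have already been computed (PCA on the two source populations, with the admixed samples projected afterwards), and each source population is summarised by the mean of its PC1 coordinates.</p>

```python
from statistics import mean


def admixture_proportions(pc1_source_a, pc1_source_b, pc1_admixed):
    """Estimate the fraction of ancestry from source B for each admixed sample.

    All arguments are sequences of PC1 coordinates (hypothetical inputs).
    A sample sitting at the mean of source A gets 0.0, one sitting at the
    mean of source B gets 1.0, and positions in between scale linearly.
    """
    centre_a = mean(pc1_source_a)
    centre_b = mean(pc1_source_b)
    span = centre_b - centre_a
    if span == 0:
        raise ValueError("source populations are not separated on PC1")
    # The sign of a principal component is arbitrary, but it cancels in
    # this ratio, so flipping PC1 leaves the estimates unchanged.
    return [(x - centre_a) / span for x in pc1_admixed]


# Toy coordinates (made up for illustration only).
source_a = [-1.02, -0.98]
source_b = [0.99, 1.01]
admixed = [-0.6, 0.0]
estimates = admixture_proportions(source_a, source_b, admixed)  # roughly [0.2, 0.5]
```

<p>Estimates can fall slightly outside the interval from 0 to 1 because of sampling noise; whether and how to truncate them is left open here.</p>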
Genealogical statistics.
<p>The chart shows a genealogical tree describing the history of a sample of size five. Two samples, and , will share a derived mutation (indicated by the circle) if it occurs on the branch between their most recent common ancestor and the common ancestor of the whole sample. The length of this branch is .</p>
Identification of admixture proportions without source populations.
<p>Initially an admixed population is formed by random mating from two populations, each fixed for a different allele at each locus with 40% contribution from one population. In the simulated population there are 1000 individuals, each of which has 20 chromosomes with 50 markers each, a genetic map length of 1 per chromosome and a uniform recombination rate. Subsequent generations are formed by random mating of the ancestral population. (A) Projections of 100 randomly chosen samples on the first PC over time show a decay in the fraction of variance explained by the first PC (note that the total variance in the population decays little over the time-scale of the simulation). (B) Admixture proportions for the same individuals as in part A (blue points) as well as the average heterozygosity (red line) and the fraction of the variance in PC1 explained by admixture proportions (black line). While there is a strong association between admixture proportion and location on PC1 for the first few generations, after 15 generations recombination has eliminated any signal, even though there is still strong admixture LD between nearby markers (data not shown).</p>
The effect of SNP ascertainment on PCA projection.
<p>(A) In the joint genealogy of the ascertainment (black circles) and genotyped samples (grey circles), only mutations occurring on the intersection of the two genealogies (shown in black) will be detected in both samples. For small discovery panels and large experimental samples, this may be considerably less than half the total genealogy length. (B) Model used to simulate data from three populations linked by two vicariance events, each of which is associated with a bottleneck; the model is an approximation to the demographic history of the HapMap populations <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1000686#pgen.1000686-Schaffner1" target="_blank">[17]</a>,<a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1000686#pgen.1000686-The1" target="_blank">[18]</a>. In the simulations 100 haploid genomes with 10,000 unlinked loci were sampled from each population and the parameters are , , , , where is the bottleneck strength measured as the probability that two lineages entering the bottleneck have coalesced by its end (the bottleneck is instantaneous in real time). All populations have the same effective population size. (C) PCA of the simulated data (small open circles) shows strong agreement with results obtained from analytical consideration of the expected coalescence times (large circles). When only those SNPs that have been discovered in a small panel are considered (here modelled as 4, 8, and 4 additional samples from populations I, II, and III respectively) the principal effect is to scale the locations of the samples on the first two PCs (small filled circles) by a factor of approximately (large diamonds).</p>
Short descriptions of the 1000 Genomes populations.
Algorithm and model for haplotypes.
<p><b>A</b>: Algorithm for detecting haplotypes. For each variant in the sample (green), we scan left and right until we find inconsistent homozygote genotypes (red), record the physical and genetic length of this region (blue), and the number of singletons (purple). <b>B</b>: Model for haplotype age . Consider the 4 chromosomes (grey) of the two individuals sharing an haplotype (blue). We model the total genetic length of the inferred haplotype, , as the sum of the true genetic length and an error . Similarly, we model the number of singletons as the sum of the number on the shared chromosome () and the number on the unshared chromosomes, . We ignore the fact that we overestimate and therefore that some of the singletons might lie in the unshared part of the chromosome.</p>
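<p>The scan in panel A can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the function name and inputs are hypothetical, genotypes are assumed to be coded as counts of the alternative allele (0, 1 or 2) for the two individuals who share the focal variant, a site is treated as inconsistent when one individual is 0 and the other is 2, and the region is assumed to end at the last consistent site on each side. Genetic length and singleton counts are omitted for brevity.</p>

```python
def scan_shared_haplotype(geno_a, geno_b, positions, focal):
    """Scan outwards from a focal variant shared by two individuals.

    geno_a, geno_b: genotype codes (0, 1, 2) per site for each individual.
    positions: physical positions of the sites, in increasing order.
    focal: index of the shared variant where the scan starts.
    Returns (left_index, right_index, physical_length).
    """

    def inconsistent(i):
        # Opposite homozygotes cannot share a haplotype at this site.
        return {geno_a[i], geno_b[i]} == {0, 2}

    left = focal
    while left > 0 and not inconsistent(left - 1):
        left -= 1

    right = focal
    while right < len(positions) - 1 and not inconsistent(right + 1):
        right += 1

    return left, right, positions[right] - positions[left]


# Toy genotypes (made up for illustration only).
geno_a = [0, 1, 1, 0, 2, 0]
geno_b = [2, 1, 0, 1, 1, 2]
positions = [100, 250, 400, 900, 1500, 2100]
region = scan_shared_haplotype(geno_a, geno_b, positions, focal=2)
```

<p>In this toy input the first and last sites are opposite homozygotes, so the scan stops just inside them and reports the span between the sites at 250 and 1500.</p>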
The estimated age distribution of haplotypes.
<p><b>A</b>: The distribution of the MLE of the ages of haplotypes shared within each population. <b>B–F</b>: The distribution of the MLE of the ages of haplotypes shared between one population and all other populations, shown for each of GBR, JPT, LWK, ASW, and PUR. Populations are described in <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004528#pgen-1004528-t001" target="_blank">Table 1</a>. Density estimates are computed in space, using the base R <i>density</i> function with a Gaussian kernel.</p>
Estimating age from simulated data.
<p>We simulated whole genomes for 100 individuals (200 chromosomes), with , and HapMap 2 recombination rates. <b>A</b>: Estimated age against true age. The grey dots are the MLEs for each detected haplotype. The blue line is a quantile-quantile (qq) plot for the MLEs (from the 1<i><sup>st</sup></i> to 99<i><sup>th</sup></i> percentile). <b>B–D</b>: Power to detect haplotypes as a function of <b>B</b>: genetic length, <b>C</b>: physical length and <b>D</b>: haplotype age; in each case the darker line represents the power to detect haplotype with 100% power to detect variants, and the lighter line the power with 66% power.</p>
Patterns of Haplotype Structure and Recombination in the HapMap ENCODE Region on Chromosome 7q31.33
<p>The estimated recombination rate (in centimorgans per megabase) is shown as a dark blue line, with statistically significant recombination hotspots (see [<a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.0010054#pgen-0010054-b15" target="_blank">15</a>] for details) as grey lines. For each analysis panel, each non-redundant haplotype with a frequency of at least 10% is represented by a horizontal line between the starting and ending SNPs (see [<a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.0010054#pgen-0010054-b15" target="_blank">15</a>] for details of methodology); the vertical height of these lines is arbitrary. Note that only one of the six hotspots is sufficiently strong to break all common haplotypes.</p>
- …