Properties for haplotypes frequency distribution for 5 populations
<p>The five combined populations are Native Americans (NAM), European Americans (CAU), Hispanic (HIS), African-Americans (AFA) and Asian Pacific Islanders (API). (A) Histogram of the haplotype relative frequency distribution for the five combined populations: NAM (black circles), CAU (gray triangles), HIS (green diamonds), AFA (purple circles) and API (red circles). (B) Estimated (black) and currently observed (white) total number of haplotypes. (C) Fraction of the population not covered, for each of the five combined populations. (D) Current sample size (white) and estimated sample size required to reach coverage similar to that of the European American population (black). Note that if the fit to a truncated power law were exact, the required and observed population sizes would be equal for the European American (CAU) population. However, the fit deviates slightly: the required population from the theoretical analysis is 7.76E+06, while the observed population size is 7.82E+06 (a difference of less than 1%).</p>
Power Laws for Heavy-Tailed Distributions: Modeling Allele and Haplotype Diversity for the National Marrow Donor Program
<div><p>Measures of allele and haplotype diversity, which are fundamental properties in population genetics, often follow heavy-tailed distributions. These measures are of particular interest in the field of hematopoietic stem cell transplant (HSCT). Donor/recipient suitability for HSCT is determined by Human Leukocyte Antigen (HLA) similarity. Match predictions rely upon a precise description of HLA diversity, yet classical estimates are inaccurate given the heavy-tailed nature of the distribution. This directly affects HSCT matching, as well as diversity measures in broader fields such as species richness. We have therefore developed a power-law-based estimator of allele and haplotype diversity that accommodates heavy tails, using the concepts of regular variation and occupancy distributions. Application of our estimator to 6.59 million donors in the Be The Match Registry revealed that haplotypes follow a heavy-tailed distribution across all ethnicities: for example, 44.65% of the European American haplotypes are represented by only one individual. Indeed, our discovery rate of all U.S. European American haplotypes is estimated at 23.45% based upon sampling 3.97% of the population, leaving a large number of unobserved haplotypes. Population coverage, however, is much higher, at 99.4%, given that 90% of European Americans carry one of the 4.5% most frequent haplotypes. Alleles were found to be less diverse, suggesting that the current registry represents most alleles in the population. Thus, for HSCT registries, haplotype discovery will remain high with continued recruitment to a very deep level of sampling, but gains in population coverage will not. Finally, we compared the convergence of our power-law estimator with that of classical diversity estimators such as the capture-recapture, Chao, ACE and Jackknife methods. When fit to the haplotype data, our estimator displayed favorable properties in terms of convergence (with respect to sampling depth) and accuracy (with respect to diversity estimates). This suggests that power-law-based estimators offer a valid alternative to classical diversity estimators and may have broad applicability in the field of population genetics.</p></div>
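To make the classical baseline concrete, here is a minimal Python sketch of two quantities related to the abstract: the fraction of haplotypes seen in only one individual, and a bias-corrected Chao1 richness estimate. The abstract names "Chao" only generically, so the choice of the bias-corrected Chao1 variant is an assumption, and the function names and the toy label list are hypothetical. This is not the power-law estimator described in the abstract.

```python
from collections import Counter


def singleton_fraction(labels):
    """Fraction of distinct haplotypes observed in exactly one individual."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)


def chao1_bias_corrected(labels):
    """Bias-corrected Chao1 estimate of the total number of distinct types.

    S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)), where f1 and f2 are the numbers
    of types seen exactly once and exactly twice. This is a classical
    estimator chosen for illustration (an assumption), not the authors' method.
    """
    counts = Counter(labels)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return len(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))
```

For the toy sample `["a", "a", "b", "b", "c", "d", "e"]`, three of the five observed types are singletons and two are doubletons, so `singleton_fraction` gives 0.6 and `chao1_bias_corrected` gives 6.0.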
Schematic figure.
<p>(A) We observe a population with different haplotypes. From the population we extract two measures: the haplotype frequency distribution (B) and the number of unique haplotypes as a function of the sample size (C). We assume that the frequency distribution (B) is a scale-free distribution with lower and upper cutoff values, <i>X</i><sub>min</sub> ≤ <i>x</i> ≤ <i>X</i><sub>max</sub>. For simplicity, we assume zero probability of observing haplotypes with frequencies outside this range. We constrain the lower cutoff by the total population size, and the upper cutoff by an upper estimate from the total population size (Eq B6). As an initial guess we use <i>X</i><sub>min</sub> at its boundary, the Clauset estimate for the slope, and the observed highest frequency <i>X</i><sup>0</sup><sub>max</sub>. We then fit the observed unique haplotype curve (C) with an analytical formula for the expected shape (D), using the cost function \(\sum_{R'} \left(\log U(R') - \log \mathrm{observed}(R')\right)^{2} \cdot \log \mathrm{observed}(R')\) over different sample sizes <i>R'</i>. Finally, we extract the optimal parameters (E) and produce an estimate of the number of unique haplotypes for any target population size <i>N</i>.</p>
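The cost function described in this caption can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `fit_cost` and the dictionary-based inputs are hypothetical, and the analytical formula for U(R') is not reproduced here, so the model values are supplied by the caller.

```python
import math


def fit_cost(predicted, observed):
    """Sum over sample sizes R' of (log U(R') - log observed(R'))^2 * log observed(R').

    predicted: dict mapping sample size R' -> model value U(R') (must be > 0)
    observed:  dict mapping sample size R' -> observed unique-haplotype count (must be > 0)
    Both inputs are hypothetical containers chosen for this sketch.
    """
    total = 0.0
    for sample_size, obs in observed.items():
        log_obs = math.log(obs)
        residual = math.log(predicted[sample_size]) - log_obs
        # Squared log-residual, weighted by the log of the observed count
        total += residual ** 2 * log_obs
    return total
```

The cost is zero when the model reproduces the observed curve exactly, and the log weighting gives more influence to sample sizes at which more unique haplotypes were observed.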
Estimation of <i>α</i> and of haplotype number for different sample sizes.
<p><b>A.</b> Convergence of estimates of the power of the distribution, <i>α</i>, to the true value in samples from a simulated scale-free distribution with a lower cutoff. The true value is the black line with 'x' signs. The estimate using our method quickly converges to the proper value, 1.5 (purple line with diamonds). The estimates using either the discrete (red line with squares) or continuous (blue line with circles) Clauset estimates, or the Ohannessian et al. estimate (green line with circles), do not converge, even when more than half the distribution is sampled. A Clauset discrete estimate with a minimal cutoff (orange line with circles) converges almost as well as our algorithm. <b>B.</b> Comparison of the haplotype number estimate as a function of the sample size, using a capture-recapture method (red squares), Jackknife estimators and the parametric estimate proposed here (blue diamonds), using the same simulation as above. The true number is the solid black line with 'x' signs. The parametric estimate developed here converges to a good estimate, even for a very small sample.</p>
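For reference, the continuous Clauset estimate mentioned in panel A is commonly written as the maximum-likelihood estimator α = 1 + n / Σ ln(x_i / x_min), taken over the n observations with x_i ≥ x_min. The Python sketch below implements that standard formula; the function name is hypothetical, and it is assumed (not confirmed by the caption) that this is the exact variant used in the figure. It is not the parametric estimator proposed by the authors.

```python
import math


def clauset_alpha_continuous(values, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / x_min)).

    Only observations with x_i >= x_min enter the estimate.
    """
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need at least one observation strictly above x_min")
    return 1.0 + len(tail) / log_sum
```

Because observations below `x_min` are discarded, the estimate depends on the choice of cutoff, consistent with the caption's note that the discrete Clauset estimate with a minimal cutoff behaves differently from the other Clauset variants.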
Comparisons of methods to evaluate the haplotype number.
<p>We compare our method with that of Bunge and Barger (2008), as implemented in the Catchall software, using a simulated population with a scale-free haplotype frequency distribution with a slope of -1.5. We report the estimated species richness supplied by Catchall for all models for which the software converges to a result. These models include parametric prediction procedures, such as the mixture-of-two-exponentials mixed Poisson and the mixture-of-three-exponentials mixed Poisson, applied to the observed species richness or to its log values, and non-parametric procedures, such as ACE (Abundance-based Coverage Estimator), ACE1 (Abundance-based Coverage Estimator for highly heterogeneous cases) and different versions of the Chao-Bunge gamma-Poisson estimator. Each line represents the estimated number of haplotypes using one of the Catchall methods. The black line represents the true simulated species number. The purple line with diamonds represents our estimates. Our estimates converge much faster, and to a value much closer to the true haplotype number, than any of the methods in Catchall.</p>
Allele results.
<p>(Left) Known (colored) vs. total (transparent) alleles for the five combined populations. The fraction of known alleles is between 50% and 100%. (Right) Fraction of the population not covered, per allele and combined population. These fractions are much lower than for haplotypes and never exceed 0.1%.</p>
The values of <i>α</i> obtained from the 21 detailed US populations studied here.
<p>The second column is the number of samples used for each population. The following columns are <i>α</i> estimates, based on the Clauset estimator (third column) and on our parametric method, using half the sample or the full sample (last two columns).</p>
Analysis of the five combined populations (African-Americans, Asian and Pacific Islanders, European Americans, Hispanic and Native Americans).
<p>The first row is the total size of the sub-population in the census, the second row is the estimate of <i>α</i> from <i>U</i>(<i>R</i>), and the third row is the sample size. The following rows are the fraction of the population covered, the current number of known haplotypes, the estimate of the maximal haplotype number, and the estimate of the observed haplotype number according to the <i>U</i>(<i>R</i>) estimation. The eighth row is the estimated number of haplotypes for the entire census population. The last row is the population size required to reach coverage similar to that of the European American population.</p>
MOESM9 of Impact of the shedding level on transmission of persistent infections in Mycobacterium avium subspecies paratuberculosis (MAP)
Additional file 9. Contribution results of each infectivity term, with the model containing different parameters for each farm separately. (In the main text, data appear for ML "Y1+Y2" in all figures.) Contribution of each term in the models to the average infectivity in each farm, when optimization was done separately on each farm. The first term (α) is infection by free (externally sourced) bacteria. The second term (β) is cow-to-cow infection and the last term (δ) is a constant source. In the "only Y1" model, a cow is regarded as "infectious" from its first positive sample until the last positive sample. In the "Y1+Y2" model, a cow is regarded as "infectious" from its first sample until its death. In the "H+Y1+Y2" model, a cow is regarded as "infectious" from its birth to its death (and also if there is a positive ELISA/tissue sample).