9 research outputs found

Comparisons of methods to evaluate the haplotype number.

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

We compare our estimates with the Bunge and Barger (2008) method as implemented in the Catchall software, using a simulated population whose haplotype frequency distribution is scale free with a slope of -1.5. We report the species richness estimated by Catchall for all models for which the software converges to a result. These models include parametric procedures, such as the mixture-of-two-exponentials mixed Poisson and the mixture-of-three-exponentials mixed Poisson, applied to the observed species richness or to its log values, and non-parametric procedures, such as ACE (Abundance-based Coverage Estimator), ACE1 (Abundance-based Coverage Estimator for highly heterogeneous cases) and different versions of the Chao-Bunge gamma-Poisson estimator. Each line represents the number of haplotypes estimated by one of the Catchall methods. The black line represents the true simulated species number. The purple line with diamonds represents our estimates. Our estimates converge much faster, and to a much more precise estimate of the true haplotype number, than any of the methods provided by Catchall.

The Francis Crick Institute

Power Laws for Heavy-Tailed Distributions: Modeling Allele and Haplotype Diversity for the National Marrow Donor Program

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)
Publication date: 01/04/2015

Measures of allele and haplotype diversity, which are fundamental properties in population genetics, often follow heavy-tailed distributions. These measures are of particular interest in the field of hematopoietic stem cell transplant (HSCT). Donor/recipient suitability for HSCT is determined by Human Leukocyte Antigen (HLA) similarity. Match predictions rely upon a precise description of HLA diversity, yet classical estimates are inaccurate given the heavy-tailed nature of the distribution. This directly affects HSCT matching and diversity measures in broader fields such as species richness. We, therefore, have developed a power-law based estimator to measure allele and haplotype diversity that accommodates heavy tails using the concepts of regular variation and occupancy distributions. Application of our estimator to 6.59 million donors in the Be The Match Registry revealed that haplotypes follow a heavy-tailed distribution across all ethnicities: for example, 44.65% of the European American haplotypes are represented by only 1 individual. Indeed, our discovery rate of all U.S. European American haplotypes is estimated at 23.45% based upon sampling 3.97% of the population, leaving a large number of unobserved haplotypes. Population coverage, however, is much higher at 99.4% given that 90% of European Americans carry one of the 4.5% most frequent haplotypes. Alleles were found to be less diverse, suggesting the current registry represents most alleles in the population. Thus, for HSCT registries, haplotype discovery will remain high with continued recruitment to a very deep level of sampling, but population coverage will not. Finally, we compared the convergence of our power-law versus classical diversity estimators such as capture-recapture, Chao, ACE and Jackknife methods. When fit to the haplotype data, our estimator displayed favorable properties in terms of convergence (with respect to sampling depth) and accuracy (with respect to diversity estimates).
This suggests that power-law based estimators offer a valid alternative to classical diversity estimators and may have broad applicability in the field of population genetics.
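The abstract contrasts the power-law estimator with classical estimators such as Chao. As a rough illustration of what a classical estimator works from, the sketch below computes the bias-corrected Chao1 richness estimate and the Good-Turing sample coverage from a list of haplotype labels. It does not reproduce the paper's power-law estimator; the function names and the toy sample are hypothetical and chosen only for this example.

```python
from collections import Counter


def frequency_counts(haplotypes):
    """Return (S_obs, f1, f2, n) for a list of observed haplotype labels.

    S_obs: number of distinct haplotypes observed
    f1:    number of haplotypes seen exactly once (singletons)
    f2:    number of haplotypes seen exactly twice (doubletons)
    n:     total number of observations
    """
    abundances = Counter(haplotypes)
    freq_of_freq = Counter(abundances.values())
    return len(abundances), freq_of_freq[1], freq_of_freq[2], sum(abundances.values())


def chao1(haplotypes):
    """Bias-corrected Chao1: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))."""
    s_obs, f1, f2, _ = frequency_counts(haplotypes)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def good_turing_coverage(haplotypes):
    """Good-Turing estimate of sample coverage: 1 - f1 / n."""
    _, f1, _, n = frequency_counts(haplotypes)
    return 1 - f1 / n


# Toy sample (hypothetical labels): 5 distinct haplotypes, 2 singletons, 2 doubletons.
sample = ["h1", "h1", "h1", "h2", "h2", "h3", "h4", "h5", "h5"]
richness_estimate = chao1(sample)
coverage_estimate = good_turing_coverage(sample)
```

Both quantities depend heavily on the singleton count f1, which is one way to see why a distribution in which a large share of haplotypes is observed only once is difficult for such estimators.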

Directory of Open Access Journals

PubMed Central

The Francis Crick Institute

Properties for haplotypes frequency distribution for 5 populations

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

The five combined populations are Native Americans (NAM), European Americans (CAU), Hispanic (HIS), African-Americans (AFA) and Asian Pacific Islanders (API). (A) Histogram of the haplotype relative frequency distribution for the five combined populations: Native Americans (NAM, black circles), European Americans (CAU, gray triangles), Hispanic (HIS, green diamonds), African-Americans (AFA, purple circles) and Asian Pacific Islanders (API, red circles). (B) Estimated (black) and currently observed (white) total number of haplotypes. (C) Fraction of non-covered population for the five combined populations. (D) Current sample size (white) and estimated sample size required to get coverage similar to that of the European American population (black) for the five combined populations. Note that if the fit to a truncated power law were exact, the required and observed population sizes would be exactly equal for the European American (CAU) population. However, the fit deviates slightly: the required population size from the theoretical analysis is 7.76E+06, while the observed population size is 7.82E+06 (a difference of less than 1%).

The Francis Crick Institute

List of symbols used.

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

List of symbols used.

The Francis Crick Institute

Estimation of α and of haplotype number for different sample sizes.

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

A. Convergence of estimates of the power-law exponent α to its true value in samples from a simulated scale-free distribution with a lower cutoff. The true value is the black line with the 'x' signs. The estimate using our method (purple line with diamonds) quickly converges to the proper value of 1.5. The estimates using either the discrete (red line with squares) or continuous (blue line with circles) Clauset estimates, or the Ohannessian et al. estimate (green line with circles), do not converge, even when more than half the distribution is sampled. A Clauset discrete estimate with a minimal cutoff (orange line with circles) converges almost as well as our algorithm. B. Comparison of the haplotype number estimate as a function of the sample size, using a capture-recapture method (red squares), Jackknife estimators and the parametric estimate proposed here (blue diamonds), using the same simulation as above. The true number is a solid black line with 'x' signs. The parametric estimate developed here clearly converges to a good estimate, even for a very small sample.
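For orientation, the continuous Clauset estimate mentioned in the caption is commonly written as the maximum-likelihood estimator α̂ = 1 + n / Σ ln(x_i / x_min). The sketch below is a minimal illustration under the assumption that α denotes the exponent of a continuous density p(x) ∝ x^(-α) for x ≥ x_min; the caption does not state this convention, and the function names, seed and sample size are hypothetical. It draws independent values from an ideal power law, which is an easier setting than the partially sampled haplotype frequencies in the figure, so it does not reproduce the non-convergence reported there.

```python
import math
import random


def clauset_continuous_alpha(values, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / x_min)) over x_i >= x_min."""
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all observations equal x_min; alpha is undefined")
    return 1.0 + len(tail) / log_sum


def sample_power_law(alpha, x_min, size, rng):
    """Inverse-transform sampling from p(x) proportional to x**(-alpha), x >= x_min, alpha > 1."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(size)]


# Assumed toy setup: alpha = 1.5, x_min = 1, 100,000 independent draws.
rng = random.Random(0)
draws = sample_power_law(1.5, 1.0, 100_000, rng)
alpha_hat = clauset_continuous_alpha(draws, 1.0)
```

With independent draws the standard error of this estimator is roughly (α - 1) / √n, so the estimate here should land very close to 1.5.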

The Francis Crick Institute

Schematic figure.

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

(A) We observe a population with different haplotypes. From the population we extract two measures: the haplotype frequency distribution (B) and the number of unique haplotypes as a function of the sample size (C). We assume that the frequency distribution (B) is scale free with lower and upper cutoff values, $X_{\min} \le x \le X_{\max}$. For simplicity, we assume zero probability of observing haplotypes with frequencies outside this range. The lower cutoff is bounded by the total population size, and the upper cutoff is bounded by an upper estimate derived from the total population size (Eq B6). As initial guesses we use $X_{\min}$ at its boundary, the Clauset estimate for the slope, and the observed highest frequency $X^{0}_{\max}$. We then fit the observed unique haplotype curve (C) with an analytical formula for the expected shape (D), using the cost function $\sum_{R'} \left(\log U(R') - \log \mathrm{observed}(R')\right)^{2} \cdot \log \mathrm{observed}(R')$ over different sample sizes $R'$. Finally, we extract the optimal parameters (E) and produce an estimate of the number of unique haplotypes for any target population size $N$.
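The fitting step in the caption can be sketched as follows, reading the cost function as the sum over sample sizes R' of (log U(R') - log observed(R'))² · log observed(R'). The paper's analytical formula for U(R) is not given in the caption, so this sketch substitutes the generic occupancy expectation E[U(R)] = Σ_i (1 - (1 - p_i)^R) for sampling with replacement, and a toy rank-frequency model p_i ∝ i^(-s). Both are assumptions made for illustration, as are all function names, the candidate grid and the number of types.

```python
import math


def expected_unique(probabilities, sample_size):
    """Expected number of distinct types in `sample_size` draws with replacement."""
    return sum(1.0 - (1.0 - p) ** sample_size for p in probabilities)


def weighted_log_cost(predicted, observed):
    """Sum of squared log differences, each weighted by log(observed).

    The weights are positive only when every observed count exceeds 1.
    """
    return sum(
        (math.log(u) - math.log(o)) ** 2 * math.log(o)
        for u, o in zip(predicted, observed)
    )


def zipf_probabilities(exponent, n_types):
    """Hypothetical toy model: p_i proportional to rank**(-exponent)."""
    weights = [rank ** -exponent for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def unique_curve(exponent, n_types, sample_sizes):
    """Expected unique-type counts at each sample size for the toy model."""
    probs = zipf_probabilities(exponent, n_types)
    return [expected_unique(probs, r) for r in sample_sizes]


# Synthetic "observed" curve generated from the toy model with exponent 1.0.
sample_sizes = [10, 100, 1000]
observed = unique_curve(1.0, 500, sample_sizes)

# Grid search over candidate exponents; the lowest cost should recover 1.0.
candidates = [0.5, 1.0, 1.5]
costs = {s: weighted_log_cost(unique_curve(s, 500, sample_sizes), observed) for s in candidates}
best = min(costs, key=costs.get)
```

A grid search stands in here for whatever optimizer the authors used; weighting by log observed(R') gives larger sample sizes more influence on the fit.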

The Francis Crick Institute

Alleles results.

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

(Left) Known (colored) vs. total (transparent) alleles for the five combined populations. The fraction of known alleles is between 50% and 100%. (Right) Fraction of uncovered population per allele and combined population. These fractions are much lower than for haplotypes and never exceed 0.1%.

The Francis Crick Institute

The values of α obtained from the 21 detailed US populations studied here.

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

The second column is the number of samples used for these populations. The following columns are α estimates, based on the Clauset estimator (third column) and on our parametric method, using half the sample or the full sample (last two columns).

The Francis Crick Institute

Analysis of the five combined populations (African-Americans, Asian and Pacific Islanders, European Americans, Hispanic and Native Americans).

Author: Ansu Chatterjee (728358)
Loren Gragert (139115)
Mark Albrecht (728359)
Martin Maiers (122566)
Noa Slater (728357)
Yoram Louzoun (201685)

The first row is the total size of the sub-population in the census, the second row is the estimate of α from U(R), and the third row is the sample size. The following rows are the fraction of the population covered, the number of currently known haplotypes, the estimate of the maximal haplotype number, and the estimate of the observed haplotype number according to the U(R) estimation. The eighth row is the estimated number of haplotypes for the entire census population. The last row is the population size required to get coverage similar to that of the European American population.

The Francis Crick Institute
