
    Quantification of HTLV-1 Clonality and TCR Diversity

    <div><p>Estimation of immunological and microbiological diversity is vital to our understanding of infection and the immune response. For instance, what is the diversity of the T cell repertoire? These questions are partially addressed by high-throughput sequencing techniques that enable identification of immunological and microbiological “species” in a sample. Estimators of the number of unseen species are needed to estimate population diversity from sample diversity. Here we test five widely used non-parametric estimators, and develop and validate a novel method, <i>DivE</i>, to estimate species richness and distribution. We used three independent datasets: (i) viral populations from subjects infected with human T-lymphotropic virus type 1; (ii) T cell antigen receptor clonotype repertoires; and (iii) microbial data from infant faecal samples. When applied to datasets with rarefaction curves that did not plateau, existing estimators systematically increased with sample size. In contrast, <i>DivE</i> consistently and accurately estimated diversity for all datasets. We identify conditions that limit the application of <i>DivE</i>. We also show that <i>DivE</i> can be used to accurately estimate the underlying population frequency distribution. We have developed a novel method that is significantly more accurate than commonly used biodiversity estimators in microbiological and immunological populations.</p></div>
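To make the idea of an "unseen species" estimator concrete, here is a minimal sketch of one of the existing estimators named in the figure captions, Chao1bc. It assumes that Chao1bc refers to the standard bias-corrected Chao1 formula, S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of species seen exactly once and exactly twice. The function name and the example counts are illustrative and do not come from the paper.

```python
def chao1_bias_corrected(clone_counts):
    """Bias-corrected Chao1 richness estimate from per-species abundances.

    clone_counts: iterable of counts, one per species observed in the sample.
    """
    counts = [c for c in clone_counts if c > 0]
    s_obs = len(counts)                       # species actually observed
    f1 = sum(1 for c in counts if c == 1)     # singletons
    f2 = sum(1 for c in counts if c == 2)     # doubletons
    # Standard bias-corrected form (assumed here to be what "Chao1bc" denotes).
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample: 9 observed species, 4 singletons, 2 doubletons.
sample = [50, 20, 5, 2, 2, 1, 1, 1, 1]
estimate = chao1_bias_corrected(sample)
```

Because the estimate depends only on the singleton and doubleton counts, it can keep growing as more cells are sequenced, which is the sample-size dependence the abstract describes for rarefaction curves that have not plateaued.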

    Comparison of estimator performance for TCR data.

    <p>*Median absolute percentage error between <i>S<sub>obs</sub></i> and <i>Ŝ<sub>obs</sub></i>.</p><p>†p-value of the significance of the differences between the errors of each estimator and <i>DivE</i> (n = 24; two-tailed binomial test).</p>
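The two quantities in this table note can be sketched as follows. The median absolute percentage error follows directly from the definition above. The binomial test is one plausible reading, not the authors' stated procedure: it assumes the test counts in how many of the n data sets one method had the smaller error and compares that count against a fair-coin null. All function names and example values are hypothetical.

```python
from math import comb
from statistics import median


def median_abs_pct_error(true_values, estimates):
    """Median of |S_obs - S_hat| / S_obs, expressed as a percentage."""
    return median(abs(t - e) / t * 100 for t, e in zip(true_values, estimates))


def binomial_two_tailed(k, n, p=0.5):
    """Exact two-tailed binomial test.

    Sums the probabilities of every outcome that is no more likely
    than the observed count k under the null success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small slack for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Hypothetical richness values for three data sets.
mape = median_abs_pct_error([100, 200, 400], [90, 210, 400])

# If one method were more accurate in all 24 of 24 data sets:
p_value = binomial_two_tailed(24, 24)
```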

    Validation of <i>DivE</i> distribution generation algorithm.

    <p>The <i>DivE</i> distribution generation algorithm (<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003646#pcbi-1003646-g002" target="_blank">Figure 2</a>) was applied to random samples (red dashed) of observed data (black solid). Accuracy was evaluated by comparing the estimated distribution (orange dashed) to the true distribution of the full observed data (black). Examples for the HTLV-1 (<b>A</b>), TCR (<b>B</b>) and microbial (<b>C</b>) datasets are shown.</p>

    Outline of <i>DivE</i> distribution generation algorithm.

    <p><b>A</b> Truncated species frequency distribution with <i>x</i> individuals distributed among <i>y</i> species. The frequency of species <i>S<sub>i</sub></i> after sampling <i>x</i> individuals is denoted <i>F<sub>x</sub>(S<sub>i</sub>)</i>. <b>B</b> Species accumulation data generated from frequency distribution. <b>C</b> An aggregate of the best performing models as returned by <i>DivE</i> is used to extrapolate to point <i>(x+a, y+1)</i>, where the next species is predicted. <b>D</b> Species <i>S<sub>y+1</sub></i> is assigned a frequency of <i>(1 - p<sub>max</sub>)(x+a)</i>, where <i>p<sub>max</sub></i> is the maximum-likelihood proportion of individuals occupied by the <i>y</i> previously observed species. The remaining <i>p<sub>max</sub>(x+a)</i> individuals are distributed among species <i>S<sub>1</sub></i>, …, <i>S<sub>y</sub></i> in proportion to their observed relative frequencies at <i>x</i>. Steps <b>C</b> and <b>D</b> are repeated until the predicted species richness is reached. See <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003646#pcbi.1003646.s012" target="_blank">Text S1</a> for further details.</p>
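Step D of the caption can be illustrated with a short sketch of a single iteration. It takes the extrapolation distance a and the proportion p_max as given inputs; in the caption these come from the aggregated model fit and a maximum-likelihood calculation, neither of which is reproduced here. The function name and the numbers are illustrative only.

```python
def add_next_species(freqs, a, p_max):
    """One iteration of step D: extend y species at x individuals
    to y + 1 species at x + a individuals.

    freqs : frequencies F_x(S_1), ..., F_x(S_y), summing to x
    a     : extra individuals needed before the next species appears (given)
    p_max : proportion of individuals occupied by the y known species (given)
    """
    x = sum(freqs)
    total = x + a
    # The new species S_{y+1} receives (1 - p_max)(x + a) individuals.
    new_species = (1 - p_max) * total
    # The remaining p_max(x + a) individuals keep their relative proportions.
    rescaled = [p_max * total * f / x for f in freqs]
    return rescaled + [new_species]


# Hypothetical input: 3 species among 100 individuals, next species at x + 25.
extended = add_next_species([60, 30, 10], a=25, p_max=0.96)
```

Repeating this step, with a and p_max recomputed each time, grows the distribution one species at a time until the predicted species richness is reached.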

    Performance of <i>DivE</i> frequency distribution generation algorithm.

    <p>*Mean error across all subjects and all small subsamples, for each data source. Small subsamples were defined as those ≤50% of the size of each patient's observed data set. Error was defined as the sum of absolute discrepancies between the true and estimated frequency distributions, divided by the area under the true distribution.</p><p>†Mean percentage error across all subjects and all small subsamples in the Gini coefficients of the true and estimated distributions.</p>
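The two error measures in this note can be sketched as below, under stated assumptions: the distributions are compared rank by rank after sorting by frequency, the "area under the true distribution" is taken to be the sum of the true frequencies, and the Gini coefficient uses the standard rank-weighted formula for a finite sample. The function names and example values are hypothetical.

```python
from itertools import zip_longest


def distribution_error(true_freqs, est_freqs):
    """Sum of absolute rank-wise discrepancies divided by the total true frequency."""
    true_sorted = sorted(true_freqs, reverse=True)
    est_sorted = sorted(est_freqs, reverse=True)
    # Pad the shorter distribution with zeros so every species contributes.
    total_abs = sum(
        abs(t - e) for t, e in zip_longest(true_sorted, est_sorted, fillvalue=0)
    )
    return total_abs / sum(true_sorted)


def gini(freqs):
    """Gini coefficient: 0 for perfectly even frequencies, (n - 1) / n at most."""
    values = sorted(freqs)
    n = len(values)
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * sum(values)) - (n + 1) / n


# Hypothetical true and estimated distributions over three species.
err = distribution_error([50, 30, 20], [45, 35, 20])
```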

    Outline of <i>DivE</i> species richness estimator.

    <p><i>DivE</i> fits many models to rarefaction curves (black) and subsamples thereof (orange). Data is denoted by circles; fits by solid lines. Models are scored according to the following criteria: <b>i) </b><b><i>Discrepancy</i></b> – mean percentage error between data points and model prediction; <b>ii) </b><b><i>Accuracy</i></b> – error between full sample species richness (purple cross) and estimated species richness from subsample; <b>iii) </b><b><i>Similarity</i></b> – area between subsample fit (orange) and full data fit (black); and <b>iv) </b><b><i>Plausibility</i></b> – we require that <i>S'(x) ≥0</i> and <i>S"(x) ≤0</i>. The best performing models are aggregated and extrapolated to estimate species richness. Model A performs poorly as criteria ii) and iii) are not satisfied. Model B performs well as all criteria are satisfied.</p>
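Two of the scoring criteria lend themselves to a short illustration. The sketch below checks plausibility numerically with finite differences on an evenly spaced grid, and computes discrepancy as a mean percentage error. It is an approximation under those assumptions, not the scoring code used by <i>DivE</i>, and the candidate models are invented for the example.

```python
import math


def is_plausible(model, x_min, x_max, n_points=200, tol=1e-9):
    """Check S'(x) >= 0 and S''(x) <= 0 on a grid using finite differences."""
    step = (x_max - x_min) / (n_points - 1)
    ys = [model(x_min + i * step) for i in range(n_points)]
    first = [b - a for a, b in zip(ys, ys[1:])]          # proportional to S'
    second = [b - a for a, b in zip(first, first[1:])]   # proportional to S''
    return all(d >= -tol for d in first) and all(d <= tol for d in second)


def discrepancy(model, xs, ys):
    """Mean percentage error between data points and model predictions."""
    return sum(abs(y - model(x)) / y for x, y in zip(xs, ys)) / len(xs) * 100


def saturating(x):
    """Hypothetical candidate: increasing and concave, so it passes."""
    return 100 * (1 - math.exp(-x / 50))


def accelerating(x):
    """Hypothetical candidate: increasing but convex, so it fails."""
    return x**2


ok = is_plausible(saturating, 1, 500)
bad = is_plausible(accelerating, 1, 500)
```

A species accumulation curve that accelerates would imply that new species become easier to find as sampling continues, which is why the concavity condition is part of the plausibility criterion.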

    Test of species richness estimators at different values of curvature parameter (<i>C<sub>p</sub></i>) using TCR data.

    <p>The curvature parameter <i>C<sub>p</sub></i> is plotted against the relative error (|<i>S<sub>obs</sub></i> - <i>Ŝ<sub>obs</sub></i>| /<i>S<sub>obs</sub></i>) of each estimator. Four patient data sets are shown: <b>A</b> total CD4<sup>+</sup> from patient C; <b>B</b> total CD4<sup>+</sup> from patient E; <b>C</b> total CD8<sup>+</sup> from patient C; <b>D</b> total CD8<sup>+</sup> from patient E. Each point represents an estimate from a subsample of data. Note that the plots have different y-axis scales and that the y-axes in <b>C</b> and <b>D</b> are segmented. Broadly, the accuracy of all estimators improves as <i>C<sub>p</sub></i> increases, and this improvement is more pronounced for <i>DivE</i>. From <i>C<sub>p</sub></i>>0.1, <i>DivE</i> generally outperforms the existing estimators, but it is prone to error at very low values of <i>C<sub>p</sub></i>, when the rarefaction curve implies a near-constant rate of species accumulation.</p>

    Existing estimators underestimate diversity in HTLV-1 infection.

    <p>For HTLV-1 Patient D, three samples are pooled. Rarefaction curves from the pooled sample (black circles) and a subsample (red circles) are shown. Chao1bc, ACE, Bootstrap, Good-Turing and negative exponential estimates (blue, grey, green, black, and orange lines respectively) from the subsample, and <i>DivE</i> estimates (red cross) from the same subsample are plotted. Existing estimators produce a single estimate of diversity, and so their estimates are shown as lines. The diversity in the blood must be at least as great as that observed by pooling the samples. All existing estimators estimate the total diversity to be less than that observed. Given that the observed diversity is likely to be a small fraction of the total diversity this represents a considerable error. We used <i>DivE</i> to produce two estimates: the diversity in the pooled sample (i.e. in 15000 cells, red cross) and the total diversity of the blood. <i>DivE</i> accurately estimates the pooled sample species richness from the subsample, but also predicts higher values of species richness in the blood, consistent with the unseen clones implied by the pooled rarefaction curve. See <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003646#pcbi.1003646.s003" target="_blank">Figure S3</a> for further examples.</p>

    Comparison of estimators: Effect of sample size on estimated diversity.

    <p>Normalized gradients measuring proportional increase in estimated diversity against proportional increase in sample size. Normalized gradients (shown for each estimator and each patient data set in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003646#pcbi.1003646.s008" target="_blank">Table S1</a>) were calculated by linear regression. For the HTLV-1 and microbial data, all estimators except <i>DivE</i> show large normalized gradients that are significantly positive. The TCR normalized gradients, though significantly positive, are small and do not show a substantial bias with sample size. *, **, and *** signify p<0.01, p<0.001, and p<0.0001 respectively; two-tailed binomial test (n = 14, 16, 20 for the HTLV-1, TCR and microbial data respectively).</p>
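A normalized gradient of this kind might be computed as in the sketch below. The caption does not state how the normalization is done, so the choice here is an assumption: sample sizes and estimates are both divided by their values at the largest subsample before an ordinary least-squares slope is taken. The function name and data are hypothetical.

```python
def normalized_gradient(sample_sizes, estimates):
    """Least-squares slope of proportional estimate against proportional sample size.

    Both axes are scaled by their value at the largest sample (an assumed
    normalization), so a slope near 0 means the estimate does not depend on
    sample size and a slope near 1 means it grows in step with it.
    """
    x_ref = max(sample_sizes)
    y_ref = estimates[sample_sizes.index(x_ref)]
    xs = [x / x_ref for x in sample_sizes]
    ys = [y / y_ref for y in estimates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


sizes = [1000, 2000, 3000, 4000]
# An estimator whose output scales with sample size (biased):
biased = normalized_gradient(sizes, [50, 100, 150, 200])
# An estimator whose output is stable across sample sizes:
stable = normalized_gradient(sizes, [200, 200, 200, 200])
```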

    Comparison of species richness estimators.

    <p><b>A–D</b> The Chao1bc (blue), ACE (grey), Bootstrap (green), Good-Turing (black), and negative-exponential estimators (orange) are applied to <i>in silico</i> random subsamples of observed data. Examples for HTLV-1, microbial, and TCR data are shown. Estimates systematically increase with sample size in datasets where rarefaction curves do not plateau (e.g. in <b>I</b>, <b>J</b>, <b>K</b>). Where rarefaction curves do plateau (e.g. in <b>L</b>), estimates are consistent. <b>E–H </b><i>DivE</i> (red) is applied to the same subsamples as the other estimators. Performance of <i>DivE</i> was evaluated by comparing the estimates (<i>Ŝ<sub>obs</sub></i>) to the (known) number of species <i>S<sub>obs</sub></i> in the full observed data (purple line), i.e. error = |<i>S<sub>obs</sub></i> - <i>Ŝ<sub>obs</sub></i>| /<i>S<sub>obs</sub></i>. In all datasets, <i>DivE</i> accurately estimates the species richness of the full observed data from subsamples of that data. <b>I–L</b> Corresponding HTLV-1, microbial and TCR rarefaction curves: arrows denote the size of the subsample to which each estimator was applied.</p>