35 research outputs found

    Stochastic principles governing alternative splicing of RNA

    <div><p>The dominance of the major transcript isoform relative to other isoforms from the same gene generated by alternative splicing (AS) is essential to the maintenance of normal cellular physiology. However, the underlying principles that determine such dominance remain unknown. Here, we analyzed the physical AS process and found that it can be modeled by a stochastic minimization process, which causes the scaled expression levels of all transcript isoforms to follow the same Weibull extreme value distribution. Surprisingly, we also found a simple equation to describe the median frequency of transcript isoforms of different dominance. This two-parameter Weibull model provides the statistical distribution of all isoforms of all transcribed genes, and reveals that previously unexplained observations concerning relative isoform expression derive from these principles.</p></div>

    The frequency distribution of the <i>k</i>th dominant transcript isoform.

    <p>(A) <i>k</i> = 1. (B) <i>k</i> = 2. <i>k</i> is the rank of a transcript isoform. <i>M</i> is the number of transcript isoforms for a gene. Black curves represent the frequency distribution of the experimental RNA-seq data. Red curves represent the frequency distribution of data simulated from the Weibull distribution <i>W</i>(0.39). KLd is the Kullback-Leibler divergence between the two distributions.</p>
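The Kullback-Leibler divergence quoted in this caption compares two frequency distributions bin by bin. The following is a minimal sketch of the discrete form, D(P || Q) = Σ p_i ln(p_i / q_i), and not the authors' own code; the binned frequencies `observed` and `simulated` are hypothetical values chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q), in nats, for two binned distributions.

    Both inputs are normalised to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    divergence = 0.0
    for p_raw, q_raw in zip(p, q):
        p_i, q_i = p_raw / total_p, q_raw / total_q
        if p_i == 0:
            continue  # a bin with p_i = 0 contributes nothing
        if q_i == 0:
            return math.inf  # P has mass where Q has none
        divergence += p_i * math.log(p_i / q_i)
    return divergence


# Hypothetical binned frequencies (not taken from the paper).
observed = [0.10, 0.20, 0.40, 0.30]
simulated = [0.15, 0.25, 0.35, 0.25]
kld = kl_divergence(observed, simulated)
```

A value of 0 means the two binned distributions are identical; larger values indicate a poorer match. The measure is not symmetric, so the order of the arguments matters.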

    A model of alternative splicing.

    <p>(A) Splicing factors U1 and U2AF search for the 5’ GU and 3’ AG splice sites by 3D and 1D Brownian motion. Multiple candidate splice sites compete for the binding of U1 and U2AF. The binding is ATP-independent and reversible. (B) The binding of U1 and U2AF to the splice sites becomes stable only after the ATP-dependent binding of U2 snRNP. The identification of each intron is equivalent to a minimization process in which U1 and U2AF dynamically search for their global or local minimal energy sites on the pre-mRNA segment presented for AS. (C) The scaled expression level of a transcript isoform follows a type III extreme value distribution, that is, a Weibull distribution. The approximate values of parameters <i>a</i> (0.44) and <i>b</i> (0.6) are estimated by curve fitting. The black curve represents the distribution of the scaled expression level from experimental data. The red curve represents the Weibull distribution produced by curve fitting.</p>
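The link between a minimization process and the Weibull distribution can be illustrated with a general property of the distribution: the minimum of n independent Weibull variables with a common shape is again Weibull, with the scale multiplied by n^(-1/shape). The sketch below checks that identity numerically. It is not the authors' fitting procedure; the names `scale` and `shape` and the values used are assumptions for illustration, and the caption does not state how its parameters <i>a</i> and <i>b</i> map onto them.

```python
import math


def weibull_cdf(x, scale, shape):
    """Cumulative distribution function of a two-parameter Weibull distribution."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))


def min_of_n_cdf(x, n, scale, shape):
    """CDF of the minimum of n independent Weibull(scale, shape) draws.

    The minimum exceeds x only if every draw exceeds x.
    """
    survival_one = 1.0 - weibull_cdf(x, scale, shape)
    return 1.0 - survival_one ** n


# Illustrative values only (assumed, not taken from the paper's fit).
scale, shape, n = 1.0, 0.6, 5
rescaled = scale * n ** (-1.0 / shape)  # scale of the minimum

x = 0.05
direct = min_of_n_cdf(x, n, scale, shape)
closed_form = weibull_cdf(x, rescaled, shape)
```

Here `direct` and `closed_form` agree up to floating-point error, which is the closure-under-minimum property that makes the Weibull family a natural limit for minimization processes.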

    Transcript isoform expression pattern of two genes in different conditions.

    <p>(A) BRD4. (B) SRSF7. Among the 11 transcript isoforms of BRD4 and the 12 transcript isoforms of SRSF7, ENST00000371835 and ENST00000409276, respectively, are the most dominant isoforms in all four activated conditions, whereas ENST00000263377 and ENST00000477635, respectively, are the most dominant isoforms in all four resting conditions. This result indicates that the major transcript isoform can be regulated by a single external signal.</p>

    Quantification of HTLV-1 Clonality and TCR Diversity

    <div><p>Estimation of immunological and microbiological diversity is vital to our understanding of infection and the immune response. For instance, what is the diversity of the T cell repertoire? These questions are partially addressed by high-throughput sequencing techniques that enable identification of immunological and microbiological “species” in a sample. Estimators of the number of unseen species are needed to estimate population diversity from sample diversity. Here we test five widely used non-parametric estimators, and develop and validate a novel method, <i>DivE</i>, to estimate species richness and distribution. We used three independent datasets: (i) viral populations from subjects infected with human T-lymphotropic virus type 1; (ii) T cell antigen receptor clonotype repertoires; and (iii) microbial data from infant faecal samples. When applied to datasets with rarefaction curves that did not plateau, existing estimators systematically increased with sample size. In contrast, <i>DivE</i> consistently and accurately estimated diversity for all datasets. We identify conditions that limit the application of <i>DivE</i>. We also show that <i>DivE</i> can be used to accurately estimate the underlying population frequency distribution. We have developed a novel method that is significantly more accurate than commonly used biodiversity estimators in microbiological and immunological populations.</p></div>
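The abstract refers to rarefaction curves that do not plateau. As a rough illustration, a rarefaction curve gives the expected number of distinct species seen when n individuals are drawn without replacement from a sample. The sketch below uses the standard hypergeometric formula; it is not the <i>DivE</i> method itself, and the abundance list `counts` is hypothetical.

```python
from math import comb


def rarefaction(counts, n):
    """Expected number of distinct species among n individuals drawn
    without replacement from a sample with the given species abundances."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the sample size")
    all_draws = comb(total, n)
    # comb(total - c, n) counts the draws that miss a species of abundance c
    # (it is 0 when n exceeds total - c, so the species is seen for certain).
    return sum(1 - comb(total - c, n) / all_draws for c in counts)


# Hypothetical clone abundances: 4 species, 10 individuals.
counts = [5, 3, 1, 1]
curve = [rarefaction(counts, n) for n in range(sum(counts) + 1)]
```

If `curve` is still rising steeply at the full sample size, many species are likely to remain unseen, which is the situation the abstract describes as problematic for existing estimators.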

    Comparison of estimator performance for TCR data.

    <p>*Median absolute percentage error between <i>S<sub>obs</sub></i> and <i>Ŝ<sub>obs</sub></i>.</p><p>†p-value of the significance of the differences between the errors of each estimator and <i>DivE</i> (n = 24; two-tailed binomial test).</p>
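The error measure in the first footnote can be sketched as follows: for each comparison, take the absolute difference between the estimated and observed species richness as a percentage of the observed value, then take the median. This is a minimal illustration with hypothetical numbers, not the paper's data or code.

```python
from statistics import median


def median_absolute_percentage_error(observed, estimated):
    """Median of |estimated - observed| / observed, expressed in percent."""
    if len(observed) != len(estimated):
        raise ValueError("inputs must have the same length")
    errors = [abs(e - o) / o * 100.0 for o, e in zip(observed, estimated)]
    return median(errors)


# Hypothetical observed richness values and their estimates.
s_obs = [100, 200, 400]
s_hat = [110, 180, 400]
mape = median_absolute_percentage_error(s_obs, s_hat)
```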

    Validation of <i>DivE</i> distribution generation algorithm.

    <p>The <i>DivE</i> distribution generation algorithm (<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003646#pcbi-1003646-g002" target="_blank">Figure 2</a>) was applied to random samples (red dashed) of observed data (black solid). Accuracy was evaluated by comparing the estimated distribution (orange dashed) to the true distribution of the full observed data (black). Examples for HTLV-1 <b>A</b>, TCR <b>B</b> and microbial datasets <b>C</b> are shown.</p>

    Existing estimators underestimate diversity in HTLV-1 infection.

    <p>For HTLV-1 Patient D, three samples are pooled. Rarefaction curves from the pooled sample (black circles) and a subsample (red circles) are shown. Chao1bc, ACE, Bootstrap, Good-Turing and negative exponential estimates (blue, grey, green, black, and orange lines respectively) from the subsample, and <i>DivE</i> estimates (red cross) from the same subsample are plotted. Existing estimators produce a single estimate of diversity, and so their estimates are shown as lines. The diversity in the blood must be at least as great as that observed by pooling the samples. All existing estimators estimate the total diversity to be less than that observed. Given that the observed diversity is likely to be a small fraction of the total diversity, this represents a considerable error. We used <i>DivE</i> to produce two estimates: the diversity in the pooled sample (i.e. in 15000 cells, red cross) and the total diversity of the blood. <i>DivE</i> accurately estimates the pooled sample species richness from the subsample, but also predicts higher values of species richness in the blood, consistent with the unseen clones implied by the pooled rarefaction curve. See <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003646#pcbi.1003646.s003" target="_blank">Figure S3</a> for further examples.</p>
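To make concrete how one of the existing estimators produces a single number from a subsample, the sketch below implements the standard bias-corrected Chao1 formula, S_obs + f1(f1 − 1) / (2(f2 + 1)), where f1 and f2 are the numbers of species seen exactly once and twice. It is assumed here that "Chao1bc" in the caption denotes this bias-corrected form; the abundance list is hypothetical and the code is not taken from the paper.

```python
def chao1_bias_corrected(counts):
    """Bias-corrected Chao1 estimate of species richness.

    counts: abundance of each observed species (clone) in the sample.
    """
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)  # f1
    doubletons = sum(1 for c in counts if c == 2)  # f2
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical clone abundances: 7 observed clones, f1 = 3, f2 = 2.
counts = [10, 5, 2, 2, 1, 1, 1]
estimate = chao1_bias_corrected(counts)
```

Because the correction term depends only on the singleton and doubleton counts, the estimate can stay close to the observed richness when sampling is shallow, which is consistent with the underestimation described in the caption.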

    Comparison of estimators: Effect of sample size on estimated diversity.

    <p>Normalized gradients measuring proportional increase in estimated diversity against proportional increase in sample size. Normalized gradients (shown for each estimator and each patient data set in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003646#pcbi.1003646.s008" target="_blank">Table S1</a>) were calculated by linear regression. For the HTLV-1 and microbial data, all estimators except <i>DivE</i> show large normalized gradients that are significantly positive. The TCR normalized gradients, though significantly positive, are small and do not show a substantial bias with sample size. *, **, and *** signify p&lt;0.01, p&lt;0.001, and p&lt;0.0001 respectively; two-tailed binomial test (n = 14, 16, 20 for the HTLV-1, TCR and microbial data respectively).</p>
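One way to read "normalized gradient" is as the least-squares slope of estimated diversity against sample size after both are expressed as proportions of their values at the largest sample. The caption does not spell out the normalization, so that choice is an assumption here, as are the function name and the example numbers.

```python
def normalized_gradient(sample_sizes, estimates):
    """Least-squares slope of proportional estimate against proportional sample size.

    Assumed normalisation: both axes are divided by their values at the
    largest sample size.
    """
    largest = max(sample_sizes)
    reference = estimates[sample_sizes.index(largest)]
    xs = [s / largest for s in sample_sizes]
    ys = [e / reference for e in estimates]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical estimator outputs at three subsample sizes.
sizes = [100, 200, 400]
stable = normalized_gradient(sizes, [50, 50, 50])     # unaffected by sample size
biased = normalized_gradient(sizes, [25, 50, 100])    # grows in step with sample size
```

Under this reading, a gradient near 0 indicates an estimate that does not depend on sample size, while a gradient near 1 indicates an estimate that simply scales with it.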

    Performance of <i>DivE</i> frequency distribution generation algorithm.

    <p>*Mean error across all subjects and all small subsamples, for each data source. Small subsamples were defined as those ≤50% of the size of each patient's observed data set. Error is defined as the sum of absolute discrepancies between the true and estimated frequency distributions, divided by the area under the true distribution.</p><p>†Mean percentage error across all subjects and all small subsamples in the Gini coefficients of the true and estimated distributions.</p>
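Both error measures in these footnotes can be sketched for rank-ordered abundance lists. The code below treats the "area under the true distribution" as the sum of the true abundances and uses the standard rank-based formula for the Gini coefficient; both readings, the zero-padding of unequal lengths, and the example data are assumptions made for illustration.

```python
from itertools import zip_longest


def distribution_error(true_dist, estimated_dist):
    """Sum of absolute discrepancies divided by the total of the true distribution.

    Distributions of unequal length are padded with zeros.
    """
    pairs = zip_longest(true_dist, estimated_dist, fillvalue=0)
    return sum(abs(t - e) for t, e in pairs) / sum(true_dist)


def gini(values):
    """Gini coefficient: 0 for perfectly even abundances, approaching 1 when
    a single species dominates."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * sum(ordered))


def gini_percentage_error(true_dist, estimated_dist):
    """Absolute difference between the two Gini coefficients, in percent of the true one."""
    g_true = gini(true_dist)
    return abs(gini(estimated_dist) - g_true) / g_true * 100.0


# Hypothetical rank-ordered clone abundances.
true_dist = [4, 3, 2, 1]
estimated_dist = [5, 3, 1, 1]
error = distribution_error(true_dist, estimated_dist)
```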