32 research outputs found

    Displaying Variation in Large Datasets: Plotting a Visual Summary of Effect Sizes

    <p>Displaying the component-wise between-group differences in high-dimensional datasets is problematic because widely used plots such as Bland–Altman and Volcano plots do not show what they are colloquially <i>believed</i> to show. Thus, it is difficult for the experimentalist to grasp why the between-group difference of one component is “significant” while that of another component is not. Here, we propose a type of “Effect Plot” that displays between-group differences in relation to the respective underlying variability for every component of a high-dimensional dataset. We use synthetic data to show that such a plot captures the essence of what determines “significance” for between-group differences in each component, and we provide guidance on interpreting the plot. Supplementary online materials contain the code and data for this article and include simple R functions to produce an effect plot from suitable datasets.</p>
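The idea of relating a between-group difference to its underlying variability can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name `effect_summary`, the use of group means for the difference, and the use of the larger within-group standard deviation as the dispersion are all assumptions made for illustration only.

```python
from statistics import mean, stdev

def effect_summary(group_a, group_b):
    """Return (difference, dispersion, effect) for one component.

    Assumptions (not taken from the article):
    - difference is the difference between the two group means
    - dispersion is the larger of the two within-group standard deviations
    - effect is difference divided by dispersion
    """
    difference = mean(group_b) - mean(group_a)
    dispersion = max(stdev(group_a), stdev(group_b))
    # Guard against a zero dispersion (identical values within both groups).
    effect = difference / dispersion if dispersion > 0 else float("inf")
    return difference, dispersion, effect

# Two hypothetical components with the same between-group difference
# but very different within-group variability.
components = {
    "tight": ([1.0, 1.2, 0.8], [3.0, 3.2, 2.8]),
    "noisy": ([0.0, 1.0, 2.0], [2.0, 3.0, 4.0]),
}

for name, (a, b) in components.items():
    diff, disp, eff = effect_summary(a, b)
    print(f"{name}: difference={diff:.2f} dispersion={disp:.2f} effect={eff:.2f}")
```

Plotting the difference against the dispersion for every component would place both hypothetical components at the same height but far apart horizontally, which is the kind of distinction a plot of difference against abundance alone would not reveal.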

    Sampling and technical variance is distinct from within-condition variance.

    <p>A comparison of the within-condition gene expression difference to the median expression level for three different experiments. The left panel shows the sampling variance for comparison, and the three experiments are shown in subsequent panels. The L-K RNA-Seq data set compares gene expression in two liver and two kidney samples. The <i>Bacillus cereus</i> RNA-Seq data set is for samples grown at neutral pH and two samples taken 20 minutes after a shift to growth at low pH. The Meta-RNA-Seq data is for microbial gene expression analysis of four clinical vaginal samples from two women with a healthy microbiota and two women with a microbiota indicative of bacterial vaginosis. The RNA-Seq experiments in the L-K and <i>B. cereus</i> datasets were from controlled conditions with identical gene content per condition and show that the vast majority of highly-expressed genes have small within-condition estimates of , and that estimate only becomes imprecise as becomes very small. The Meta-RNA-Seq panel shows that when within-condition variance is high, there is no relationship between the expression level and the within-condition variance. Note that base-2 logarithms were used throughout.</p>

    ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq

    <div><p>Experimental variance is a major challenge when dealing with high-throughput sequencing data. This variance has several sources: sampling replication, technical replication, variability <i>within</i> biological conditions, and variability <i>between</i> biological conditions. The high per-sample cost of RNA-Seq often precludes the large number of experiments needed to partition observed variance into these categories as per standard ANOVA models. We show that the partitioning of within-condition to between-condition variation cannot reasonably be ignored, whether in single-organism RNA-Seq or in Meta-RNA-Seq experiments, and further find that commonly-used RNA-Seq analysis tools, as described in the literature, do not enforce the constraint that the sum of relative expression levels must be one, and thus report expression levels that are systematically distorted. These two factors lead to misleading inferences if not properly accommodated. As it is usually only the biological between-condition and within-condition differences that are of interest, we developed ALDEx, an ANOVA-like differential expression procedure, to identify genes with greater between- to within-condition differences. We show that the presence of differential expression and the magnitude of these comparative differences can be reasonably estimated with even very small sample sizes.</p></div>

    Venn diagram of the four differential expression methods in the <i>B. cereus</i> dataset.

    <p>Transcript abundances were identified as differential as in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0067019#pone-0067019-g005" target="_blank">Figure 5</a>. The overlap between the number of differentially expressed transcripts for each method is given in the individual cells of the diagram. The number of differentially regulated transcripts for each method is: ALDEx 1614, DESeq 1587, edgeR 1393, CuffDiff 1465. The diagram was prepared using the Venny web tool (Available: <a href="http://bioinfogp.cnb.csic.es/tools/venny" target="_blank">http://bioinfogp.cnb.csic.es/tools/venny</a>. Accessed May 23, 2013) <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0067019#pone.0067019-Oliveros1" target="_blank">[57]</a>.</p>

    Sample characteristics

    <p>Sample names, coding sequence (CDS) numbers, the range in mappable reads per sample, and the number of genes with 0 reads in any sample are given.</p>

    Dirichlet-distributed proportions accurately account for the sampling variance.

    <p>This plot overlays the expected range between the 1–99% quantiles for with the observed range of computed for the Liver library replicates in the L–K dataset. Marioni et al. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0067019#pone.0067019-Marioni1" target="_blank">[19]</a> minimized technical error with an experimental design where the same Illumina library was run in two separate lanes. Monte Carlo estimates of are shown in red with the density of the values shown in orange, while 1–99% expected quantile ranges from the Dirichlet are shown in black. This demonstrates that the error inherent in high-throughput sequencing is greatest when the counts are small and least when the counts are large. The near-perfect overlay of actual and modelled values strongly supports the idea that modelling proportions through a Dirichlet-multinomial process accurately accounts for the sampling variance inherent in RNA-Seq, and by extension in other high-throughput sequencing analyses. The error in estimating the expected quantiles is observable by the size of the points plotted in black and becomes small when expression is non-trivial. Values on the -axis were calculated with the given formula for and were adjusted to remove the non-informative subspace as outlined in the text. Thus the -axis value of zero corresponds to the expected per-gene log-expression value.</p>
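The claim that sampling error is greatest for small counts and least for large counts can be illustrated with a small Monte Carlo sketch. It is a stand-alone illustration, not the authors' implementation: the function names, the example counts, the number of draws, and the prior of 0.5 added to each count are assumptions chosen for the example.

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def quantile_range(counts, n_draws=2000, lo=0.01, hi=0.99, seed=0):
    """Return the per-gene (lo, hi) quantiles of Monte Carlo proportions.

    Assumption: a prior of 0.5 is added to every count so that a zero
    count still yields a valid (positive) Dirichlet parameter.
    """
    rng = random.Random(seed)
    alphas = [c + 0.5 for c in counts]
    samples = [dirichlet_sample(alphas, rng) for _ in range(n_draws)]
    ranges = []
    for gene in range(len(counts)):
        values = sorted(sample[gene] for sample in samples)
        lo_value = values[int(lo * (n_draws - 1))]
        hi_value = values[int(hi * (n_draws - 1))]
        ranges.append((lo_value, hi_value))
    return ranges

# Hypothetical read counts for four genes in one library.
counts = [2, 20, 200, 2000]
total = sum(counts)
for count, (q_lo, q_hi) in zip(counts, quantile_range(counts)):
    observed = count / total
    # Width of the 1-99% range relative to the observed proportion.
    print(count, round((q_hi - q_lo) / observed, 3))
```

The printed relative width of the 1–99% range shrinks as the count grows, which mirrors the pattern the caption describes.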

    Meta sample taxonomic abundance

    <p>The organism name and proportional abundance for the two clinical samples with normal Nugent scores (N) and the two samples with high Nugent scores indicative of bacterial vaginosis (BV) are given. Totals may not sum to 1 because of rounding errors.</p>

    Comparison of four differential expression methods in the <i>B. cereus</i> dataset.

    <p>Transcript abundances identified as differential by the first three methods are highlighted in red on a background density plot. Default false discovery rates for each program were used, 0.05 for CuffDiff and 0.1 for both edgeR and DESeq, since these reflect the configurations in which most users will use these programs. In the case of ALDEx transcripts with are highlighted in red and orange for and . Transcripts originating from genes contained on the plasmid that is found in one sample from each condition are circled. The top row shows typical Bland-Altman style (MA) plots where the median absolute fold change () is plotted vs. the mean expression value (Expression). The mean expression value on the x-axis is 0 for the reasons outlined in the text. Notice that the edgeR method identifies differentially-expressed transcripts with much lower abundances than the other three methods. The plasmid-encoded genes are not differentiated on the Bland-Altman-style plots. The bottom row shows an MW plot of the median absolute fold change between conditions () vs. the maximum within-condition difference () of the same data. Here it is clear that transcripts originating from the plasmid-encoded genes exhibit very large values. Interestingly, there are a number of chromosomally-encoded genes in this dataset, and in the other two (see previous figure), that also show large values, demonstrating that within-condition variation can be problematic even for samples derived from well-controlled conditions. Both CuffDiff and edgeR identified a significant fraction of the plasmid-derived transcripts as differentially expressed.</p>

    True and False positive identification in simulated data.

    <p>A set of eleven genes with simulated read counts between 1 and 1024 in two-fold increments was appended twice to a single sample of the <i>B. cereus</i> dataset. Two conditions were generated by multiplying the counts for a single set of simulated genes in each condition by the fold-difference values indicated in the True positive panel on the left, and two simulated technical replicates were generated for each condition by sampling from the Dirichlet distribution, which accurately models technical variance in these datasets (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0067019#pone-0067019-g001" target="_blank">Figure 1</a>). The resulting four samples were examined by DESeq, edgeR and ALDEx for the ability of each method to identify the simulated differentially-expressed genes. The fold change varied between 1.1 and 10, and 100 simulations were run for each fold change. The fold change value is overlaid on the corresponding curve in the left panel. The line colors are black for edgeR, blue for DESeq and red for ALDEx, and the symbols are the same for each fold change value across each method. The ALDEx cutoff of 1.5 is shown as a solid line and the cutoff of 2.0 as a dashed line. The right panel shows the per-gene false positive rate for each method at two cutoffs. False positive events in this model can only arise through outliers in the Dirichlet sampling procedure. The rate was calculated by dividing the number of false positive genes identified in each trial by the number of genes in the dataset (5358). The boxplot shows the range of false positive rates observed for each method across all trials and all expression levels. A rate of 0.0002 corresponds to approximately 1 false positive per trial in this dataset.</p>
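The arithmetic in this caption can be checked with a short sketch. The gene count (5358), the count range (1 to 1024 in two-fold steps) and the rate (0.0002) come from the caption; the function and variable names are hypothetical and chosen only for this illustration.

```python
N_GENES = 5358  # number of genes in the dataset, as stated in the caption

# Eleven simulated genes with read counts from 1 to 1024 in two-fold steps.
simulated_counts = [2 ** k for k in range(11)]

def false_positive_rate(n_false_positives, n_genes=N_GENES):
    """Per-gene false positive rate for a single trial."""
    return n_false_positives / n_genes

def expected_false_positives(rate, n_genes=N_GENES):
    """Number of false positive genes per trial implied by a given rate."""
    return rate * n_genes

print(simulated_counts)
print(false_positive_rate(1))            # rate produced by one false positive
print(expected_false_positives(0.0002))  # false positives implied by 0.0002
```

A rate of 0.0002 multiplied by 5358 genes gives about 1.07, consistent with the caption's statement of approximately one false positive per trial.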