23 research outputs found

    Replicates via co-expression pairs to extract the noisy samples.

    <p>As an example, we use the BrainSpan RNA-seq dataset on 500 samples. We randomly added noise to one sample at a time at a given noise factor, and repeated this 100 times. (A) For each run, we generated an ROC curve; shown are the average ROCs of these 100 runs, with the AUROC ranging between 0.5 and 1. A noise factor of around 5% was enough to disturb the experiment’s replicability, giving an AUROC of 0.73. (B) Average AUROCs across sample sizes and noise factors, with standard error bars. Each point is an average of 100 repeats. The lines represent a linear fit of the data points. (C) Predicted noise factors for varying AUROCs show the dependence on sample size and level of performance for the BrainSpan dataset.</p>

    Additional file 2: Table S1. of Strength of functional signature correlates with effect size in autism

    Functional convergence correlation/trend distributions. Table S2. Additional expression functional convergences and correlations/trends. Table S3. Additional gene properties functional convergences and correlations/trends. Table S4. Additional MSigDB functional convergences and correlations/trends. Table S5. Additional network connectivity functional convergences and correlations/trends. (XLSX 4939 kb)

    Simpson’s paradox in expression data.

    <p>(A) Sample-sample replicability plots are possible for each of the 18 conditions tested. Shown is a representative set of 9 sample-sample plots. Each point represents a gene, with X and Y values being the two expression measures, i.e., original and replicate. Different samples are shown in different colors. (B) We overlay all these individual sample-sample plots onto the same axes to give the aggregate view of expression across samples and their replicates. This allows us to determine replication of gene variation across samples. (C) Some genes replicate well, changing their expression in a consistent way across samples and their replicates. We have highlighted genes which are positively correlated with their replicate across the samples and also have no overlapping dynamic ranges. Note that sample-sample correlations across this set of genes would be high even if the labels identifying samples were permuted (inset of “Gene A” shows labelled samples). (D) There are also genes which are negatively correlated with their replicates across samples (inset of “Gene B” shows labelled samples). Because these genes also do not have overlapping dynamic ranges, the sample-sample correlation across them would remain high for any given sample pair.</p>
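The effect described in panels C and D can be reproduced with a small toy example. The sketch below is illustrative only: the gene baselines, per-sample shifts, and the `pearson` helper are assumptions made for this example and are not taken from the dataset. Each gene is perfectly anti-correlated with its replicate across samples, yet every sample still correlates strongly with its replicate across genes, because the non-overlapping gene ranges dominate.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data: 4 genes with non-overlapping ranges, 3 samples.
baselines = [0.0, 10.0, 20.0, 30.0]   # one baseline per gene
shifts = [-1.0, 0.0, 1.0]             # one shift per sample

# The replicate moves in the opposite direction to the original for every gene.
original = [[b + s for s in shifts] for b in baselines]
replicate = [[b - s for s in shifts] for b in baselines]

# Gene-level view: correlate each gene with its replicate across samples.
gene_r = [pearson(o, r) for o, r in zip(original, replicate)]

# Sample-level view: correlate each sample with its replicate across genes.
sample_r = [
    pearson([g[j] for g in original], [g[j] for g in replicate])
    for j in range(len(shifts))
]

print("per-gene r:", gene_r)
print("per-sample r:", sample_r)
```

In this toy setting the per-sample correlations stay high although no gene replicates its own variation, which is the situation the caption warns about.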

    Additional file 1: Figure S1. of Strength of functional signature correlates with effect size in autism

    Trend line robustness analysis. Figure S2. Functional convergences null for matched length and multifunctionality controls. Figure S3. Functional convergence correlation/trend distributions for GWAS. Figure S4. Functional convergence correlation/trend distributions for all network connectivity tests and all gene expression tests. Figure S5. Functional convergence correlation/trend distributions for all MSigDB collections. (DOCX 375 kb)

    Estimating replicability using different gene-pairs.

    <p>(A) Comparing the use of co-expressing pairs (solid), actual replicates (shaded) and random pairs (dashed) for the ENCODE dataset. Each point gives the average AUROC for detecting a noisy sample in the data across the varying noise factors. The sample size here was held constant at 18. The black lines are the original expression values, and the purple lines are those where we randomized the expression values (shuffling within an experiment). (B) Performance of the centiles for the experiment with and without the addition of noise. We only show the 50% noise factor. The replicates and stoichiometric pairs show very similar performance at this noise level (AUROC~1).</p>

    Schematic of the AuPairWise method and guidelines for interpreting results.

    <p>(A) Input into the script is an expression matrix. Noise is added to a sample at random, and an AUROC is calculated based on how well the perturbation is detected by the gene-pairs. This is repeated for multiple noise factors, which then allows us to estimate the amount of noise required to significantly disrupt the experiment, which is used as our metric for replicability. The outputs are summary files with the AUROCs and noise estimates, along with the summary plot. (B) Guidelines for interpreting results. We plot performance of the stoichiometric pair against the random pair. Two toy examples are shown as stars, with their corresponding p-values beneath them. High performance (AUROCs) on the stoichiometric pairs and low performance on the random pairs for a given noise factor implies a fair to good experiment (dark blue shading). Any results that fall closer to the identity line are less certain and probably contain systematic noise (lighter regions), and are likely poorer experiments (red regions).</p>
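The AUROC used throughout these captions can be computed directly from per-sample scores. The following is a minimal sketch, not the AuPairWise script itself: the function name `auroc` and its arguments are assumptions for illustration. It uses the standard equivalence between the AUROC and the normalized Mann-Whitney U statistic, i.e., the fraction of (perturbed, unperturbed) sample pairs in which the perturbed sample scores higher, with ties counted as one half.

```python
def auroc(scores, is_perturbed):
    """AUROC for ranking perturbed samples above unperturbed ones.

    scores       -- one outlier score per sample (higher = more suspicious)
    is_perturbed -- one boolean per sample, True for the sample(s) given noise
    """
    pos = [s for s, p in zip(scores, is_perturbed) if p]
    neg = [s for s, p in zip(scores, is_perturbed) if not p]
    if not pos or not neg:
        raise ValueError("need at least one perturbed and one unperturbed sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # perturbed sample correctly ranked higher
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# The noisy sample has the highest score, so detection is perfect.
print(auroc([0.1, 0.2, 0.9, 0.3], [False, False, True, False]))
```

An AUROC near 0.5 means the scores carry no information about which sample was perturbed, matching the lower bound quoted in the captions.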

    Individual genes have similar expression levels in many tissues.

    <p>Samples replicate one another to some degree, regardless of the conditions under which they are measured, i.e., whether or not they are actually biological replicates. (A) Liver expression levels within the same experiment are very highly correlated (Spearman’s <i>r</i><sub><i>s</i></sub> = 0.976, Pearson’s <i>r</i> = 0.999). (B) Liver expression is moderately correlated with kidney expression within the same experiment (Spearman’s <i>r</i><sub><i>s</i></sub> = 0.837, Pearson’s <i>r</i> = 0.894). (C) Liver expression levels in two different experiments are less well correlated than those between tissues (Spearman’s <i>r</i><sub><i>s</i></sub> = 0.846, Pearson’s <i>r</i> = 0.537).</p>
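Because the caption reports both Spearman and Pearson coefficients, and the two can diverge, a short sketch of how they differ may help. The helper names and the toy expression values below are assumptions for illustration and are not the data behind the figure. Spearman’s coefficient is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical expression values for five genes in two samples.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [1.0, 4.0, 9.0, 16.0, 1000.0]   # same ordering, one extreme value

print("Spearman:", spearman(sample_a, sample_b))
print("Pearson: ", pearson(sample_a, sample_b))
```

Here the gene ordering is identical in both samples, so the rank-based coefficient is 1 while the single extreme value pulls the Pearson coefficient down; this is one way the two statistics can rank comparisons differently.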

    Co-expression perturbations to pick out samples with high variation.

    <p>(A) Expression levels of genes X and Y. (B) Genes X and Y show good correlation (Spearman’s <i>r</i><sub><i>s</i></sub> = 0.9). (C) If we add noise to one sample, we see a shift. The noise model is described further in the Materials and Methods section. Briefly, for each gene in a sample, we select a new rank for it to take relative to its expression in other samples, thus sampling from within its empirical distribution. The new rank is limited to one close to the original value, as defined by the noise factor. (D) The noise added to the sample has caused it to be an outlier that disrupts the co-expression indicated by the otherwise good linear fit. The residuals of the points (regressing from the line of best fit) score the sample, and (E) the scores in aggregate allow us to draw (F) an ROC and calculate an AUROC, testing how well the outlier was detected. We also calculate a p-value (Wilcoxon test) to compare the distributions of the average AUROCs of the co-expressed pairs and an equal number of random pairs.</p>
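The two steps described here, rank-based noise and residual scoring, can be sketched as follows. This is a rough illustration under stated assumptions, not the published implementation: the function names are hypothetical, the rank window is assumed to be `noise_factor * (n_samples - 1)` positions wide on either side, the new rank is assumed to be drawn uniformly from that window, and the sample score is assumed to be the absolute residual from an ordinary least-squares line. The Materials and Methods section of the paper defines the actual model.

```python
import random

def add_rank_noise(expr, sample, noise_factor, rng):
    """Perturb one sample (column) of a genes x samples matrix.

    For each gene, the value in `sample` is replaced by another value from
    that gene's own empirical distribution whose rank lies within an assumed
    window around the original rank.
    """
    n = len(expr[0])
    max_shift = int(round(noise_factor * (n - 1)))  # assumed window size
    noisy = [row[:] for row in expr]
    for g, row in enumerate(expr):
        ordered = sorted(row)
        rank = ordered.index(row[sample])
        lo = max(0, rank - max_shift)
        hi = min(n - 1, rank + max_shift)
        noisy[g][sample] = ordered[rng.randint(lo, hi)]
    return noisy

def residual_scores(x, y):
    """Absolute residual of each sample from the least-squares line y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("gene X is constant across samples")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [abs(b - (intercept + slope * a)) for a, b in zip(x, y)]

# Toy co-expressed pair in which the third sample has been disturbed.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.0, 4.0, 16.0, 8.0, 10.0]
scores = residual_scores(gene_x, gene_y)
print(scores)
```

In the toy pair the disturbed sample receives the largest residual, so it would be ranked first when the scores are aggregated and turned into an ROC.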

    Self-correlation (replicability) and co-expression.

    <p>The degree of correlation between a gene and its replicate is plotted relative to all other genes (relative rank). As would be expected, as the correlation between a gene and its replicate (across the same conditions) rises, the rank of that correlation relative to the value between the given gene and all others also rises. However, the true replicate is only most similar to the given gene in ~20% of cases (solid black line, 4,024 genes), i.e., the solid line is at 0.2 when the rank is exactly 1 (dashed lines). The steep fall off in this trend shows that most replicates are at least very highly ranked by the correct gene.</p>

    Low expression levels and high fold changes provide sensitive quality control.

    <p>(A) Histogram of the fraction of poorly replicable genes, filtered on mean expression, or (B) on fold change. (C) Average expression plotted against fold change to compare gene-gene replicability to the SEQC criteria. The red points are genes that were not well correlated (Pearson’s <i>r</i> < 0.9) with their replicates across conditions. The fraction of these red points across mean expression (log<sub>2</sub> FPKM) is shown in the histogram in panel A, and the fraction of these red points across fold change (log<sub>2</sub>) is shown in panel B. The filters recommended by the SEQC are shown by the dotted blue lines, across both mean and fold change. We see that the fraction of poorly replicated genes drops significantly at the recommended filters, i.e., discarding fold changes less than log<sub>2</sub> 1~2, and discarding the lowly expressed genes (bottom third). The grey lines show the histograms for a given measure (mean expression in A, fold change in B) contingent on the SEQC criterion for the other having already been applied.</p>
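The two filters mentioned in this caption can be expressed as a short function. This is a hedged sketch rather than the SEQC procedure itself: the function name, the default threshold of 1.0 on the absolute log2 fold change (the caption gives a range of 1 to 2), and the treatment of the bottom third as a simple rank cut are assumptions for illustration.

```python
def seqc_style_filter(mean_log2_expr, log2_fold_change,
                      min_abs_log2_fc=1.0, drop_fraction=1 / 3):
    """Return indices of genes that pass both filters.

    mean_log2_expr   -- mean expression per gene (e.g., log2 FPKM)
    log2_fold_change -- log2 fold change per gene
    min_abs_log2_fc  -- assumed cutoff; genes below it are discarded
    drop_fraction    -- fraction of lowest-expressed genes to discard
    """
    n = len(mean_log2_expr)
    n_drop = int(round(n * drop_fraction))
    by_expr = sorted(range(n), key=lambda i: mean_log2_expr[i])
    low_expressed = set(by_expr[:n_drop])
    return [
        i for i in range(n)
        if i not in low_expressed and abs(log2_fold_change[i]) >= min_abs_log2_fc
    ]

# Hypothetical values for six genes.
means = [0.1, 5.0, 3.0, 0.2, 8.0, 4.0]
fold_changes = [3.0, 0.5, -2.0, 2.0, 1.5, 0.2]
kept = seqc_style_filter(means, fold_changes)
print(kept)
```

Genes 0 and 3 are dropped for low expression despite large fold changes, and genes 1 and 5 are dropped for small fold changes, leaving only the genes that satisfy both criteria.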