Effect of pooling samples on the efficiency of comparative studies using microarrays
Many biomedical experiments are carried out by pooling individual biological
samples. However, pooling samples can potentially hide biological variance and
give false confidence concerning the data significance. In the context of
microarray experiments for detecting differentially expressed genes, recent
publications have addressed the problem of the efficiency of sample-pooling,
and some approximate formulas were provided for the power and sample size
calculations. It is desirable to have exact formulas for these calculations and
have the approximate results checked against the exact ones. We show that the
difference between the approximate and exact results can be large. In this
study, we have characterized quantitatively the effect of pooling samples on
the efficiency of microarray experiments for the detection of differential gene
expression between two classes. We present exact formulas for calculating the
power of microarray experimental designs involving sample pooling and technical
replications. The formulas can be used to determine the total numbers of arrays
and biological subjects required in an experiment to achieve the desired power
at a given significance level. The conditions under which pooled design becomes
preferable to non-pooled design can then be derived given the unit cost
associated with a microarray and that with a biological subject. This paper
thus serves to provide guidance on sample pooling and cost effectiveness. The
formulation in this paper is outlined in the context of performing microarray
comparative studies, but its applicability is not limited to microarray
experiments. It is also applicable to a wide range of biomedical comparative
studies where sample pooling may be involved. Comment: 8 pages, 1 figure, 2 tables; to appear in Bioinformatic
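The trade-off this abstract describes, power against the number of arrays and subjects, can be illustrated with a rough sketch. This is not the paper's exact formula, which the abstract does not state. It assumes a simple additive variance model on the log scale and a normal (z-test) approximation, so it is exactly the kind of approximate result the authors caution may differ from the exact one. The names `pooled_power`, `design_cost`, `sigma_b` (biological standard deviation), and `sigma_e` (technical standard deviation) are hypothetical.

```python
import math
from statistics import NormalDist


def pooled_power(delta, sigma_b, sigma_e, n_pools, pool_size, tech_reps=1, alpha=0.05):
    """Approximate two-sided power to detect a mean log-expression difference `delta`.

    Assumed model (not taken from the paper): every array measures a pool of
    `pool_size` subjects, pooling averages biological variation, and each pool
    is hybridized on `tech_reps` arrays. Both classes use `n_pools` pools.
    """
    # Variance of one class mean: biological part shrinks with pool size,
    # technical part shrinks with technical replication.
    var_class_mean = (sigma_b ** 2 / pool_size + sigma_e ** 2 / tech_reps) / n_pools
    se_diff = math.sqrt(2.0 * var_class_mean)

    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / se_diff
    # Probability of rejecting in either tail under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def design_cost(n_pools, pool_size, tech_reps, cost_array, cost_subject):
    """Total cost for two classes, given unit costs per array and per subject."""
    arrays = 2 * n_pools * tech_reps
    subjects = 2 * n_pools * pool_size
    return arrays * cost_array + subjects * cost_subject
```

Under these assumptions, increasing `pool_size` raises power only through the biological term, so pooling pays off mainly when subjects are cheap relative to arrays and biological variance dominates technical variance.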
On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments
Background: In gene expression studies, RNA sample pooling is sometimes considered because of budget constraints or lack of sufficient input material. Using microarray technology, RNA sample pooling strategies have been reported to optimize both the cost of data generation as well as the statistical power for differential gene expression (DGE) analysis. For RNA sequencing, with its different quantitative output in terms of counts and tunable dynamic range, the adequacy and empirical validation of RNA sample pooling strategies have not yet been evaluated. In this study, we comprehensively assessed the utility of pooling strategies in RNA-seq experiments using empirical and simulated RNA-seq datasets.
Results: The data-generating model in pooled experiments is defined mathematically to evaluate the mean and variability of gene expression estimates. The model is further used to examine the trade-off between the statistical power of testing for DGE and the data-generating costs. Empirical assessment of pooling strategies is done through analysis of RNA-seq datasets under various pooling and non-pooling experimental settings. A simulation study is also used to rank experimental scenarios with respect to the rate of false and true discoveries in DGE analysis. The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined.
Conclusion: For high within-group gene expression variability, small RNA sample pools are effective in reducing the variability and compensating for the loss of replicates. Unlike the typical cost-saving strategies, such as reducing sequencing depth or number of RNA samples (replicates), an adequate pooling strategy is effective in maintaining the power of testing DGE for genes with low to medium abundance levels, along with a substantial reduction of the total cost of the experiment. In general, pooling RNA samples or pooling RNA samples in conjunction with moderate reduction of the sequencing depth can be good options to optimize the cost and maintain the power.
poolMC: Smart pooling of mRNA samples in microarray experiments
Background: Typically, pooling of mRNA samples in microarray experiments implies mixing mRNA from several biological-replicate samples before hybridization onto a microarray chip. Here we describe an alternative smart pooling strategy in which different samples, not necessarily biological replicates, are pooled in an information theoretic efficient way. Further, each sample is tested on multiple chips, but always in pools made up of different samples. The end goal is to exploit the compressibility of microarray data to reduce the number of chips used and increase the robustness to noise in measurements. Results: A theoretical framework to perform smart pooling of mRNA samples in microarray experiments was established and the software implementation of the pooling and decoding algorithms was developed in MATLAB. A proof-of-concept smart pooled experiment was performed using validated biological samples on commercially available gene chips. Conclusions: The theoretical developments and experimental demonstration in this paper provide a useful starting point to investigate smart pooling of mRNA samples in microarray experiments. Important conditions for its successful implementation include linearity of measurements, sparsity in data, and large experiment size.
Diverse correlation structures in gene expression data and their utility in improving statistical inference
It is well known that correlations in microarray data represent a serious
nuisance deteriorating the performance of gene selection procedures. This paper
is intended to demonstrate that the correlation structure of microarray data
provides a rich source of useful information. We discuss distinct correlation
substructures revealed in microarray gene expression data by an appropriate
ordering of genes. These substructures include stochastic proportionality of
expression signals in a large percentage of all gene pairs, negative
correlations hidden in ordered gene triples, and a long sequence of weakly
dependent random variables associated with ordered pairs of genes. The reported
striking regularities are of general biological interest and they also have
far-reaching implications for theory and practice of statistical methods of
microarray data analysis. We illustrate the latter point with a method for
testing differential expression of nonoverlapping gene pairs. While designed
for testing a different null hypothesis, this method provides an order of
magnitude more accurate control of type 1 error rate compared to conventional
methods of individual gene expression profiling. In addition, this method is
robust to the technical noise. Quantitative inference of the correlation
structure has the potential to extend the analysis of microarray data far
beyond currently practiced methods. Comment: Published at http://dx.doi.org/10.1214/07-AOAS120 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays
A volcano plot displays unstandardized signal (e.g. log-fold-change) against
noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from
the t test). We review the basic and interactive uses of the volcano plot,
and its crucial role in understanding the regularized t-statistic. The joint
filtering gene selection criterion based on regularized statistics has a curved
discriminant line in the volcano plot, as compared to the two perpendicular
lines for the "double filtering" criterion. This review attempts to provide a
unifying framework for discussions on alternative measures of differential
expression, improved methods for estimating variance, and visual display of a
microarray analysis result. We also discuss the possibility of applying volcano
plots to other fields beyond microarrays. Comment: 8 figure
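As a minimal sketch of the two axes this abstract describes, the function below computes one gene's volcano-plot coordinates from log2 expression values: the log-fold-change and a pooled-variance t-statistic. The optional offset `s0` added to the denominator is one common form of regularized t-statistic (SAM-style) and is an assumption here, since the abstract does not say which regularization the review uses. The names `volcano_point` and `double_filter` are hypothetical.

```python
import math
from statistics import mean, variance


def volcano_point(group_a, group_b, s0=0.0):
    """Return (log_fold_change, t_like_statistic) for one gene.

    Inputs are log2 expression values for the two classes. With s0 = 0 the
    second value is the ordinary equal-variance two-sample t-statistic;
    s0 > 0 gives a regularized variant (assumed SAM-style form).
    """
    na, nb = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)  # unstandardized signal (x-axis)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return diff, diff / (se + s0)  # standardized signal (y-axis)


def double_filter(log_fc, t_stat, min_abs_log_fc, min_abs_t):
    """'Double filtering': a gene must pass both thresholds independently,
    which corresponds to two perpendicular cut lines on the volcano plot."""
    return abs(log_fc) >= min_abs_log_fc and abs(t_stat) >= min_abs_t
```

Because `s0` enters only the denominator, a threshold on the regularized statistic constrains the log-fold-change and the ordinary t-statistic jointly. This is consistent with the curved discriminant line the abstract mentions.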
Latent rank change detection for analysis of splice-junction microarrays with nonlinear effects
Alternative splicing of gene transcripts greatly expands the functional
capacity of the genome, and certain splice isoforms may indicate specific
disease states such as cancer. Splice junction microarrays interrogate
thousands of splice junctions, but data analysis is difficult and error prone
because of the increased complexity compared to differential gene expression
analysis. We present Rank Change Detection (RCD) as a method to identify
differential splicing events based upon a straightforward probabilistic model
comparing the over- or underrepresentation of two or more competing isoforms.
RCD has advantages over commonly used methods because it is robust to false
positive errors due to nonlinear trends in microarray measurements. Further,
RCD does not depend on prior knowledge of splice isoforms, yet it takes
advantage of the inherent structure of mutually exclusive junctions, and it is
conceptually generalizable to other types of splicing arrays or RNA-Seq. RCD
specifically identifies the biologically important cases when a splice junction
becomes more or less prevalent compared to other mutually exclusive junctions.
The example data is from different cell lines of glioblastoma tumors assayed
with Agilent microarrays. Comment: Published at http://dx.doi.org/10.1214/10-AOAS389 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer
Recently, several classifiers that combine primary tumor data, like gene
expression data, and secondary data sources, such as protein-protein
interaction networks, have been proposed for predicting outcome in breast
cancer. In these approaches, new composite features are typically constructed
by aggregating the expression levels of several genes. The secondary data
sources are employed to guide this aggregation. Although many studies claim
that these approaches improve classification performance over single gene
classifiers, the gain in performance is difficult to assess. This stems mainly
from the fact that different breast cancer data sets and validation procedures
are employed to assess the performance. Here we address these issues by
employing a large cohort of six breast cancer data sets as a benchmark set and by
performing an unbiased evaluation of the classification accuracies of the
different approaches. Contrary to previous claims, we find that composite
feature classifiers do not outperform simple single gene classifiers. We
investigate the effect of (1) the number of selected features; (2) the specific
gene set from which features are selected; (3) the size of the training set and
(4) the heterogeneity of the data set on the performance of composite feature
and single gene classifiers. Strikingly, we find that randomization of
secondary data sources, which destroys all biological information in these
sources, does not result in a deterioration in performance of composite feature
classifiers. Finally, we show that when a proper correction for gene set size
is performed, the stability of single gene sets is similar to the stability of
composite feature sets. Based on these results there is currently no reason to
prefer prognostic classifiers based on composite features over single gene
classifiers for predicting outcome in breast cancer.
Multifactorial experimental design and the transitivity of ratios with spotted DNA microarrays
BACKGROUND: Multifactorial experimental designs using DNA microarrays are becoming increasingly common, but the extent of the transitivity of cDNA microarray expression measurements across multiple samples has yet to be explored. RESULTS: A strong correlation between direct and transitive inference for significantly differentially expressed genes is demonstrated, using subsets of a dye-swap loop design. CONCLUSIONS: In experimental design, opportunities for transitive inference should be exploited, while always ensuring that comparisons of greatest interest comprise direct hybridizations.
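The idea of transitive inference in this abstract can be sketched as follows. If samples A and B are co-hybridized on one array and B and C on another, the A-versus-C log-ratio can be inferred by adding the two measured log-ratios and then compared with a direct A-versus-C hybridization. The abstract does not give the authors' procedure, so the helper names and the use of Pearson correlation as the agreement measure are assumptions for illustration.

```python
import math


def transitive_log_ratio(m_ab, m_bc):
    """Infer log2(A/C) from measured log2(A/B) and log2(B/C).

    Log-ratios add because (A/B) * (B/C) = A/C.
    """
    return m_ab + m_bc


def pearson(xs, ys):
    """Pearson correlation, used here as an assumed measure of agreement
    between direct and transitive log-ratios across genes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

In noise-free data the transitive and direct values coincide. With real arrays each additional hybridization in the chain adds its own measurement error, which is in line with the recommendation to measure the comparisons of greatest interest directly.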