A mixture model approach to sample size estimation in two-sample comparative microarray experiments
Background: Choosing the appropriate sample size is an important step in the design of a
microarray experiment, and recently methods have been proposed that estimate sample sizes for
control of the False Discovery Rate (FDR). Many of these methods require knowledge of the
distribution of effect sizes among the differentially expressed genes. If this distribution can be
determined then accurate sample size requirements can be calculated.
Results: We present a mixture model approach to estimating the distribution of effect sizes in data
from two-sample comparative studies. Specifically, we present a novel closed-form algorithm for
estimating the noncentrality parameters in the test statistic distributions of differentially expressed
genes. We then show how our model can be used to estimate sample sizes that control the FDR
together with other statistical measures like average power or the false nondiscovery rate. Method
performance is evaluated through a comparison with existing methods for sample size estimation,
and is found to be very good.
Conclusion: A novel method for estimating the appropriate sample size for a two-sample
comparative microarray study is presented. The method is shown to perform very well when
compared to existing methods.
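The link between FDR, average power and sample size can be illustrated with a rough sketch. This is not the mixture-model method of the abstract: it assumes a single common standardized effect size for all differentially expressed genes, a normal approximation to the two-sample test statistic, and the common approximation FDR ≈ π0·α / (π0·α + (1 − π0)·power). The function names and default values (`pi0=0.95`, `power=0.8`) are hypothetical.

```python
import math
from statistics import NormalDist


def per_test_alpha(fdr, power, pi0):
    """Per-gene significance level implied by a target FDR and average power.

    Solves fdr = pi0*alpha / (pi0*alpha + (1 - pi0)*power) for alpha,
    where pi0 is the assumed proportion of non-differentially expressed genes.
    """
    alpha = fdr * (1.0 - pi0) * power / (pi0 * (1.0 - fdr))
    if not 0.0 < alpha < 1.0:
        raise ValueError("target FDR, power and pi0 imply an invalid alpha")
    return alpha


def per_group_sample_size(effect_size, fdr=0.05, power=0.8, pi0=0.95):
    """Approximate samples per group for a two-sided two-sample comparison.

    Assumption: every differentially expressed gene has the same standardized
    effect size (mean difference / SD), and the test statistic is treated as
    normal rather than noncentral t.
    """
    alpha = per_test_alpha(fdr, power, pi0)
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)
```

Because the required per-gene significance level shrinks as π0 approaches 1, the sample size grows when fewer genes are truly differentially expressed. A full treatment would replace the single effect size with an estimated distribution of effect sizes, as the abstract describes.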
Recovering Sparse Signals Using Sparse Measurement Matrices in Compressed DNA Microarrays
Microarrays (DNA, protein, etc.) are massively parallel affinity-based biosensors capable of detecting and quantifying a large number of different genomic particles simultaneously. Among them, DNA microarrays comprising tens of thousands of probe spots are currently being employed to test a multitude of targets in a single experiment. In conventional microarrays, each spot contains a large number of copies of a single probe designed to capture a single target and, hence, collects only a single data point. This is a wasteful use of the sensing resources in comparative DNA microarray experiments, where a test sample is measured relative to a reference sample. Typically, only a fraction of the total number of genes represented by the two samples is differentially expressed, and thus a vast number of probe spots may not provide any useful information. To this end, we propose an alternative design, the so-called compressed microarrays, wherein each spot contains copies of several different probes and the total number of spots is potentially much smaller than the number of targets being tested. Fewer spots directly translate to significantly lower costs due to cheaper array manufacturing, simpler image acquisition and processing, and a smaller amount of genomic material needed for experiments. To recover signals from compressed microarray measurements, we leverage ideas from compressive sampling. For sparse measurement matrices, we propose an algorithm that has significantly lower computational complexity than the widely used linear-programming-based methods, and can also recover signals with less sparsity.
Differential expression analysis with global network adjustment
<p>Background: Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene's expression as a function of other genes, thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, and the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.</p>
<p>Results: We demonstrate a type of penalized regression model that can be estimated from large gene expression databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient “over-shrinkage” method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise ratio, allowing more powerful inferences on differential gene expression and leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, which would be impossible with standard differential expression methods.</p>
<p>Conclusions: By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.</p>
M-quantile regression analysis of temporal gene expression data
In this paper, we explore the use of M-regression and M-quantile coefficients to detect statistical differences between temporal curves that belong to different experimental conditions. In particular, we consider the application to temporal gene expression data. Here, the aim is to detect genes whose temporal expression is significantly different across a number of biological conditions. We present a new method to approach this problem. Firstly, the temporal profiles of the genes are modelled by a parametric M-quantile regression model. This model is particularly appealing for small-sample gene expression data, as it is very robust against outliers and does not make any assumption on the error distribution. Secondly, we further increase the robustness of the method by summarising the M-quantile regression models for a large range of quantile values into an M-quantile coefficient. Finally, we employ a Hotelling T²-test to detect significant differences of the temporal M-quantile profiles across conditions. Simulated data show the increased robustness of M-quantile regression methods over standard regression methods. We conclude by using the method to detect differentially expressed genes from time-course microarray data on muscular dystrophy.
Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments
A two-groups mixed-effects model for the comparison of (normalized)
microarray data from two treatment groups is considered. Most competing
parametric methods that have appeared in the literature are obtained as special
cases or by minor modification of the proposed model. Approximate maximum
likelihood fitting is accomplished via a fast and scalable algorithm, which we
call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of
treatment gene interactions, derived from the model, involve shrinkage
estimates of both the interactions and of the gene specific error variances.
Genes are classified as being associated with treatment based on the posterior
odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our
model-based approach also allows one to declare the non-null status of a gene
by controlling the false discovery rate (FDR). It is shown in a detailed
simulation study that the approach outperforms well-known competitors. We also
apply the proposed methodology to two previously analyzed microarray examples.
Extensions of the proposed method to paired treatments and multiple treatments
are also discussed.
Comment: Published at http://dx.doi.org/10.1214/10-STS339 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
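The local false discovery rate mentioned in the abstract above can be illustrated with a minimal sketch. This is not LEMMA itself, which fits a mixed-effects model with shrinkage estimates; the sketch assumes a plain two-component normal mixture for a gene-level z-score, with hypothetical defaults for the null proportion `pi0`, the alternative distribution, and the cutoff `0.2`.

```python
from statistics import NormalDist


def local_fdr(z, pi0=0.9, null=NormalDist(0.0, 1.0), alt=NormalDist(3.0, 1.0)):
    """Posterior probability that a gene with score z is null.

    Two-groups form: fdr(z) = pi0*f0(z) / (pi0*f0(z) + (1 - pi0)*f1(z)).
    The normal null and alternative densities are illustrative assumptions.
    """
    null_part = pi0 * null.pdf(z)
    alt_part = (1.0 - pi0) * alt.pdf(z)
    return null_part / (null_part + alt_part)


def is_non_null(z, cutoff=0.2, **mixture):
    """Declare a gene non-null when its local fdr falls at or below the cutoff."""
    return local_fdr(z, **mixture) <= cutoff
```

The posterior odds of the non-null state are (1 − fdr) / fdr, so thresholding the local fdr at a fixed cutoff is equivalent to thresholding the posterior odds.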
Microarrays, Empirical Bayes and the Two-Groups Model
The classic frequentist theory of hypothesis testing developed by Neyman,
Pearson and Fisher has a claim to being the twentieth century's most
influential piece of applied mathematics. Something new is happening in the
twenty-first century: high-throughput devices, such as microarrays, routinely
require simultaneous hypothesis tests for thousands of individual cases, not at
all what the classical theory had in mind. In these situations empirical Bayes
information begins to force itself upon frequentists and Bayesians alike. The
two-groups model is a simple Bayesian construction that facilitates empirical
Bayes analysis. This article concerns the interplay of Bayesian and frequentist
ideas in the two-groups setting, with particular attention focused on Benjamini
and Hochberg's False Discovery Rate method. Topics include the choice and
meaning of the null hypothesis in large-scale testing situations, power
considerations, the limitations of permutation methods, significance testing
for groups of cases (such as pathways in microarray studies), correlation
effects, multiple confidence intervals and Bayesian competitors to the
two-groups model.
Comment: This paper is commented on in [arXiv:0808.0582], [arXiv:0808.0593],
[arXiv:0808.0597], [arXiv:0808.0599]. Rejoinder in [arXiv:0808.0603].
Published at http://dx.doi.org/10.1214/07-STS236 in Statistical Science
(http://www.imstat.org/sts/) by the Institute of Mathematical Statistics
(http://www.imstat.org
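For readers unfamiliar with the Benjamini and Hochberg procedure discussed in the abstract above, here is a minimal sketch of the standard step-up rule: sort the m p-values, find the largest rank k with p(k) ≤ k·q/m, and reject the k smallest. The function name and the default level `q=0.05` are illustrative choices, and the sketch assumes independent (or positively dependent) tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: reject the k smallest p-values, where k is the largest
    rank whose sorted p-value satisfies p_(k) <= k * q / m.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in increasing order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Note the step-up behaviour: a p-value that misses its own threshold is still rejected if a larger p-value further down the sorted list meets its threshold.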
Unsupervised Classification for Tiling Arrays: ChIP-chip and Transcriptome
Tiling arrays make possible a large-scale exploration of the genome thanks to
probes that cover the whole genome at very high density, with up to 2,000,000
probes. Biological questions usually addressed are either the expression
difference between two conditions or the detection of transcribed regions. In
this work we propose to consider simultaneously both questions as an
unsupervised classification problem by modeling the joint distribution of the
two conditions. In contrast to previous methods, we account for all available
information on the probes as well as biological knowledge like annotation and
spatial dependence between probes. Since probes are not biologically relevant
units we propose a classification rule for non-connected regions covered by
several probes. Applications to transcriptomic and ChIP-chip data of
Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the
importance of precise modeling and of the region classification.
Variance component score test for time-course gene set analysis of longitudinal RNA-seq data
As gene expression measurement technology is shifting from microarrays to
sequencing, the statistical tools available for their analysis must be adapted
since RNA-seq data are measured as counts. Recently, it has been proposed to
tackle the count nature of these data by modeling log-count reads per million
as continuous variables, using nonparametric regression to account for their
inherent heteroscedasticity. Adopting such a framework, we propose tcgsaseq, a
principled, model-free and efficient top-down method for detecting longitudinal
changes in RNA-seq gene sets. Considering gene sets defined a priori, tcgsaseq
identifies those whose expression varies over time, based on an original variance
component score test accounting for both covariates and heteroscedasticity
without assuming any specific parametric distribution for the transformed
counts. We demonstrate that despite the presence of a nonparametric component,
our test statistic has a simple form and limiting distribution, and both may be
computed quickly. A permutation version of the test is additionally proposed
for very small sample sizes. Applied to both simulated data and two real
datasets, the proposed method is shown to exhibit very good statistical
properties, with an increase in stability and power when compared to state of
the art methods ROAST, edgeR and DESeq2, which can fail to control the type I
error under certain realistic settings. We have made the method available for
the community in the R package tcgsaseq.
Comment: 23 pages, 6 figures, typo corrections & acceptance acknowledgemen
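The log counts-per-million transformation that the abstract above builds on can be sketched as follows. This is only an illustration, not code from the tcgsaseq package: the offsets 0.5 and 1 are an assumption borrowed from a common convention for avoiding the logarithm of zero, and the function name is hypothetical.

```python
import math


def log2_cpm(counts, prior_count=0.5):
    """Transform raw read counts for one sample into log2 counts per million.

    The small offsets (prior_count on each count, 1 on the library size)
    are an assumed convention that keeps zero counts finite.
    """
    library_size = sum(counts)
    return [
        math.log2((c + prior_count) / (library_size + 1.0) * 1e6)
        for c in counts
    ]
```

The transformed values are continuous but heteroscedastic: low counts have higher variance on the log scale, which is why the abstract describes modeling the variance separately.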