12,138 research outputs found
Effect Size Estimation and Misclassification Rate Based Variable Selection in Linear Discriminant Analysis
Supervised classification of biological samples based on genetic information
(e.g. gene expression profiles) is an important problem in biostatistics. In
order to find both accurate and interpretable classification rules, variable
selection is indispensable. This article explores how an assessment of the
individual importance of variables (effect size estimation) can be used to
perform variable selection. I review recent effect size estimation approaches
in the context of linear discriminant analysis (LDA) and propose a new
conceptually simple effect size estimation method which is at the same time
computationally efficient. I then show how to use effect sizes to perform
variable selection based on the misclassification rate, which is the
data-independent expectation of the prediction error. Simulation studies and real
data analyses illustrate that the proposed effect size estimation and variable
selection methods are competitive. Particularly, they lead to both compact and
interpretable feature sets. Comment: 21 pages, 2 figures
aFold - using polynomial uncertainty modelling for differential gene expression estimation from RNA sequencing data
Data normalization and identification of significant differential expression represent crucial steps in RNA-Seq analysis. Many available tools rely on assumptions that are often not met by real data, including the common assumption of a symmetrical distribution of up- and down-regulated genes, and the presence of only a few differentially expressed genes and/or few outliers. Moreover, the cut-off for selecting significantly differentially expressed genes for further downstream analysis often depends on arbitrary choices.
The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies
Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend the use of FC ranking plus a non-stringent P cutoff as a baseline practice in order to generate more reproducible DEG lists. The FC criterion enhances reproducibility while the P criterion balances sensitivity and specificity.
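The recommended baseline (filter by a non-stringent P cutoff, then rank by fold change) can be sketched as follows. This is a minimal illustration, not the authors' code: the `GeneStat` record, the `select_degs` function, the cutoff of 0.05 and all numbers are hypothetical, and the P values are assumed to have been computed beforehand.

```python
from typing import List, NamedTuple


class GeneStat(NamedTuple):
    """Per-gene summary; all values below are made up for illustration."""
    gene: str
    log2_fc: float   # log2 fold change between the two groups
    p_value: float   # P value from any per-gene test (assumed precomputed)


def select_degs(stats: List[GeneStat], p_cutoff: float, top_n: int) -> List[str]:
    """Keep genes passing a lenient P cutoff, then rank them by |log2 FC|."""
    passing = [s for s in stats if s.p_value < p_cutoff]
    ranked = sorted(passing, key=lambda s: abs(s.log2_fc), reverse=True)
    return [s.gene for s in ranked[:top_n]]


stats = [
    GeneStat("geneA", 0.3, 0.0001),  # very significant, tiny effect
    GeneStat("geneB", -2.1, 0.02),
    GeneStat("geneC", 1.7, 0.04),
    GeneStat("geneD", 3.0, 0.40),    # large effect, fails the P filter
]

selected = select_degs(stats, p_cutoff=0.05, top_n=2)
print(selected)  # ['geneB', 'geneC']
```

Ranking by P alone would put geneA first despite its small effect; here the P value acts only as a filter, and the fold change decides the order.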
Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments
A two-groups mixed-effects model for the comparison of (normalized)
microarray data from two treatment groups is considered. Most competing
parametric methods that have appeared in the literature are obtained as special
cases or by minor modification of the proposed model. Approximate maximum
likelihood fitting is accomplished via a fast and scalable algorithm, which we
call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of
treatment × gene interactions, derived from the model, involve shrinkage
estimates of both the interactions and of the gene specific error variances.
Genes are classified as being associated with treatment based on the posterior
odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our
model-based approach also allows one to declare the non-null status of a gene
by controlling the false discovery rate (FDR). It is shown in a detailed
simulation study that the approach outperforms well-known competitors. We also
apply the proposed methodology to two previously analyzed microarray examples.
Extensions of the proposed method to paired treatments and multiple treatments
are also discussed. Comment: Published in Statistical Science
(http://www.imstat.org/sts/) by the Institute of Mathematical Statistics
(http://www.imstat.org) at http://dx.doi.org/10.1214/10-STS339
Bayesian estimation of Differential Transcript Usage from RNA-seq data
Next generation sequencing allows the identification of genes consisting of
differentially expressed transcripts, a term which usually refers to changes in
the overall expression level. A specific type of differential expression is
differential transcript usage (DTU), which targets changes in the relative
within-gene expression of a transcript. The contribution of this paper is to: (a)
extend the use of cjBitSeq, a previously introduced Bayesian model originally
designed for identifying changes in overall expression levels, to the DTU
context and (b) propose a Bayesian version of DRIMSeq, a frequentist
model for inferring DTU. cjBitSeq is a read based model and performs fully
Bayesian inference by MCMC sampling on the space of latent state of each
transcript per gene. BayesDRIMSeq is a count based model and estimates the
Bayes Factor of a DTU model against a null model using Laplace's approximation.
The proposed models are benchmarked against the existing ones using a recent
independent simulation study as well as a real RNA-seq dataset. Our results
suggest that the Bayesian methods exhibit performance similar to DRIMSeq in
terms of precision/recall but offer better calibration of the False Discovery Rate.
Comment: Revised version, accepted to Statistical Applications in Genetics and
Molecular Biology
Differential expression analysis with global network adjustment
Background: Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene's expression as a function of other genes thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, and the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.
Results: We demonstrate a type of penalized regression model that can be estimated from large gene expression databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient "over-shrinkage" method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise ratio allowing more powerful inferences on differential gene expression leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, which would be impossible with standard differential expression methods.
Conclusions: By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.
A nonparametric empirical Bayes framework for large-scale multiple testing
We propose a flexible and identifiable version of the two-groups model,
motivated by hierarchical Bayes considerations, that features an empirical null
and a semiparametric mixture model for the non-null cases. We use a
computationally efficient predictive recursion marginal likelihood procedure to
estimate the model parameters, even the nonparametric mixing distribution. This
leads to a nonparametric empirical Bayes testing procedure, which we call
PRtest, based on thresholding the estimated local false discovery rates.
Simulations and real-data examples demonstrate that, compared to existing
approaches, PRtest's careful handling of the non-null density can give a much
better fit in the tails of the mixture distribution which, in turn, can lead to
more realistic conclusions. Comment: 18 pages, 4 figures, 3 tables
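Thresholding local false discovery rates in a two-groups model can be illustrated with a toy example. The sketch below is not PRtest: it assumes a fully known parametric mixture (null proportion 0.9, null N(0, 1), non-null N(3, 1)) instead of the empirical null and the nonparametric mixing distribution estimated by predictive recursion. The function names and the cutoff of 0.2 are hypothetical.

```python
import math
from typing import List


def normal_pdf(z: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Density of a normal distribution at z."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z: float, pi0: float = 0.9, alt_mean: float = 3.0) -> float:
    """Posterior null probability pi0 * f0(z) / f(z) under the assumed mixture."""
    null_part = pi0 * normal_pdf(z)                        # pi0 * f0(z)
    alt_part = (1.0 - pi0) * normal_pdf(z, mean=alt_mean)  # (1 - pi0) * f1(z)
    return null_part / (null_part + alt_part)


def flag_non_null(z_scores: List[float], cutoff: float = 0.2) -> List[float]:
    """Return the z-scores whose local fdr is at or below the cutoff."""
    return [z for z in z_scores if local_fdr(z) <= cutoff]


z_scores = [0.0, 1.0, 2.5, 4.0]
flagged = flag_non_null(z_scores)
print(flagged)
```

Because the local fdr depends on the ratio of the null density to the full mixture density, a poor fit of the non-null component in the tails directly changes which cases cross the cutoff.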
Testing significance of features by lassoed principal components
We consider the problem of testing the significance of features in
high-dimensional settings. In particular, we test for differentially-expressed
genes in a microarray experiment. We wish to identify genes that are associated
with some type of outcome, such as survival time or cancer type. We propose a
new procedure, called Lassoed Principal Components (LPC), that builds upon
existing methods and can provide a sizable improvement. For instance, in the
case of two-class data, a standard (albeit simple) approach might be to compute
a two-sample t-statistic for each gene. The LPC method involves projecting
these conventional gene scores onto the eigenvectors of the gene expression
data covariance matrix and then applying an L1 penalty in order to de-noise
the resulting projections. We present a theoretical framework under which LPC
is the logical choice for identifying significant genes, and we show that LPC
can provide a marked reduction in false discovery rates over the conventional
methods on both real and simulated data. Moreover, this flexible procedure can
be applied to a variety of types of data and can be used to improve many
existing methods for the identification of significant features. Comment:
Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by
the Institute of Mathematical Statistics (http://www.imstat.org) at
http://dx.doi.org/10.1214/08-AOAS182
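The projection-then-penalty idea can be sketched in a few lines. This is an illustrative simplification rather than the published LPC procedure: it assumes the eigenvectors are already available and orthonormal, in which case an L1-penalized least-squares fit of the scores on them reduces to soft-thresholding each projection. The function names, the three toy eigenvectors, the scores and the penalty value are all hypothetical.

```python
from typing import List


def soft_threshold(value: float, lam: float) -> float:
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lpc_like_scores(scores: List[float],
                    eigenvectors: List[List[float]],
                    lam: float) -> List[float]:
    """Project gene scores onto orthonormal vectors, shrink, and reconstruct."""
    denoised = [0.0] * len(scores)
    for vec in eigenvectors:
        projection = sum(v * s for v, s in zip(vec, scores))
        coef = soft_threshold(projection, lam)  # L1 solution for orthonormal vectors
        for i, v in enumerate(vec):
            denoised[i] += coef * v
    return denoised


# Four genes; three mutually orthonormal toy "eigenvectors" (assumed, not estimated).
eigenvectors = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
]
scores = [3.0, 2.6, 0.4, 0.2]  # e.g. per-gene two-sample t-statistics

denoised = lpc_like_scores(scores, eigenvectors, lam=0.5)
print(denoised)
```

In this toy case the third projection is small enough to be set to zero, so the reconstructed scores keep only the structure shared across the first two directions.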
Generalized empirical Bayesian methods for discovery of differential data in high-throughput biology
Motivation:
High-throughput data are now commonplace in biological research. Rapidly changing technologies and applications mean that novel methods for detecting differential behaviour that account for a "large P, small n" setting are required at an increasing rate. The development of such methods is, in general, being done on an ad hoc basis, requiring further development cycles and a lack of standardization between analyses.
Results:
We present here a generalized method for identifying differential behaviour within high-throughput biological data through empirical Bayesian methods. This approach is based on our baySeq algorithm for identification of differential expression in RNA-seq data based on a negative binomial distribution, and in paired data based on a beta-binomial distribution. Here we show how the same empirical Bayesian approach can be applied to any parametric distribution, removing the need for lengthy development of novel methods for differently distributed data. Comparisons with existing methods developed to address specific problems in high-throughput biological data show that these generic methods can achieve equivalent or better performance. A number of enhancements to the basic algorithm are also presented to increase flexibility and reduce computational costs.
Availability and implementation:
The methods are implemented in the R baySeq (v2) package, available on Bioconductor http://www.bioconductor.org/packages/release/bioc/html/baySeq.html.
Supplementary information:
Supplementary data are available at Bioinformatics online. This work was supported by European Research Council Advanced Investigator Grant ERC-2013-AdG 340642 - TRIBE. This is the author accepted manuscript. The final version is available from Oxford University Press via http://dx.doi.org/10.1093/bioinformatics/btv56