Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments
A two-groups mixed-effects model for the comparison of (normalized)
microarray data from two treatment groups is considered. Most competing
parametric methods that have appeared in the literature are obtained as special
cases or by minor modification of the proposed model. Approximate maximum
likelihood fitting is accomplished via a fast and scalable algorithm, which we
call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of
treatment × gene interactions, derived from the model, involve shrinkage
estimates of both the interactions and of the gene-specific error variances.
Genes are classified as being associated with treatment based on the posterior
odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our
model-based approach also allows one to declare the non-null status of a gene
by controlling the false discovery rate (FDR). It is shown in a detailed
simulation study that the approach outperforms well-known competitors. We also
apply the proposed methodology to two previously analyzed microarray examples.
Extensions of the proposed method to paired treatments and multiple treatments
are also discussed.

Comment: Published at http://dx.doi.org/10.1214/10-STS339 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).
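The abstract does not spell out the mechanics of the classification rule, but the two-groups logic is simple to illustrate. Below is a minimal Python sketch with made-up parameter values standing in for LEMMA's fitted empirical Bayes estimates: the marginal density is the mixture f = pi0*f0 + (1 - pi0)*f1, the local fdr of a statistic z is pi0*f0(z)/f(z), and a gene is declared non-null when its local fdr falls below a fixed cutoff.

```python
import numpy as np
from scipy import stats

def local_fdr(z, pi0=0.9, mu1=2.0, sd1=1.0):
    """Local fdr under a two-groups mixture f = pi0*f0 + (1 - pi0)*f1.

    f0 is a standard normal null; f1 is an illustrative N(mu1, sd1)
    alternative. All parameter values are placeholders, not LEMMA's
    fitted estimates.
    """
    f0 = stats.norm.pdf(z, loc=0.0, scale=1.0)   # null density
    f1 = stats.norm.pdf(z, loc=mu1, scale=sd1)   # alternative density
    f = pi0 * f0 + (1.0 - pi0) * f1              # marginal mixture density
    return pi0 * f0 / f

# Declare a gene non-null when its local fdr falls below a fixed cutoff.
z = np.array([0.3, 1.5, 3.2, -2.8, 0.1])
print(local_fdr(z) < 0.20)
```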
MM Algorithms for Minimizing Nonsmoothly Penalized Objective Functions
In this paper, we propose a general class of algorithms for optimizing an
extensive variety of nonsmoothly penalized objective functions that satisfy
certain regularity conditions. The proposed framework utilizes the
majorization-minimization (MM) algorithm as its core optimization engine. The
resulting algorithms rely on iterated soft-thresholding, implemented
componentwise, allowing for fast, stable updating that avoids the need for any
high-dimensional matrix inversion. We establish a local convergence theory for
this class of algorithms under weaker assumptions than previously considered in
the statistical literature. We also demonstrate the exceptional effectiveness
of new acceleration methods, originally proposed for the EM algorithm, in this
class of problems. Simulation results and a microarray data example are
provided to demonstrate the algorithm's capabilities and versatility.

Comment: A revised version of this paper has been published in the Electronic Journal of Statistics.
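As a concrete instance of the iterated soft-thresholding described above, consider the lasso special case: majorizing the least-squares loss by a separable quadratic surrogate yields a closed-form componentwise update with no matrix inversion. A minimal Python sketch (one instance of the MM framework, not its full generality):

```python
import numpy as np

def soft_threshold(x, t):
    """Componentwise soft-thresholding: S(x, t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def mm_lasso(X, y, lam, n_iter=500):
    """MM iterations for 0.5*||y - X b||^2 + lam*||b||_1.

    The quadratic loss is majorized at the current iterate with curvature
    L >= the largest eigenvalue of X'X, so each update is a soft-threshold.
    """
    L = np.linalg.norm(X, 2) ** 2          # spectral-norm Lipschitz bound
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)           # gradient of the smooth part
        b = soft_threshold(b - grad / L, lam / L)
    return b
```

The acceleration schemes mentioned in the abstract would wrap around this basic iteration rather than change its form.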
Online Updating of Statistical Inference in the Big Data Setting
We present statistical methods for big data arising from online analytical
processing, where large amounts of data arrive in streams and require fast
analysis without storage/access to the historical data. In particular, we
develop iterative estimating algorithms and statistical inferences for linear
models and estimating equations that update as new data arrive. These
algorithms are computationally efficient, minimally storage-intensive, and
allow for possible rank deficiencies in the subset design matrices due to
rare-event covariates. Within the linear model setting, the proposed
online-updating framework leads to predictive residual tests that can be used
to assess the goodness-of-fit of the hypothesized model. We also propose a new
online-updating estimator under the estimating equation setting. Theoretical
properties of the goodness-of-fit tests and proposed estimators are examined in
detail. In simulation studies and real data applications, our estimator
compares favorably with competing approaches under the estimating equation
setting.

Comment: Submitted to Technometrics.
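For the linear-model case, the heart of such an online-updating scheme can be sketched in a few lines: accumulate the sufficient statistics X'X and X'y block by block, so historical raw data never need to be stored. The sketch below is only the plain cumulative estimator; it does not implement the paper's handling of rank-deficient subset design matrices (beyond a pseudo-inverse fallback) or its predictive residual tests.

```python
import numpy as np

class OnlineLS:
    """Online least squares via accumulated sufficient statistics."""

    def __init__(self, p):
        self.XtX = np.zeros((p, p))
        self.Xty = np.zeros(p)
        self.n = 0

    def update(self, Xk, yk):
        # Fold in the new data block; the raw block can then be discarded.
        self.XtX += Xk.T @ Xk
        self.Xty += Xk.T @ yk
        self.n += len(yk)

    def coef(self):
        # pinv tolerates a singular accumulated X'X (e.g., a rare-event
        # covariate not yet observed); a stand-in, not the paper's remedy.
        return np.linalg.pinv(self.XtX) @ self.Xty
```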
Generalized Wavelet Thresholding: Estimation and Hypothesis Testing with Applications to Array Comparative Genomic Hybridization
Wavelets have gained considerable popularity within the statistical arena in the context of nonparametric regression. When modeling data of the form y = f + \epsilon, the objective is to estimate the unknown 'true' function f with small risk, based on sampled data y contaminated with random (usually Gaussian) noise \epsilon. Wavelet shrinkage and thresholding techniques have proved to be quite effective in recovering the true function f, particularly when f is spatially inhomogeneous.
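For orientation, here is a minimal Python sketch of the standard shrinkage procedure using PyWavelets: decompose, soft-threshold the detail coefficients at the universal threshold (with the noise scale estimated by the median absolute deviation at the finest level, as discussed later in this abstract), and reconstruct. This is the classical recipe, not the generalized procedure the thesis develops.

```python
import numpy as np
import pywt  # PyWavelets

def denoise(y, wavelet="sym8", level=4):
    """Soft-threshold wavelet shrinkage with the universal threshold."""
    coeffs = pywt.wavedec(y, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745    # MAD noise estimate
    t = sigma * np.sqrt(2.0 * np.log(len(y)))         # universal threshold
    coeffs[1:] = [pywt.threshold(d, t, mode="soft") for d in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(y)]
```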
Recently, Johnstone and Silverman (2005b) proposed using empirical Bayes methods for level-dependent threshold selection in wavelet shrinkage. Using the posterior median estimator, their approach amounts to a random thresholding procedure with impressive mean squared error (MSE) results. At each level, their approach considers a two-component mixture prior for each of the wavelet coefficients independently. This mixture prior inherently assumes that the wavelet coefficients are symmetrically distributed about zero.
Depending on the choice of wavelet filter and the features of the true function, the positive coefficients may not match the negative coefficients in either magnitude or number. Inspired by the work of Zhang (2005) and Zhang et al. (2007), this thesis introduces a random generalized thresholding procedure in the wavelet domain that does not require the symmetry assumption; it uses a three-component mixture prior that handles the positive and negative coefficients separately.
It is demonstrated that the proposed generalized wavelet thresholding procedure performs quite well when estimating f from a single sampled realization y. As in Johnstone and Silverman (2005b), the performance of the Maximal Overlap Discrete Wavelet Transform (MODWT) is substantially better than that of the standard Discrete Wavelet Transform (DWT) in terms of MSE and visual quality. An additional advantage for MODWT is that it is well-defined for any number of sampled points N, i.e., N need not be a power of two. The proposed procedure also performs well when estimating f from multiple noisy realizations y_i, i = 1,...,n.
In most, if not all, of the shrinkage and generalized shrinkage techniques considered, the noise standard deviation is assumed to be known and constant across the length of the function. In reality, it is typically not known and must be estimated. In the single realization setting, the estimate is usually taken to be a constant based on the median absolute deviation of the empirical wavelet coefficients at the finest decomposition level. With multiple realizations, there are more estimation options available. Various estimation options for a constant variance are examined via simulation. The results indicate that three of the six estimates considered are reasonable choices. The case of heterogeneous variances across the length of the function is also briefly explored via simulation.
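As an illustration of the multiple-realization setting, the sketch below computes two natural constant-variance estimates: the average of per-realization MAD estimates, and the MAD of the pooled finest-level coefficients. These are illustrative candidates only; the abstract does not specify which six estimates the thesis compares or which three it recommends.

```python
import numpy as np
import pywt

def sigma_mad(d):
    """MAD-based noise scale from finest-level wavelet coefficients."""
    return np.median(np.abs(d)) / 0.6745

def sigma_estimates(ys, wavelet="sym8"):
    """Two candidate constant-variance estimates from realizations ys."""
    finest = [pywt.wavedec(y, wavelet, level=1)[-1] for y in ys]
    averaged = np.mean([sigma_mad(d) for d in finest])  # mean of MADs
    pooled = sigma_mad(np.concatenate(finest))          # pooled MAD
    return averaged, pooled
```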
Finally, an inferential procedure is proposed that first removes noise from individual observations via the generalized wavelet thresholding procedure, and then uses newly proposed F-like statistics (Cui et al., 2005; Hwang and Liu, 2006; Zhou, 2007) to compare populations of sampled observations. To demonstrate its applicability, the aforementioned statistical work is applied to datasets generated from Array Comparative Genomic Hybridization (aCGH) experiments.
A Survey of Statistical Methods and Computing for Big Data
Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard software tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article reviews recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and sequential updating for stream data. The software review focuses on open-source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay.
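As a toy illustration of the divide-and-conquer class, the sketch below splits the data into blocks, fits a logistic regression on each block, and averages the coefficients. Published proposals typically combine block estimates with information-based weights rather than a plain mean, and scikit-learn is assumed here for the per-block fits, so treat this as orientation only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def divide_and_conquer_logit(X, y, n_blocks=8):
    """Naive divide-and-conquer logistic regression (plain averaging).

    Each block must contain both classes for the per-block fit to work;
    a weighted combination would be used in a serious implementation.
    """
    coefs = []
    for Xb, yb in zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks)):
        fit = LogisticRegression(max_iter=1000).fit(Xb, yb)
        coefs.append(np.concatenate([fit.intercept_, fit.coef_.ravel()]))
    return np.mean(coefs, axis=0)   # averaged block estimates
```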
DNA Methylation Signatures Identify Biologically Distinct Subtypes in Acute Myeloid Leukemia
We hypothesized that DNA methylation distributes into specific patterns in cancer cells, which reflect critical biological differences. We therefore examined the methylation profiles of 344 patients with acute myeloid leukemia (AML). Clustering of these patients by methylation data segregated them into 16 groups. Five of these groups defined new AML subtypes that shared no other known feature. In addition, DNA methylation profiles segregated patients with CEBPA aberrations from other subtypes of leukemia, defined four epigenetically distinct forms of AML with NPM1 mutations, and showed that the established AML1-ETO, CBFb-MYH11, and PML-RARA leukemia entities are associated with specific methylation profiles. We report a 15-gene methylation classifier predictive of overall survival in an independent patient cohort (p < 0.001, adjusted for known covariates).
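The abstract does not specify the clustering pipeline (distance metric, linkage, or how the 16 groups were chosen), but the general shape of such an unsupervised step can be sketched on simulated data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical stand-in data: 344 patients x 5,000 methylation probes.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(344, 5000))

Z = linkage(profiles, method="ward")               # hierarchical clustering
groups = fcluster(Z, t=16, criterion="maxclust")   # cut tree into 16 groups
```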
Topics In Penalized Estimation
The use of regularization, or penalization, has become increasingly common in high-dimensional statistical analysis over the past several years, where a common goal is to simultaneously select important variables and estimate their effects. This goal can be achieved by minimizing some parameter-dependent "goodness of fit" function (e.g., the negative log-likelihood) subject to a penalty that promotes sparsity. Penalty functions that are nonsmooth (i.e., not differentiable) at the origin have received substantial attention, arguably beginning with the LASSO (Tibshirani, 1996). This dissertation consists of three parts, each related to penalized estimation. First, a general class of algorithms is proposed for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. The resulting algorithms rely on iterated soft-thresholding, implemented componentwise, allowing for fast, stable updating that avoids the need for any high-dimensional matrix inversion. Local convergence theory is established for this class of algorithms under weaker assumptions than previously considered in the statistical literature. The second portion of this work extends the MM framework to finite mixture regression models, allowing for penalization among the regression coefficients within a potentially unknown number of components. Finally, a hierarchical structure imposed on the penalty parameter provides new motivation for the Minimax Concave Penalty of Zhang (2010). The frequentist and Bayesian risks of the MCP thresholding estimator and several other thresholding estimators are compared and explored in detail.
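The Minimax Concave Penalty mentioned at the end has a well-known closed-form thresholding operator (often called firm thresholding), which is the estimator whose frequentist and Bayesian risks are compared. A short sketch:

```python
import numpy as np

def mcp_threshold(x, lam, gamma=3.0):
    """MCP ('firm') thresholding operator, for gamma > 1.

    Zero inside [-lam, lam], a rescaled soft-threshold on the transition
    region, and the identity beyond gamma*lam.
    """
    soft = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
    return np.where(np.abs(x) <= gamma * lam, soft * gamma / (gamma - 1.0), x)
```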