Sparse logistic principal components analysis for binary data
We develop a new principal components analysis (PCA) type dimension reduction
method for binary data. Unlike standard PCA, which is defined on the
observed data, the proposed PCA is defined on the logit transform of the
success probabilities of the binary observations. Sparsity is introduced to the
principal component (PC) loading vectors for enhanced interpretability and more
stable extraction of the principal components. Our sparse PCA is formulated as
solving an optimization problem with a criterion function motivated from a
penalized Bernoulli likelihood. A Majorization-Minimization algorithm is
developed to efficiently solve the optimization problem. The effectiveness of
the proposed sparse logistic PCA method is illustrated by application to a
single nucleotide polymorphism data set and a simulation study.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/10-AOAS327.
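
To make the Majorization-Minimization idea concrete, the following Python sketch fits a rank-one sparse logistic PCA model under standard textbook assumptions: the Bernoulli negative log-likelihood is majorized by a quadratic with curvature bounded by 1/4, and an L1 penalty on the loading vector is handled by soft-thresholding. The function names, the single-component restriction, and the penalty scaling are illustrative choices, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_logistic_pca_rank1(X, lam=2.0, n_iter=200, seed=0):
    """Rank-one sparse logistic PCA sketch for a binary (n x p) matrix X.

    Model: the logit of the success probability is mu + a b^T, with an L1
    penalty on the loading vector b.  Each iteration majorizes the Bernoulli
    negative log-likelihood by a quadratic with curvature 1/4, then minimizes
    the surrogate (least squares for mu and a, soft-thresholding for b).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Q = 2.0 * X - 1.0                            # recode {0,1} as {-1,+1}
    mu = np.zeros(p)
    a = rng.standard_normal(n)
    b = rng.standard_normal(p)
    for _ in range(n_iter):
        theta = mu + np.outer(a, b)              # current logits
        # working response of the quadratic majorizer: theta + 4*Q*sigmoid(-Q*theta)
        Z = theta + 4.0 * Q / (1.0 + np.exp(np.clip(Q * theta, -30.0, 30.0)))
        mu = (Z - np.outer(a, b)).mean(axis=0)
        R = Z - mu                                # centered working response
        a = R @ b / (b @ b + 1e-12)               # least-squares score update
        a /= np.linalg.norm(a) + 1e-12            # keep the scale in b
        b = soft_threshold(R.T @ a, 4.0 * lam) / (a @ a)   # penalized loading update
    return mu, a, b

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_b = np.concatenate([2.0 * rng.standard_normal(5), np.zeros(45)])
    logits = np.outer(rng.standard_normal(100), true_b)
    X = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    mu, a, b = sparse_logistic_pca_rank1(X)
    print("indices of nonzero loadings:", np.flatnonzero(np.abs(b) > 1e-8))
```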
Analyzing Multiple-Probe Microarray: Estimation and Application of Gene Expression Indexes
Gene expression index estimation is an essential step in analyzing multiple-probe microarray data, and various modeling methods have been proposed in this area. Among them, a popular method proposed by Li and Wong (2001) is based on a multiplicative model, which, on the logarithmic scale, is similar to the additive model discussed in Irizarry et al. (2003a). Along this line, Hu et al. (2006) proposed data transformation to improve expression index estimation, based on an ad hoc entropy criterion and a naive grid-search approach. In this work, we re-examine the problem using a new profile-likelihood-based transformation estimation approach that is more statistically elegant and computationally efficient. We demonstrate the applicability of the proposed method using a benchmark Affymetrix U95A spiked-in experiment. Moreover, we introduce a new multivariate expression index and use an empirical study to show its promise in improving model fit and the power to detect differential expression relative to the commonly used univariate expression index. We also discuss two practical issues commonly encountered when applying gene expression indexes: normalization and the choice of summary statistic for detecting differential expression. Our empirical study shows somewhat different findings from those of the MAQC project (MAQC, 2006).
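
The following sketch illustrates the generic idea of profile-likelihood-based transformation estimation, using the Box-Cox family and a simple two-way additive model for transformed probe intensities. For clarity it maximizes over a coarse grid, whereas the approach described above is presented as more efficient than naive grid search; the function names and model structure here are illustrative assumptions, not the method of the paper.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox power transformation (lam = 0 gives the log)."""
    return np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1.0) / lam

def profile_loglik(Y, lam):
    """Profile log-likelihood of the transformation parameter lam for a
    two-way additive model fitted to transformed probe intensities.

    Y is an (arrays x probes) matrix of positive intensities.  For fixed lam,
    the additive array/probe effects are profiled out in closed form (row and
    column means), leaving a Gaussian likelihood in the residual variance plus
    the log-Jacobian of the transformation.
    """
    Z = box_cox(Y, lam)
    fit = Z.mean(axis=1, keepdims=True) + Z.mean(axis=0, keepdims=True) - Z.mean()
    rss = np.sum((Z - fit) ** 2)
    n = Y.size
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.log(Y).sum()

def estimate_transformation(Y, grid=np.linspace(-1.0, 1.5, 51)):
    """Grid maximization of the profile log-likelihood over lam."""
    values = [profile_loglik(Y, lam) for lam in grid]
    return grid[int(np.argmax(values))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arrays = rng.normal(8.0, 0.5, size=(20, 1))
    probes = rng.normal(0.0, 1.0, size=(1, 11))
    # multiplicative (log-additive) toy data
    Y = np.exp(arrays + probes + rng.normal(0.0, 0.2, size=(20, 11)))
    print("estimated lambda:", estimate_transformation(Y))   # expected near 0 (log)
```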
Integrating Data Transformation in Principal Components Analysis
Principal component analysis (PCA) is a popular dimension-reduction method used to reduce the complexity of high-dimensional datasets while retaining their informative aspects. When the data distribution is skewed, data transformation is commonly applied prior to PCA; such a transformation is usually chosen from previous studies, prior knowledge, or trial and error. In this work, we develop a model-based method that integrates data transformation into PCA and finds an appropriate transformation by maximizing the profile likelihood. Extensions of the method to handle functional data and missing values are also developed, and several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples. Supplementary materials for this article are available online.
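
A minimal sketch of the general idea of selecting a transformation by maximum profile likelihood within a PCA-type model: for each candidate Box-Cox parameter, a probabilistic-PCA Gaussian likelihood is maximized in closed form from the sample covariance eigenvalues, and the transformation Jacobian is added so that candidates are comparable. The Box-Cox family, the fixed rank, and the grid search are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox power transformation (lam = 0 gives the log)."""
    return np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1.0) / lam

def ppca_profile_loglik(Y, lam, q=2):
    """Profile log-likelihood of lam under a probabilistic-PCA-style model
    (rank q plus isotropic noise) fitted to the Box-Cox transformed data.

    The maximized Gaussian log-likelihood has a closed form in the sample
    covariance eigenvalues; the log-Jacobian of the transformation is added.
    """
    n, p = Y.shape
    Z = box_cox(Y, lam)
    Zc = Z - Z.mean(axis=0)
    evals = np.linalg.eigvalsh(Zc.T @ Zc / n)[::-1]    # descending eigenvalues
    sigma2 = evals[q:].mean()                          # ML noise variance
    loglik = -0.5 * n * (np.sum(np.log(evals[:q])) +
                         (p - q) * np.log(sigma2) +
                         p * (1.0 + np.log(2.0 * np.pi)))
    return loglik + (lam - 1.0) * np.log(Y).sum()

def choose_lambda(Y, q=2, grid=np.linspace(-1.0, 2.0, 61)):
    """Pick the transformation parameter maximizing the profile log-likelihood."""
    vals = [ppca_profile_loglik(Y, lam, q) for lam in grid]
    return grid[int(np.argmax(vals))]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    scores = rng.standard_normal((200, 2))
    loadings = rng.standard_normal((2, 8))
    latent = scores @ loadings + rng.normal(scale=0.3, size=(200, 8))
    Y = np.exp(0.5 * latent + 2.0)                     # skewed, log-normal-like data
    print("selected lambda:", choose_lambda(Y))        # expected near 0 (log)
```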
Asymptotic optimality and efficient computation of the leave-subject-out cross-validation
Although leave-subject-out cross-validation (CV) has been widely used in
practice for tuning parameter selection in various nonparametric and
semiparametric models of longitudinal data, its theoretical properties are unknown
and solving the associated optimization problem is computationally expensive,
especially when there are multiple tuning parameters. In this paper, by
focusing on the penalized spline method, we show that the leave-subject-out CV
is optimal in the sense that it is asymptotically equivalent to the empirical
squared error loss function minimization. An efficient Newton-type algorithm is
developed to compute the penalty parameters that optimize the CV criterion.
Simulated and real data are used to demonstrate the effectiveness of the
leave-subject-out CV in selecting both the penalty parameters and the working
correlation matrix.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/12-AOS1063.
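
The following sketch shows what leave-subject-out CV looks like for a single-penalty spline smoother of longitudinal data: all observations from one subject are held out together, the penalized spline is refitted on the remaining subjects, and the held-out prediction error is accumulated. The truncated-power basis, the single tuning parameter, and the grid evaluation are simplifications; the paper treats multiple tuning parameters and a Newton-type algorithm for optimizing the criterion.

```python
import numpy as np

def spline_design(t, knots):
    """Truncated-power-basis design matrix for a cubic penalized spline."""
    X = np.column_stack([np.ones_like(t), t, t ** 2, t ** 3])
    Z = np.maximum(t[:, None] - knots[None, :], 0.0) ** 3
    return np.column_stack([X, Z])

def fit_pspline(t, y, knots, lam):
    """Ridge-penalized least squares: only the truncated-power ('wiggly')
    coefficients are penalized, the cubic polynomial part is not."""
    B = spline_design(t, knots)
    P = np.diag([0.0] * 4 + [1.0] * len(knots))
    return np.linalg.solve(B.T @ B + lam * P, B.T @ y)

def leave_subject_out_cv(subjects, t, y, knots, lam):
    """Leave-subject-out CV score: each subject's observations are held out
    together and predicted from a fit to the remaining subjects."""
    score = 0.0
    for s in np.unique(subjects):
        hold = subjects == s
        beta = fit_pspline(t[~hold], y[~hold], knots, lam)
        pred = spline_design(t[hold], knots) @ beta
        score += np.sum((y[hold] - pred) ** 2)
    return score / len(y)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    n_sub, m = 30, 8
    subjects = np.repeat(np.arange(n_sub), m)
    t = rng.uniform(0.0, 1.0, n_sub * m)
    subj_effect = np.repeat(rng.normal(0.0, 0.3, n_sub), m)   # within-subject correlation
    y = np.sin(2.0 * np.pi * t) + subj_effect + rng.normal(0.0, 0.2, n_sub * m)
    knots = np.linspace(0.1, 0.9, 10)
    for lam in (1e-3, 1e-1, 1.0, 10.0, 100.0):
        print(f"lambda={lam:g}  LsoCV={leave_subject_out_cv(subjects, t, y, knots, lam):.4f}")
```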
Robust Estimation of the Correlation Matrix of Longitudinal Data
We propose a double-robust procedure for modeling the correlation matrix of a longitudinal dataset. It is based on an alternative Cholesky decomposition of the form Σ = DLL^⊤D, where D is a diagonal matrix proportional to the square roots of the diagonal entries of Σ and L is a unit lower-triangular matrix that solely determines the correlation matrix. The first robustness is with respect to model misspecification for the innovation variances in D, and the second is robustness to outliers in the data. The latter is handled using heavy-tailed multivariate t-distributions with unknown degrees of freedom. We develop a Fisher scoring algorithm for computing the maximum likelihood estimator of the parameters when the nonredundant and unconstrained entries of (L, D) are modeled parsimoniously using covariates. We compare our results with those based on the modified Cholesky decomposition of the form LD^2L^⊤ using simulations and a real dataset.
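
A small numerical illustration of the stated factorization: rescaling the ordinary Cholesky factor of Σ yields Σ = DLL^⊤D with L unit lower-triangular, and the correlation matrix can be recovered from L alone, which is what separates the variance-related parameters in D from the correlation-related parameters in L. This is only one way to produce factors of that form; the covariate modeling of (L, D), the t-distribution, and the Fisher scoring estimation described above are not attempted here.

```python
import numpy as np

def alternative_cholesky(Sigma):
    """Factor Sigma as D @ L @ L.T @ D with D diagonal and L unit
    lower-triangular, by rescaling the ordinary Cholesky factor."""
    C = np.linalg.cholesky(Sigma)      # Sigma = C @ C.T, C lower-triangular
    D = np.diag(np.diag(C))            # diagonal scales
    L = np.linalg.solve(D, C)          # unit lower-triangular factor
    return D, L

def correlation_from_L(L):
    """Correlation matrix implied by L: normalize L @ L.T to unit diagonal."""
    M = L @ L.T
    s = np.sqrt(np.diag(M))
    return M / np.outer(s, s)

if __name__ == "__main__":
    rng = np.random.default_rng(11)
    A = rng.standard_normal((6, 6))
    Sigma = A @ A.T + 6.0 * np.eye(6)
    D, L = alternative_cholesky(Sigma)
    print(np.allclose(Sigma, D @ L @ L.T @ D))                         # True
    s = np.sqrt(np.diag(Sigma))
    print(np.allclose(correlation_from_L(L), Sigma / np.outer(s, s)))  # True
```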
Functional principal components analysis via penalized rank one approximation
Two existing approaches to functional principal components analysis (FPCA)
are due to Rice and Silverman (1991) and Silverman (1996), both based on
maximizing variance but introducing penalization in different ways. In this
article we propose an alternative approach to FPCA using penalized rank one
approximation to the data matrix. Our contributions are four-fold: (1) by
considering invariance under scale transformation of the measurements, the new
formulation sheds light on how regularization should be performed for FPCA and
suggests an efficient power algorithm for computation; (2) it naturally
incorporates spline smoothing of discretized functional data; (3) the
connection with smoothing splines also facilitates the construction of
cross-validation or generalized cross-validation criteria for smoothing
parameter selection that allow efficient computation; (4) different smoothing
parameters are permitted for different FPCs. The methodology is illustrated
with a real data example and a simulation.
Comment: Published in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/08-EJS218.
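
A rough sketch of a penalized rank-one approximation computed by a power-type algorithm: the score vector and the loading vector are updated in turn, with a squared-second-difference roughness penalty applied to the loading. The specific penalty, the fixed iteration count, and the lack of scale-invariant regularization are simplifications relative to the formulation described above.

```python
import numpy as np

def second_diff_penalty(p):
    """Squared-second-difference roughness penalty matrix for p grid points."""
    D = np.diff(np.eye(p), n=2, axis=0)
    return D.T @ D

def penalized_rank_one(X, lam, n_iter=100):
    """Penalized rank-one approximation of X (curves in rows, common grid),
    alternating a closed-form score update u and a smoothed loading update v."""
    n, p = X.shape
    Omega = second_diff_penalty(p)
    v = np.linalg.svd(X, full_matrices=False)[2][0]       # start from the SVD loading
    for _ in range(n_iter):
        u = X @ v / (v @ v)                                # unpenalized score update
        v = np.linalg.solve((u @ u) * np.eye(p) + lam * Omega, X.T @ u)
    return u, v / (np.linalg.norm(v) + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    grid = np.linspace(0.0, 1.0, 60)
    true_v = np.sin(2.0 * np.pi * grid)
    scores = rng.standard_normal(40)
    X = np.outer(scores, true_v) + rng.normal(scale=0.4, size=(40, 60))
    u, v = penalized_rank_one(X, lam=10.0)
    print("correlation with true component:",
          round(abs(np.corrcoef(v, true_v)[0, 1]), 3))
```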
Robust regularized singular value decomposition with application to mortality data
We develop a robust regularized singular value decomposition (RobRSVD) method
for analyzing two-way functional data. The research is motivated by the
application of modeling human mortality as a smooth two-way function of age
group and year. The RobRSVD is formulated as a penalized loss minimization
problem where a robust loss function is used to measure the reconstruction
error of a low-rank matrix approximation of the data, and an appropriately
defined two-way roughness penalty function is used to ensure smoothness along
each of the two functional domains. By viewing the minimization problem as two
conditional regularized robust regressions, we develop a fast iterative
reweighted least squares algorithm to implement the method. Our implementation
naturally incorporates missing values. Furthermore, our formulation allows
rigorous derivation of leave-one-row/column-out cross-validation and
generalized cross-validation criteria, which enable computationally efficient
data-driven penalty parameter selection. The advantages of the new robust
method over nonrobust ones are shown via extensive simulation studies and the
mortality rate application.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/13-AOAS649.
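
The sketch below conveys the two ingredients of robustness and two-way regularization in a rank-one setting: a Huber-type loss is handled by iteratively reweighted least squares, and second-difference roughness penalties smooth both the row (age) and column (year) directions. The loss, the MAD scale estimate, and the penalty matrices are generic choices for illustration, not the RobRSVD implementation; missing-value handling and cross-validated penalty selection are omitted.

```python
import numpy as np

def second_diff_penalty(p):
    """Squared-second-difference roughness penalty matrix for p grid points."""
    D = np.diff(np.eye(p), n=2, axis=0)
    return D.T @ D

def huber_weights(r, c=1.345):
    """Standard Huber weights: 1 inside the cutoff, c/|r| outside."""
    scale = np.median(np.abs(r)) / 0.6745 + 1e-12          # robust (MAD) scale
    a = np.abs(r) / scale
    return np.where(a <= c, 1.0, c / a)

def robust_regularized_rank_one(X, lam_u=0.5, lam_v=0.5, n_iter=50):
    """Robust, two-way regularized rank-one fit by iteratively reweighted
    least squares: each step solves two conditional weighted ridge problems."""
    n, p = X.shape
    Ou, Ov = second_diff_penalty(n), second_diff_penalty(p)
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    u, v = s[0] * U[:, 0], Vh[0]
    for _ in range(n_iter):
        W = huber_weights(X - np.outer(u, v))              # elementwise robustness weights
        u = np.linalg.solve(np.diag(W @ (v ** 2)) + lam_u * Ou, (W * X) @ v)
        v = np.linalg.solve(np.diag(W.T @ (u ** 2)) + lam_v * Ov, (W * X).T @ u)
    return u, v

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    age, year = np.linspace(0.0, 1.0, 30), np.linspace(0.0, 1.0, 50)
    X = np.outer(np.exp(-age), np.sin(2.0 * np.pi * year)) + rng.normal(scale=0.05, size=(30, 50))
    X[rng.integers(0, 30, 20), rng.integers(0, 50, 20)] += 3.0   # gross outliers
    u, v = robust_regularized_rank_one(X)
    print("correlation of v with true smooth component:",
          round(abs(np.corrcoef(v, np.sin(2.0 * np.pi * year))[0, 1]), 3))
```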
Assessing Protein Conformational Sampling Methods Based on Bivariate Lag-Distributions of Backbone Angles
Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been studied extensively in recent years because of their ability to capture the continuous conformational space of protein structures. The literature has focused on a variety of parametric models of the sequential dependencies between angle pairs along the protein chain. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type for modeling the protein angles? How many components should a mixture model have to accurately parameterize the joint distribution of the angles? And what order of local sequence-structure dependency should a prediction method take into account? We assess the model fits of different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distributions using singular value decomposition. As a result, we develop graphical tools and numerical measures to compare and evaluate the performance of different model fits. Furthermore, we develop a web tool (http://www.stat.tamu.edu/~madoliat/LagSVD) that can be used to produce informative animations.
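
As a toy illustration of examining bivariate lag-distributions with an SVD, the snippet below bins the pairs (angle_t, angle_{t+lag}) of a simulated angle sequence into a two-dimensional histogram for each lag and reports the leading singular values of the binned distribution. The helper names and the wrapped AR(1)-style toy sequence are hypothetical; this is not the LagSVD web tool.

```python
import numpy as np

def lag_histogram(angles, lag, bins=36):
    """Binned bivariate lag-distribution of an angle sequence: a 2-D histogram
    of the pairs (angle_t, angle_{t+lag}) over [-180, 180) degrees."""
    edges = np.linspace(-180.0, 180.0, bins + 1)
    H, _, _ = np.histogram2d(angles[:-lag], angles[lag:], bins=(edges, edges))
    return H / H.sum()

def lag_svd_summary(angles, max_lag=8, bins=36):
    """For each lag, summarize the bivariate lag-distribution by its leading
    singular values; how quickly the spectrum flattens with increasing lag
    hints at the order of the local sequential dependency."""
    out = []
    for lag in range(1, max_lag + 1):
        H = lag_histogram(angles, lag, bins)
        s = np.linalg.svd(H, compute_uv=False)
        out.append((lag, s[:3] / s.sum()))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    # toy angle sequence with short-range dependence (AR(1)-like, then wrapped)
    angles = np.zeros(5000)
    for t in range(1, 5000):
        angles[t] = 0.8 * angles[t - 1] + rng.normal(0.0, 40.0)
    angles = (angles + 180.0) % 360.0 - 180.0
    for lag, top in lag_svd_summary(angles, max_lag=5):
        print(lag, np.round(top, 3))
```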