Sparse logistic principal components analysis for binary data
We develop a new principal components analysis (PCA) type dimension reduction
method for binary data. Different from the standard PCA which is defined on the
observed data, the proposed PCA is defined on the logit transform of the
success probabilities of the binary observations. Sparsity is introduced to the
principal component (PC) loading vectors for enhanced interpretability and more
stable extraction of the principal components. Our sparse PCA is formulated as
solving an optimization problem with a criterion function motivated from a
penalized Bernoulli likelihood. A Majorization--Minimization algorithm is
developed to efficiently solve the optimization problem. The effectiveness of
the proposed sparse logistic PCA method is illustrated by application to a
single nucleotide polymorphism data set and a simulation study.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/10-AOAS327 by the Institute of Mathematical Statistics (http://www.imstat.org).
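As a rough illustration of the MM idea described above (not the paper's actual algorithm), the sketch below majorizes the Bernoulli deviance by a quadratic with curvature bound 1/4, which turns each iteration into a penalized least-squares problem solved with soft-thresholding of the loadings. The rank k, the penalty scaling, the initialization, and the update order are all assumptions.

```python
# Minimal sketch of an MM-style sparse logistic PCA update (assumptions noted above).
import numpy as np
from scipy.special import expit

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_logistic_pca(X, k=2, lam=0.1, n_iter=200, seed=0):
    """X: n x d binary (0/1) matrix. Returns intercepts mu, scores A, sparse loadings B."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = np.zeros(d)
    A = 0.01 * rng.standard_normal((n, k))
    B = 0.01 * rng.standard_normal((d, k))
    for _ in range(n_iter):
        theta = mu + A @ B.T                   # current natural parameters (logits)
        Z = theta + 4.0 * (X - expit(theta))   # working response from the quadratic majorizer
        mu = Z.mean(axis=0)
        Zc = Z - mu
        A = Zc @ B @ np.linalg.pinv(B.T @ B)   # least-squares update of the scores
        for j in range(k):                     # loading update with soft-thresholding
            R = Zc - A @ B.T + np.outer(A[:, j], B[:, j])   # partial residual for component j
            aj = A[:, j]
            denom = aj @ aj + 1e-12
            B[:, j] = soft_threshold(R.T @ aj / denom, 4.0 * lam / denom)
    return mu, A, B
```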
Integrating Data Transformation in Principal Components Analysis
Principal component analysis (PCA) is a popular dimension-reduction method used to reduce the complexity of high-dimensional datasets and extract their informative aspects. When the data distribution is skewed, data transformation is commonly applied before PCA. Such a transformation is usually obtained from previous studies, prior knowledge, or trial and error. In this work, we develop a model-based method that integrates data transformation into PCA and finds an appropriate transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples. Supplementary materials for this article are available online.
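A minimal sketch of how a transformation parameter could be profiled, under assumptions that go beyond the abstract: a single Box-Cox parameter shared across variables is chosen on a grid by maximizing a Gaussian likelihood with a rank-q probabilistic-PCA covariance, including the Jacobian term. The paper's model-based formulation and its functional-data and missing-value extensions are not reproduced here.

```python
# Sketch: profile-likelihood choice of a common Box-Cox transformation before PCA.
import numpy as np

def box_cox(x, lam):
    return np.log(x) if abs(lam) < 1e-8 else (x**lam - 1.0) / lam

def ppca_loglik(Y, q):
    """Maximized Gaussian log-likelihood of centered data under a rank-q PPCA covariance
    (Tipping & Bishop closed form based on sample-covariance eigenvalues)."""
    n, d = Y.shape
    S = np.cov(Y, rowvar=False)
    evals = np.maximum(np.sort(np.linalg.eigvalsh(S))[::-1], 1e-12)
    sigma2 = evals[q:].mean()                        # ML noise variance
    return -0.5 * n * (d * np.log(2 * np.pi)
                       + np.sum(np.log(evals[:q]))
                       + (d - q) * np.log(sigma2)
                       + d)

def select_transformation(X, q=2, grid=None):
    """X: n x d strictly positive data; returns the Box-Cox parameter maximizing
    the profile likelihood (Jacobian term included)."""
    if grid is None:
        grid = np.linspace(-1.0, 2.0, 61)
    log_jac = np.log(X).sum()
    best_lam, best_val = None, -np.inf
    for lam in grid:
        Y = box_cox(X, lam)
        val = ppca_loglik(Y - Y.mean(axis=0), q) + (lam - 1.0) * log_jac
        if val > best_val:
            best_lam, best_val = lam, val
    return best_lam
```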
Analyzing Multiple-Probe Microarray: Estimation and Application of Gene Expression Indexes
Gene expression index estimation is an essential step in analyzing multiple-probe microarray data. Various modeling methods have been proposed in this area. Among them, a popular method proposed by Li and Wong (2001) is based on a multiplicative model, which is similar, on the logarithmic scale, to the additive model discussed in Irizarry et al. (2003a). Along this line, Hu et al. (2006) proposed data transformation to improve expression index estimation, based on an ad hoc entropy criterion and a naive grid-search approach. In this work, we re-examine this problem using a new profile-likelihood-based transformation estimation approach that is more statistically elegant and computationally efficient. We demonstrate the applicability of the proposed method using a benchmark Affymetrix U95A spiked-in experiment. Moreover, we introduce a new multivariate expression index and use an empirical study to show its promise in improving model fit and the power to detect differential expression relative to the commonly used univariate expression index. We also discuss two practical issues commonly encountered when applying gene expression indexes: normalization and the summary statistic used for detecting differential expression. Our empirical study shows somewhat different findings from those of the MAQC project (MAQC, 2006).
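For context, the Li and Wong (2001) multiplicative model referenced above can be fit by alternating least squares, as in the sketch below. This is only the baseline model, not the paper's profile-likelihood transformation approach; the identifiability constraint used and the omission of outlier handling are simplifying assumptions.

```python
# Sketch: alternating-least-squares fit of the multiplicative model y_ij ~ theta_i * phi_j.
import numpy as np

def li_wong_index(Y, n_iter=100, tol=1e-8):
    """Y: arrays x probes matrix of (background-corrected) intensities.
    Returns expression indexes theta (one per array) and probe effects phi."""
    n, J = Y.shape
    phi = Y.mean(axis=0)
    phi = phi / np.sqrt(np.mean(phi**2))                      # identifiability: mean(phi^2) = 1
    theta = np.zeros(n)
    for _ in range(n_iter):
        theta_new = Y @ phi / (phi @ phi)                     # LS update of array (expression) effects
        phi_new = Y.T @ theta_new / (theta_new @ theta_new)   # LS update of probe effects
        scale = np.sqrt(np.mean(phi_new**2))                  # re-impose the constraint
        phi_new, theta_new = phi_new / scale, theta_new * scale
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta, phi = theta_new, phi_new
        if converged:
            break
    return theta, phi
```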
Asymptotic optimality and efficient computation of the leave-subject-out cross-validation
Although the leave-subject-out cross-validation (CV) has been widely used in
practice for tuning parameter selection for various nonparametric and
semiparametric models of longitudinal data, its theoretical properties are unknown
and solving the associated optimization problem is computationally expensive,
especially when there are multiple tuning parameters. In this paper, by
focusing on the penalized spline method, we show that the leave-subject-out CV
is optimal in the sense that it is asymptotically equivalent to the empirical
squared error loss function minimization. An efficient Newton-type algorithm is
developed to compute the penalty parameters that optimize the CV criterion.
Simulated and real data are used to demonstrate the effectiveness of the
leave-subject-out CV in selecting both the penalty parameters and the working
correlation matrix.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/12-AOS1063 by the Institute of Mathematical Statistics (http://www.imstat.org).
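To make the criterion itself concrete, here is a minimal sketch that computes the leave-subject-out CV score for a penalized-spline fit and selects a single penalty parameter by grid search. The paper develops a Newton-type algorithm and handles multiple tuning parameters and working correlations; the truncated-power basis, single parameter, and grid search below are assumptions.

```python
# Sketch: leave-subject-out CV for one penalty parameter in a ridge-type penalized spline.
import numpy as np

def spline_basis(t, knots, degree=2):
    """Truncated-power spline basis evaluated at t."""
    cols = [t**p for p in range(degree + 1)]
    cols += [np.maximum(t - k, 0.0)**degree for k in knots]
    return np.column_stack(cols)

def penalized_fit(B, y, lam, n_poly):
    """Penalized least squares; only the truncated-power coefficients are penalized."""
    D = np.eye(B.shape[1])
    D[:n_poly, :n_poly] = 0.0
    return np.linalg.solve(B.T @ B + lam * D, B.T @ y)

def leave_subject_out_cv(t, y, subject, lam_grid, knots, degree=2):
    """Returns the penalty parameter minimizing the leave-subject-out CV criterion."""
    B = spline_basis(t, knots, degree)
    n_poly = degree + 1
    best_lam, best_cv = None, np.inf
    for lam in lam_grid:
        sse = 0.0
        for s in np.unique(subject):
            train, test = subject != s, subject == s     # hold out all observations of subject s
            beta = penalized_fit(B[train], y[train], lam, n_poly)
            resid = y[test] - B[test] @ beta
            sse += resid @ resid
        cv = sse / len(y)
        if cv < best_cv:
            best_lam, best_cv = lam, cv
    return best_lam
```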
Efficient semiparametric estimation in generalized partially linear additive models for longitudinal/clustered data
We consider efficient estimation of the Euclidean parameters in a generalized
partially linear additive model for longitudinal/clustered data when multiple
covariates need to be modeled nonparametrically, and propose an estimation
procedure based on a spline approximation of the nonparametric part of the
model and the generalized estimating equations (GEE). Although the model in
consideration is natural and useful in many practical applications, the
literature on this model is very limited because of challenges in dealing with
dependent data for nonparametric additive models. We show that the proposed
estimators are consistent and asymptotically normal even if the covariance
structure is misspecified. An explicit consistent estimate of the asymptotic
variance is also provided. Moreover, we derive the semiparametric efficiency
score and information bound under general moment conditions. By showing that
our estimators achieve the semiparametric information bound, we effectively
establish their efficiency in a stronger sense than what is typically
considered for GEE. The derivation of our asymptotic results relies heavily on
the empirical processes tools that we develop for the longitudinal/clustered
data. Numerical results are used to illustrate the finite sample performance of
the proposed estimators.
Comment: Published in Bernoulli (http://isi.cbs.nl/bernoulli/) at http://dx.doi.org/10.3150/12-BEJ479 by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
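The sketch below illustrates only the simplest special case of the estimation strategy described above: an identity-link, working-independence fit with truncated-power spline bases for the additive components and a cluster-robust sandwich covariance for the Euclidean parameters. The paper's general link functions, working correlations, and efficiency results are not reflected here; the basis construction and knot placement are assumptions.

```python
# Sketch: working-independence, identity-link partially linear additive fit with spline bases.
import numpy as np

def additive_spline_design(X, n_knots=5, degree=2):
    """Stack truncated-power bases, one block per nonparametrically modeled covariate."""
    blocks = []
    for j in range(X.shape[1]):
        x = X[:, j]
        knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
        cols = [x**p for p in range(1, degree + 1)]
        cols += [np.maximum(x - k, 0.0)**degree for k in knots]
        blocks.append(np.column_stack(cols))
    return np.hstack(blocks)

def gee_partially_linear(y, Z, X, subject):
    """Z: parametric covariates, X: covariates modeled nonparametrically.
    Returns the Euclidean parameter estimate and its cluster-robust covariance."""
    D = np.column_stack([np.ones(len(y)), Z, additive_spline_design(X)])
    beta = np.linalg.lstsq(D, y, rcond=None)[0]          # working-independence estimating equations
    resid = y - D @ beta
    bread = np.linalg.inv(D.T @ D)
    meat = np.zeros((D.shape[1], D.shape[1]))
    for s in np.unique(subject):
        idx = subject == s
        g = D[idx].T @ resid[idx]                        # per-subject score contribution
        meat += np.outer(g, g)
    cov = bread @ meat @ bread                           # sandwich covariance
    p = Z.shape[1]
    return beta[1:1 + p], cov[1:1 + p, 1:1 + p]
```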
Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors
This paper investigates the cross-correlations across multiple climate model
errors. We build a Bayesian hierarchical model that accounts for the spatial
dependence of individual models as well as cross-covariances across different
climate models. Our method allows for a nonseparable and nonstationary
cross-covariance structure. We also present a covariance approximation approach
to facilitate the computation in the modeling and analysis of very large
multivariate spatial data sets. The covariance approximation consists of two
parts: a reduced-rank part to capture the large-scale spatial dependence, and a
sparse covariance matrix to correct the small-scale dependence error induced by
the reduced rank approximation. We pay special attention to the case that the
second part of the approximation has a block-diagonal structure. Simulation
results of model fitting and prediction show substantial improvement of the
proposed approximation over the predictive process approximation and the
independent blocks analysis. We then apply our computational approach to the
joint statistical modeling of multiple climate model errors.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/11-AOAS478 by the Institute of Mathematical Statistics (http://www.imstat.org).
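A univariate stand-in for the two-part approximation described above: a reduced-rank (knot-based) component for large-scale dependence plus a block-diagonal correction built from the within-block residual covariance. The exponential covariance, the knot locations, and the blocking scheme are illustrative assumptions, and the paper's multivariate cross-covariance construction is much richer than this sketch.

```python
# Sketch: reduced-rank-plus-block-diagonal covariance approximation on a 1-D domain.
import numpy as np

def exp_cov(s1, s2, sigma2=1.0, phi=0.3):
    d = np.abs(s1[:, None] - s2[None, :])
    return sigma2 * np.exp(-d / phi)

def reduced_rank_plus_block(s, knots, blocks, cov=exp_cov):
    """s: observation locations; knots: reduced-rank locations; blocks: block label per location."""
    C_sk = cov(s, knots)
    C_kk = cov(knots, knots)
    C_ss = cov(s, s)
    low_rank = C_sk @ np.linalg.solve(C_kk, C_sk.T)      # large-scale (reduced-rank) part
    resid = C_ss - low_rank                              # small-scale error of the reduced-rank part
    correction = np.zeros_like(resid)
    for b in np.unique(blocks):
        idx = np.where(blocks == b)[0]
        correction[np.ix_(idx, idx)] = resid[np.ix_(idx, idx)]   # keep only within-block residuals
    return low_rank + correction

# Example usage (assumed setup): s = np.linspace(0, 1, 200); knots = np.linspace(0, 1, 20)
# blocks = (s * 10).astype(int); Sigma = reduced_rank_plus_block(s, knots, blocks)
```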
Functional principal components analysis via penalized rank one approximation
Two existing approaches to functional principal components analysis (FPCA)
are due to Rice and Silverman (1991) and Silverman (1996), both based on
maximizing variance but introducing penalization in different ways. In this
article we propose an alternative approach to FPCA using penalized rank one
approximation to the data matrix. Our contributions are four-fold: (1) by
considering invariance under scale transformation of the measurements, the new
formulation sheds light on how regularization should be performed for FPCA and
suggests an efficient power algorithm for computation; (2) it naturally
incorporates spline smoothing of discretized functional data; (3) the
connection with smoothing splines also facilitates construction of
cross-validation or generalized cross-validation criteria for smoothing
parameter selection that allows efficient computation; (4) different smoothing
parameters are permitted for different FPCs. The methodology is illustrated
with a real data example and a simulation.
Comment: Published in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) at http://dx.doi.org/10.1214/08-EJS218 by the Institute of Mathematical Statistics (http://www.imstat.org).
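One possible power-type iteration for a single penalized rank-one component is sketched below, using a second-difference roughness penalty on the loading. The precise criterion, normalization, and GCV-based smoothing-parameter selection developed in the paper are not reproduced; the penalty form and stopping rule are assumptions.

```python
# Sketch: power algorithm for one penalized rank-one component of discretized curves.
import numpy as np

def second_diff_penalty(d):
    D = np.diff(np.eye(d), n=2, axis=0)
    return D.T @ D

def penalized_rank_one(X, alpha, n_iter=100, tol=1e-8):
    """X: n x d matrix of discretized curves (one curve per row).
    Returns scores u = X v and a smooth, unit-norm loading vector v."""
    n, d = X.shape
    S = np.linalg.inv(np.eye(d) + alpha * second_diff_penalty(d))   # smoothing operator
    v = X[0] / np.linalg.norm(X[0])
    for _ in range(n_iter):
        u = X @ v                                   # score update
        v_new = S @ (X.T @ u)                       # smoothed loading update
        v_new = v_new / np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    return X @ v, v

# Subsequent components can be extracted from the deflated matrix X - np.outer(X @ v, v),
# possibly with a different smoothing parameter alpha for each component.
```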
Assessing Protein Conformational Sampling Methods Based on Bivariate Lag-Distributions of Backbone Angles
Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been studied extensively in recent years because of their ability to capture the continuous conformational space of protein structures. The literature has focused on a variety of parametric models for the sequential dependencies between angle pairs along the protein chain. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type for modeling the protein angles? What is a reasonable number of mixture components needed to accurately parameterize the joint distribution of the angles? And what order of local sequence–structure dependency should a prediction method consider? We assess the model fits of different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across lags can be extracted using a technique called lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distributions of the angles through singular value decomposition. As a result, we develop graphical tools and numerical measures to compare and evaluate the performance of different model fits. Furthermore, we provide a web tool (http://www.stat.tamu.edu/∼madoliat/LagSVD) that can be used to produce informative animations.
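A minimal sketch of the LagSVD idea as stated: estimate the bivariate distribution of an angle series and its lag-h values nonparametrically (here with 2-D histograms), vectorize the estimates, stack them over lags, and examine the resulting matrix through an SVD. The bin count, the lag range, and the histogram estimator are assumptions.

```python
# Sketch: SVD of stacked bivariate lag-distributions of an angle series.
import numpy as np

def lag_svd(angles, max_lag=10, bins=36):
    """angles: 1-D array of dihedral/planar angles in degrees on (-180, 180]."""
    edges = np.linspace(-180, 180, bins + 1)
    rows = []
    for h in range(1, max_lag + 1):
        x, y = angles[:-h], angles[h:]
        H, _, _ = np.histogram2d(x, y, bins=[edges, edges], density=True)
        rows.append(H.ravel())                  # vectorized lag-h bivariate distribution
    L = np.vstack(rows)                         # lags x (bins*bins) matrix
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U, s, Vt                             # singular values track how the shape varies with lag

# Example usage with simulated angles (assumed data):
# phi = np.random.vonmises(0.0, 2.0, 5000) * 180 / np.pi
# U, s, Vt = lag_svd(phi)
```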
Functional dynamic factor models with application to yield curve forecasting
Accurate forecasting of zero coupon bond yields for a continuum of maturities
is paramount to bond portfolio management and derivative security pricing. Yet
a universal model for yield curve forecasting has been elusive, and prior
attempts often resulted in a trade-off between goodness of fit and consistency
with economic theory. To address this, herein we propose a novel formulation
which connects the dynamic factor model (DFM) framework with concepts from
functional data analysis: a DFM with functional factor loading curves. This
results in a model capable of forecasting functional time series. Further, in
the yield curve context we show that the model retains economic interpretation.
Model estimation is achieved through an expectation-maximization algorithm,
where the time series parameters and factor loading curves are simultaneously
estimated in a single step. Efficient computing is implemented and a
data-driven smoothing parameter is nicely incorporated. We show that our model
performs very well on forecasting actual yield data compared with existing
approaches, especially in regard to profit-based assessment for an innovative
trading exercise. We further illustrate the viability of our model to
applications outside of yield forecasting.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/12-AOAS551 by the Institute of Mathematical Statistics (http://www.imstat.org).
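The paper estimates the functional factor loading curves and time-series parameters jointly in a single EM step; the two-step stand-in below (smoothed-PCA loading curves, AR(1) factor forecasts) is meant only to convey the structure of a DFM with functional loadings, and every tuning choice is an assumption.

```python
# Sketch: simplified two-step functional dynamic factor model forecast of a yield curve.
import numpy as np

def smooth_curves(V, alpha=1.0):
    """Roughness-penalized smoothing of loading curves sampled on a grid (columns of V)."""
    d = V.shape[0]
    D = np.diff(np.eye(d), n=2, axis=0)
    return np.linalg.solve(np.eye(d) + alpha * D.T @ D, V)

def functional_dfm_forecast(Y, k=3, alpha=1.0):
    """Y: T x m panel of yields (rows: dates, columns: maturities on a grid).
    Returns a one-step-ahead forecast of the full yield curve."""
    mean_curve = Y.mean(axis=0)
    Yc = Y - mean_curve
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    Phi = smooth_curves(Vt[:k].T, alpha)            # m x k smoothed factor loading curves
    F = Yc @ Phi @ np.linalg.inv(Phi.T @ Phi)       # T x k factor scores by projection
    f_next = np.empty(k)
    for j in range(k):                              # AR(1) one-step forecast of each factor
        fj = F[:, j]
        rho = fj[1:] @ fj[:-1] / (fj[:-1] @ fj[:-1])
        f_next[j] = rho * fj[-1]
    return mean_curve + Phi @ f_next
```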