Provable Sparse Tensor Decomposition
We propose a novel sparse tensor decomposition method, namely Tensor
Truncated Power (TTP) method, that incorporates variable selection into the
estimation of decomposition components. The sparsity is achieved via an
efficient truncation step embedded in the tensor power iteration. Our method
applies to a broad family of high dimensional latent variable models, including
high dimensional Gaussian mixture and mixtures of sparse regressions. A
thorough theoretical investigation is further conducted. In particular, we show
that the final decomposition estimator is guaranteed to achieve a local
statistical rate, and further strengthen it to the global statistical rate by
introducing a proper initialization procedure. In high dimensional regimes, the
obtained statistical rate significantly improves those shown in the existing
non-sparse decomposition methods. The empirical advantages of TTP are confirmed
in extensive simulations and in two real applications: click-through rate
prediction and high-dimensional gene clustering. Comment: To Appear in JRSS-
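The core update described in the abstract, a tensor power iteration with an embedded truncation step, can be sketched as follows. This is a minimal illustration for a symmetric 3-way tensor, not the authors' full TTP procedure (which also covers deflation and initialization); the function name and defaults are my own.

```python
import numpy as np

def tensor_truncated_power(T, k, iters=100, seed=0):
    """Sketch of one sparse rank-1 step: tensor power iteration with
    hard truncation of the iterate to its k largest-magnitude entries."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        v = np.einsum('ijk,j,k->i', T, u, u)   # tensor-times-vector twice
        v[np.argsort(np.abs(v))[:-k]] = 0.0    # truncation step
        u = v / np.linalg.norm(v)
    return u
```

On a planted sparse rank-1 tensor the iteration recovers the sparse component and its support in a single pass, which is the variable-selection effect the abstract refers to.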
High Dimensional Semiparametric Latent Graphical Model for Mixed Data
Graphical models are commonly used tools for modeling multivariate random
variables. While there exist many convenient multivariate distributions such as
Gaussian distribution for continuous data, mixed data with the presence of
discrete variables or a combination of both continuous and discrete variables
poses new challenges in statistical modeling. In this paper, we propose a
semiparametric model named latent Gaussian copula model for binary and mixed
data. The observed binary data are assumed to be obtained by dichotomizing a
latent variable satisfying the Gaussian copula distribution or the
nonparanormal distribution. The latent Gaussian model with the assumption that
the latent variables are multivariate Gaussian is a special case of the
proposed model. A novel rank-based approach is proposed for both latent graph
estimation and latent principal component analysis. Theoretically, the proposed
methods achieve the same rates of convergence for both precision matrix
estimation and eigenvector estimation, as if the latent variables were
observed. Under similar conditions, the consistency of graph structure recovery
and feature selection for leading eigenvectors is established. The performance
of the proposed methods is numerically assessed through simulation studies, and
the usage of our methods is illustrated by a genetic dataset. Comment: 34 pages, 2 figures, 4 tables
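The rank-based approach the abstract mentions can be illustrated for the continuous (nonparanormal) case, where the latent correlation is recovered from Kendall's tau via the sine bridge. This is a sketch of that one ingredient only; the binary and mixed cases in the paper use different bridge functions, and the function name is mine.

```python
import numpy as np
from scipy.stats import kendalltau

def latent_correlation(X):
    """Rank-based latent correlation estimate for continuous nonparanormal
    columns: Sigma_jk = sin(pi/2 * Kendall's tau_jk). Binary or mixed
    columns would require different bridge functions (not shown)."""
    p = X.shape[1]
    S = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])
            S[j, k] = S[k, j] = np.sin(np.pi / 2 * tau)
    return S
```

Because Kendall's tau is invariant under strictly monotone marginal transformations, the estimate behaves as if the latent Gaussian variables were observed, which is the point of the convergence-rate results in the abstract.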
Truncated Power Method for Sparse Eigenvalue Problems
This paper considers the sparse eigenvalue problem, which is to extract
dominant (largest) sparse eigenvectors with at most k non-zero components. We
propose a simple yet effective solution called truncated power method that can
approximately solve the underlying nonconvex optimization problem. A strong
sparse recovery result is proved for the truncated power method, and this
theory is our key motivation for developing the new algorithm. The proposed
method is tested on applications such as sparse principal component analysis
and the densest k-subgraph problem. Extensive experiments on several
synthetic and real-world large-scale datasets demonstrate the competitive
empirical performance of our method.
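The method as described, a power step followed by truncation to the k largest-magnitude entries and renormalization, admits a very short sketch. The uniform initialization below is a simplification; in practice a warm start helps avoid poor local optima.

```python
import numpy as np

def truncated_power(A, k, iters=200):
    """Truncated power method sketch: power step on symmetric A, hard
    truncation to the k largest-magnitude entries, renormalization."""
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n)   # simple deterministic start (a warm start is better)
    for _ in range(iters):
        y = A @ x
        y[np.argsort(np.abs(y))[:-k]] = 0.0
        x = y / np.linalg.norm(y)
    return x
```

Each iterate is exactly k-sparse by construction, which is what makes the method attractive for sparse PCA compared with post-hoc thresholding of a dense eigenvector.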
Robust sparse Gaussian graphical modeling
Gaussian graphical modeling has been widely used to explore various network
structures, such as gene regulatory networks and social networks. We often use
a penalized maximum likelihood approach with the ℓ1 penalty for learning a
high-dimensional graphical model. However, the penalized maximum likelihood
procedure is sensitive to outliers. To overcome this problem, we introduce a
robust estimation procedure based on the γ-divergence. The proposed
method has a redescending property, which is known as a desirable property in
robust statistics. The parameter estimation procedure is constructed using the
Majorize-Minimization algorithm, which guarantees that the objective function
monotonically decreases at each iteration. Extensive simulation studies showed
that our procedure performed much better than the existing methods, in
particular, when the contamination ratio was large. Two real data analyses were
carried out to illustrate the usefulness of our proposed procedure. Comment: 27 pages
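The redescending behavior the abstract highlights can be seen in a loose sketch of γ-divergence-style estimation of a Gaussian: the MM-type updates amount to reweighting each observation by its current fitted density raised to the power γ, so gross outliers receive weights that decay to zero. This is only the reweighting idea, with constants (such as the (1+γ) covariance factor) omitted, and the function name and defaults are mine, not the paper's.

```python
import numpy as np

def gamma_weighted_mean_cov(X, gamma=0.2, iters=50):
    """Sketch of density-power (gamma-divergence-style) reweighting for a
    Gaussian fit: weights ~ density(x)^gamma redescend to ~0 for outliers."""
    mu = X.mean(axis=0)
    cov = np.cov(X.T)
    for _ in range(iters):
        diff = X - mu
        inv = np.linalg.inv(cov)
        maha = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis distances
        w = np.exp(-0.5 * gamma * maha)                   # density^gamma up to a constant
        w /= w.sum()
        mu = w @ X
        diff = X - mu
        cov = (w[:, None] * diff).T @ diff                # weighted covariance (sketch)
    return mu, cov
```

A few iterations suffice to essentially discard a cluster of gross outliers, whereas the unweighted sample mean is pulled far from the bulk of the data.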
Sparse Generalized Eigenvalue Problem: Optimal Statistical Rates via Truncated Rayleigh Flow
Sparse generalized eigenvalue problem (GEP) plays a pivotal role in a large
family of high-dimensional statistical models, including sparse Fisher's
discriminant analysis, canonical correlation analysis, and sufficient dimension
reduction. Sparse GEP involves solving a non-convex optimization problem. Most
existing methods and theory in the context of specific statistical models that
are special cases of the sparse GEP require restrictive structural assumptions
on the input matrices. In this paper, we propose a two-stage computational
framework to solve the sparse GEP. At the first stage, we solve a convex
relaxation of the sparse GEP. Taking the solution as an initial value, we then
exploit a nonconvex optimization perspective and propose the truncated Rayleigh
flow method (Rifle) to estimate the leading generalized eigenvector. We show
that Rifle converges linearly to a solution with the optimal statistical rate
of convergence for many statistical models. Theoretically, our method
significantly improves upon the existing literature by eliminating structural
assumptions on the input matrices for both stages. To achieve this, our
analysis involves two key ingredients: (i) a new analysis of the gradient based
method on nonconvex objective functions, and (ii) a fine-grained
characterization of the evolution of sparsity patterns along the solution path.
Thorough numerical studies are provided to validate the theoretical results. Comment: To appear in JRSS
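The second stage described in the abstract, a truncated Rayleigh flow, can be sketched as a gradient step on the generalized Rayleigh quotient followed by hard truncation. This is one plausible minimal reading of the update, not the paper's exact algorithm; in particular the paper initializes from the convex-relaxation solution of the first stage, whereas the sketch uses a simple uniform start.

```python
import numpy as np

def rifle(A, B, k, eta=0.1, iters=500):
    """Sketch of truncated Rayleigh flow for max_x (x'Ax)/(x'Bx) subject to
    ||x||_0 <= k: Rayleigh-quotient ascent step, truncation, renormalization."""
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n)                # stand-in for the convex-stage warm start
    for _ in range(iters):
        rho = (x @ A @ x) / (x @ B @ x)        # current Rayleigh quotient
        y = x + eta * (A @ x - rho * (B @ x))  # ascent direction (up to scale)
        y[np.argsort(np.abs(y))[:-k]] = 0.0    # keep the k largest entries
        x = y / np.linalg.norm(y)
    return x
```

For diagonal A and B the iteration converges to the leading sparse generalized eigenvector, illustrating the linear convergence claim in a trivially verifiable setting.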
Post-selection estimation and testing following aggregated association tests
The practice of pooling several individual test statistics to form aggregate
tests is common in many statistical applications where individual tests may be
underpowered. While selection by aggregate tests can serve to increase power,
the selection process invalidates the individual test-statistics, making it
difficult to identify the ones that drive the signal in follow-up inference.
Here, we develop a general approach for valid inference following selection by
aggregate testing. We present novel powerful post-selection tests for the
individual null hypotheses which are exact for the normal model and
asymptotically justified otherwise. Our approach relies on the ability to
characterize the distribution of the individual test statistics after
conditioning on the event of selection. We provide efficient algorithms for
estimation of the post-selection maximum-likelihood estimates and suggest
confidence intervals which rely on a novel switching regime for good coverage
guarantees. We validate our methods via comprehensive simulation studies and
apply them to data from the Dallas Heart Study, demonstrating that single
variant association discovery following selection by an aggregated test is
indeed possible in practice. Comment: 33 pages, 9 figures
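The key idea, characterizing the distribution of an individual statistic conditional on the event of selection, can be illustrated with a crude Monte Carlo sketch: under a global null with independent normal statistics and selection by a sum-type aggregate test, the post-selection tail probability differs markedly from the unconditional one. This is an illustration of the conditioning principle only, not the paper's exact (and far more efficient) procedure; all names and defaults are mine.

```python
import numpy as np

def post_selection_pvalue(z1_obs, m=5, c=3.0, n_sim=200_000, seed=0):
    """Monte Carlo sketch: with m independent N(0,1) statistics under the
    global null, estimate P(Z_1 >= z1_obs | sum(Z) > c), i.e. the tail
    probability conditioned on selection by the aggregate (sum) test."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_sim, m))
    selected = Z.sum(axis=1) > c          # the selection event
    return np.mean(Z[selected, 0] >= z1_obs)
```

Conditioning on a large aggregate statistic shifts the individual statistics upward, so naive unconditional p-values are anti-conservative, which is exactly why the selection process "invalidates the individual test-statistics".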
Projection Algorithms for Non-Convex Minimization with Application to Sparse Principal Component Analysis
We consider concave minimization problems over non-convex sets. Optimization
problems with this structure arise in sparse principal component analysis. We
analyze both a gradient projection algorithm and an approximate Newton
algorithm where the Hessian approximation is a multiple of the identity.
Convergence results are established. In numerical experiments arising in sparse
principal component analysis, it is seen that the performance of the gradient
projection algorithm is very similar to that of the truncated power method and
the generalized power method. In some cases, the approximate Newton algorithm
with a Barzilai-Borwein (BB) Hessian approximation can be substantially faster
than the other algorithms, and can converge to a better solution.
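For the sparse-PCA instance of this problem class, the projection onto the feasible set (unit-norm vectors with at most k non-zero entries) has a closed form: keep the k largest-magnitude entries, then renormalize. The sketch below shows that projection and a plain gradient projection loop; with a unit step it coincides with a truncated power step, which matches the similarity the abstract reports. Function names and step size are illustrative.

```python
import numpy as np

def project_sparse_sphere(y, k):
    """Euclidean projection onto {x : ||x||_0 <= k, ||x||_2 = 1}:
    keep the k largest-magnitude entries, then renormalize."""
    x = y.copy()
    x[np.argsort(np.abs(x))[:-k]] = 0.0
    return x / np.linalg.norm(x)

def gradient_projection(A, k, step=1.0, iters=200):
    """Gradient projection for max x'Ax over the sparse unit sphere
    (equivalently, concave minimization of -x'Ax over a non-convex set)."""
    x = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        x = project_sparse_sphere(x + step * (A @ x), k)
    return x
```

The approximate Newton variant in the abstract replaces the fixed step with a Barzilai-Borwein scalar Hessian approximation, leaving the projection unchanged.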
Methods for Bayesian Variable Selection with Binary Response Data using the EM Algorithm
High-dimensional Bayesian variable selection problems are often solved using
computationally expensive Markov chain Monte Carlo (MCMC) techniques.
Recently, EMVS, a Bayesian variable selection technique based on the EM
algorithm, was developed for continuous data. We extend the EMVS method to binary
data by proposing both a logistic and probit extension. To preserve the
computational speed of EMVS we also implemented the Stochastic Dual Coordinate
Descent (SDCA) algorithm. Further, we conduct two extensive simulation studies
to show the computational speed of both methods. These simulation studies
reveal the power of both methods to quickly identify the correct sparse model.
When these EMVS methods are compared to Stochastic Search Variable Selection
(SSVS), the EMVS methods surpass SSVS both in terms of computational speed and
correctly identifying significant variables. Finally, we illustrate the
effectiveness of both methods on two well-known gene expression datasets. Our
results mirror the results of previous examinations of these datasets with far
less computational cost
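The E-step at the heart of EMVS-style methods has a simple closed form under a spike-and-slab prior: the posterior probability that each coefficient comes from the slab rather than the spike. The sketch below shows that one step for a continuous-style Gaussian spike and slab; the logistic and probit extensions in the abstract change the M-step, not this formula. The variance and prior-weight defaults are illustrative.

```python
import numpy as np

def norm_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * x ** 2 / var) / np.sqrt(2 * np.pi * var)

def emvs_e_step(beta, v0=0.01, v1=10.0, theta=0.5):
    """E-step sketch: posterior probability that each coefficient is drawn
    from the slab N(0, v1) rather than the spike N(0, v0), given prior
    inclusion probability theta."""
    slab = theta * norm_pdf(beta, v1)
    spike = (1 - theta) * norm_pdf(beta, v0)
    return slab / (slab + spike)
```

These inclusion probabilities then weight the penalty in the M-step, so coefficients near zero are shrunk hard (spike) while large coefficients are barely shrunk (slab), which is how the method quickly identifies a sparse model.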
Sampling and multilevel coarsening algorithms for fast matrix approximations
This paper addresses matrix approximation problems for matrices that are
large, sparse and/or that are representations of large graphs. To tackle these
problems, we consider algorithms that are based primarily on coarsening
techniques, possibly combined with random sampling. A multilevel coarsening
technique is proposed which utilizes a hypergraph associated with the data
matrix and a graph coarsening strategy based on column matching. Theoretical
results are established that characterize the quality of the dimension
reduction achieved by a coarsening step, when a proper column matching strategy
is employed. We consider a number of standard applications of this technique as
well as a few new ones. Among the standard applications we first consider the
problem of computing the partial SVD for which a combination of sampling and
coarsening yields significantly improved SVD results relative to sampling
alone. We also consider the column subset selection problem, a popular low-rank
approximation method used in data-related applications, and show how multilevel
coarsening can be adapted for this problem. Similarly, we consider the problem
of graph sparsification and show how coarsening techniques can be employed to
solve it. Numerical experiments illustrate the performance of the methods in
various applications.
Estimation of High-Dimensional Graphical Models Using Regularized Score Matching
Graphical models are widely used to model stochastic dependences among large
collections of variables. We introduce a new method of estimating undirected
conditional independence graphs based on the score matching loss, introduced by
Hyvärinen (2005) and subsequently extended in Hyvärinen (2007). The
regularized score matching method we propose applies to settings with
continuous observations and allows for computationally efficient treatment of
possibly non-Gaussian exponential family models. In the well-explored Gaussian
setting, regularized score matching avoids issues of asymmetry that arise when
applying the technique of neighborhood selection, and compared to existing
methods that directly yield symmetric estimates, the score matching approach
has the advantage that the considered loss is quadratic and gives piecewise
linear solution paths under regularization. Under suitable
irrepresentability conditions, we show that ℓ1-regularized score matching
is consistent for graph estimation in sparse high-dimensional settings. Through
numerical experiments and an application to RNAseq data, we confirm that
regularized score matching achieves state-of-the-art performance in the
Gaussian case and provides a valuable tool for computationally efficient
estimation in non-Gaussian graphical models.
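In the Gaussian case the regularized score matching objective is quadratic in the precision matrix K (which is what yields the piecewise-linear solution paths the abstract mentions), so a minimal proximal-gradient (ISTA) sketch is short. Assuming the objective takes the form 0.5·tr(K S K) − tr(K) + λ‖K‖1 on the off-diagonals, with S the sample covariance; the step size and iteration count are illustrative.

```python
import numpy as np

def regularized_score_matching(S, lam=0.1, step=0.05, iters=500):
    """ISTA sketch for Gaussian regularized score matching: minimize
    0.5*tr(K S K) - tr(K) + lam*||K||_1 (off-diagonal entries only)
    over symmetric precision matrices K."""
    p = S.shape[0]
    K = np.eye(p)
    off = ~np.eye(p, dtype=bool)
    for _ in range(iters):
        grad = 0.5 * (S @ K + K @ S) - np.eye(p)   # gradient of the smooth part
        K = K - step * grad
        # soft-threshold the penalized (off-diagonal) entries
        K[off] = np.sign(K[off]) * np.maximum(np.abs(K[off]) - step * lam, 0.0)
        K = (K + K.T) / 2                          # keep the iterate symmetric
    return K
```

Unlike neighborhood selection, the estimate is symmetric by construction, so no post-hoc AND/OR symmetrization of the graph is needed.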