Sparse CCA: Adaptive Estimation and Computational Barriers
Canonical correlation analysis is a classical technique for exploring the
relationship between two sets of variables. It has important applications in
analyzing high-dimensional datasets originating from genomics, imaging, and other
fields. This paper considers adaptive minimax and computationally tractable
estimation of leading sparse canonical coefficient vectors in high dimensions.
First, we establish separate minimax estimation rates for canonical coefficient
vectors of each set of random variables under no structural assumption on
marginal covariance matrices. Second, we propose a computationally feasible
estimator to attain the optimal rates adaptively under an additional sample
size condition. Finally, we show that a sample size condition of this kind is
needed for any randomized polynomial-time estimator to be consistent, assuming
hardness of certain instances of the Planted Clique detection problem. This
lower bound is faithful to the Gaussian models used in the paper. As a byproduct,
we
obtain the first computational lower bounds for sparse PCA under the Gaussian
single spiked covariance model.
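
For readers unfamiliar with the estimation target, the sketch below computes the
leading canonical coefficient vectors of two samples via the classical
whitened-SVD formulation of CCA, followed by an optional hard-thresholding step
as a crude stand-in for sparsity. It is only an illustration under simple
assumptions; it is not the adaptive minimax estimator proposed in the paper, and
the function name, the ridge term eps, and the sparsity level k are hypothetical
choices made for this example.

    import numpy as np

    def leading_canonical_vectors(X, Y, k=None, eps=1e-8):
        # Center the data; rows of X (n x p) and Y (n x q) are i.i.d. observations.
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        n = X.shape[0]
        Sxx = X.T @ X / n + eps * np.eye(X.shape[1])  # small ridge for invertibility
        Syy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
        Sxy = X.T @ Y / n

        # Classical CCA: whiten both blocks and take the leading singular
        # vectors of the whitened cross-covariance matrix.
        Lx = np.linalg.cholesky(Sxx)          # Sxx = Lx @ Lx.T
        Ly = np.linalg.cholesky(Syy)
        Lx_inv = np.linalg.inv(Lx)
        Ly_inv = np.linalg.inv(Ly)
        U, _, Vt = np.linalg.svd(Lx_inv @ Sxy @ Ly_inv.T)
        u = Lx_inv.T @ U[:, 0]                # leading canonical vector for X
        v = Ly_inv.T @ Vt[0, :]               # leading canonical vector for Y

        if k is not None:
            # Crude sparsification: keep only the k largest-magnitude entries.
            for w in (u, v):
                w[np.argsort(np.abs(w))[:-k]] = 0.0
        return u, v

In high dimensions this plug-in computation breaks down because the sample
marginal covariance matrices cannot be inverted reliably, which is exactly the
regime the paper's sparse estimator and its computational barriers address.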
Sparse GCA and Thresholded Gradient Descent
Generalized correlation analysis (GCA) is concerned with uncovering linear
relationships across multiple datasets. It generalizes canonical correlation
analysis that is designed for two datasets. We study sparse GCA when there are
potentially multiple generalized correlation tuples in the data and the loading
matrix has a small number of nonzero rows. It includes sparse CCA and sparse
PCA of correlation matrices as special cases. We first formulate sparse GCA as
generalized eigenvalue problems at both population and sample levels via a
careful choice of normalization constraints. Based on a Lagrangian form of the
sample optimization problem, we propose a thresholded gradient descent
algorithm for estimating GCA loading vectors and matrices in high dimensions.
We derive tight estimation error bounds for estimators generated by the
algorithm with proper initialization. We also demonstrate the prowess of the
algorithm on a number of synthetic datasets.
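
As a rough illustration of the type of iteration described above, the sketch
below runs gradient ascent on a Lagrangian of a row-sparse generalized
eigenvalue problem, maximizing tr(L'AL) subject to L'BL = I, and hard-thresholds
the loading matrix to its s largest rows after every step. The penalty form,
step size eta, multiplier lam, iteration count, and sparsity level s are
illustrative assumptions for the sketch, not the paper's exact updates or
tuning.

    import numpy as np

    def row_hard_threshold(L, s):
        # Keep the s rows of L with the largest Euclidean norm; zero out the rest.
        norms = np.linalg.norm(L, axis=1)
        keep = np.argsort(norms)[-s:]
        out = np.zeros_like(L)
        out[keep] = L[keep]
        return out

    def thresholded_gradient_descent(A, B, L0, s, eta=0.01, lam=1.0, n_iter=500):
        # Gradient ascent on the Lagrangian
        #     f(L) = tr(L' A L) - (lam / 2) * ||L' B L - I||_F^2,
        # with a row hard-thresholding step after every gradient update.
        r = L0.shape[1]
        L = row_hard_threshold(L0, s)
        for _ in range(n_iter):
            grad = 2.0 * A @ L - 2.0 * lam * B @ L @ (L.T @ B @ L - np.eye(r))
            L = row_hard_threshold(L + eta * grad, s)
        return L

In a CCA-type instance, for example, A would hold the cross-covariances and B
the block-diagonal matrix of marginal covariances, which is one way the sparse
CCA and sparse PCA special cases mentioned above arise. As the abstract
emphasizes, the error bounds apply to estimators generated from a proper
initialization L0; a poor starting point can leave the iteration supported on
the wrong rows.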
- …