Canonical correlation analysis is a widely used multivariate statistical
technique for exploring the relation between two sets of variables. This paper
considers the problem of estimating the leading canonical correlation
directions in high-dimensional settings. Recently, under the assumption that
the leading canonical correlation directions are sparse, various procedures
have been proposed for many high-dimensional applications involving massive
data sets. However, little theoretical justification has been available in
the literature. In this paper, we establish rate-optimal nonasymptotic minimax
estimation with respect to an appropriate loss function for a wide range of
model spaces. Two interesting phenomena are observed. First, the minimax rates
are not affected by the presence of nuisance parameters, namely the covariance
matrices of the two sets of random variables, though they need to be estimated
in the canonical correlation analysis problem. Second, we allow the presence of
the residual canonical correlation directions. However, they do not influence
the minimax rates under a mild condition on the eigengap. A generalized sin-theta
theorem and an empirical process bound for Gaussian quadratic forms under rank
constraint are used to establish the minimax upper bounds, which may be of
independent interest.

Comment: Published at http://dx.doi.org/10.1214/15-AOS1332 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org