3,563 research outputs found
A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data
Investigating the pleiotropic effects of genetic variants can increase
statistical power, provide important information to achieve deep understanding
of the complex genetic structures of disease, and offer powerful tools for
designing effective treatments with fewer side effects. However, the current
multiple phenotype association analysis paradigm lacks breadth (number of
phenotypes and genetic variants jointly analyzed at the same time) and depth
(hierarchical structure of phenotype and genotypes). A key issue for high
dimensional pleiotropic analysis is to effectively extract informative internal
representation and features from high dimensional genotype and phenotype data.
To explore multiple levels of representations of genetic variants, learn their
internal patterns involved in the disease development, and overcome critical
barriers in advancing the development of novel statistical methods and
computational algorithms for genetic pleiotropic analysis, we proposed a new
framework referred to as a quadratically regularized functional CCA (QRFCCA)
for association analysis which combines three approaches: (1) quadratically
regularized matrix factorization, (2) functional data analysis and (3)
canonical correlation analysis (CCA). Large-scale simulations show that the
QRFCCA has a much higher power than that of the nine competing statistics while
retaining the appropriate type 1 errors. To further evaluate performance, the
QRFCCA and nine other statistics are applied to the whole genome sequencing
dataset from the TwinsUK study. We identify a total of 79 genes with rare
variants and 67 genes with common variants significantly associated with the 46
traits using QRFCCA. The results show that the QRFCCA substantially outperforms
the nine other statistics.Comment: 64 pages including 12 figure
A Comparison of Relaxations of Multiset Cannonical Correlation Analysis and Applications
Canonical correlation analysis is a statistical technique that is used to
find relations between two sets of variables. An important extension in pattern
analysis is to consider more than two sets of variables. This problem can be
expressed as a quadratically constrained quadratic program (QCQP), commonly
referred to Multi-set Canonical Correlation Analysis (MCCA). This is a
non-convex problem and so greedy algorithms converge to local optima without
any guarantees on global optimality. In this paper, we show that despite being
highly structured, finding the optimal solution is NP-Hard. This motivates our
relaxation of the QCQP to a semidefinite program (SDP). The SDP is convex, can
be solved reasonably efficiently and comes with both absolute and
output-sensitive approximation quality. In addition to theoretical guarantees,
we do an extensive comparison of the QCQP method and the SDP relaxation on a
variety of synthetic and real world data. Finally, we present two useful
extensions: we incorporate kernel methods and computing multiple sets of
canonical vectors
Large-scale Multi-label Learning with Missing Labels
The multi-label classification problem has generated significant interest in
recent years. However, existing approaches do not adequately address two key
challenges: (a) the ability to tackle problems with a large number (say
millions) of labels, and (b) the ability to handle data with missing labels. In
this paper, we directly address both these problems by studying the multi-label
problem in a generic empirical risk minimization (ERM) framework. Our
framework, despite being simple, is surprisingly able to encompass several
recent label-compression based methods which can be derived as special cases of
our method. To optimize the ERM problem, we develop techniques that exploit the
structure of specific loss functions - such as the squared loss function - to
offer efficient algorithms. We further show that our learning framework admits
formal excess risk bounds even in the presence of missing labels. Our risk
bounds are tight and demonstrate better generalization performance for low-rank
promoting trace-norm regularization when compared to (rank insensitive)
Frobenius norm regularization. Finally, we present extensive empirical results
on a variety of benchmark datasets and show that our methods perform
significantly better than existing label compression based methods and can
scale up to very large datasets such as the Wikipedia dataset
Regularization-free estimation in trace regression with symmetric positive semidefinite matrices
Over the past few years, trace regression models have received considerable
attention in the context of matrix completion, quantum state tomography, and
compressed sensing. Estimation of the underlying matrix from
regularization-based approaches promoting low-rankedness, notably nuclear norm
regularization, have enjoyed great popularity. In the present paper, we argue
that such regularization may no longer be necessary if the underlying matrix is
symmetric positive semidefinite (\textsf{spd}) and the design satisfies certain
conditions. In this situation, simple least squares estimation subject to an
\textsf{spd} constraint may perform as well as regularization-based approaches
with a proper choice of the regularization parameter, which entails knowledge
of the noise level and/or tuning. By contrast, constrained least squares
estimation comes without any tuning parameter and may hence be preferred due to
its simplicity
Penalized Orthogonal Iteration for Sparse Estimation of Generalized Eigenvalue Problem
We propose a new algorithm for sparse estimation of eigenvectors in
generalized eigenvalue problems (GEP). The GEP arises in a number of modern
data-analytic situations and statistical methods, including principal component
analysis (PCA), multiclass linear discriminant analysis (LDA), canonical
correlation analysis (CCA), sufficient dimension reduction (SDR) and invariant
co-ordinate selection. We propose to modify the standard generalized orthogonal
iteration with a sparsity-inducing penalty for the eigenvectors. To achieve
this goal, we generalize the equation-solving step of orthogonal iteration to a
penalized convex optimization problem. The resulting algorithm, called
penalized orthogonal iteration, provides accurate estimation of the true
eigenspace, when it is sparse. Also proposed is a computationally more
efficient alternative, which works well for PCA and LDA problems. Numerical
studies reveal that the proposed algorithms are competitive, and that our
tuning procedure works well. We demonstrate applications of the proposed
algorithm to obtain sparse estimates for PCA, multiclass LDA, CCA and SDR.
Supplementary materials are available online
- …