Kernel PCA for multivariate extremes
We propose kernel PCA as a method for analyzing the dependence structure of
multivariate extremes and demonstrate that it can be a powerful tool for
clustering and dimension reduction. Our work provides some theoretical insight
into the preimages obtained by kernel PCA, demonstrating that under certain
conditions they can effectively identify clusters in the data. We build on
these new insights to characterize rigorously the performance of kernel PCA
based on an extremal sample, i.e., the angular part of random vectors for which
the radius exceeds a large threshold. More specifically, we focus on the
asymptotic dependence of multivariate extremes characterized by the angular or
spectral measure in extreme value theory and provide a careful analysis in the
case where the extremes are generated from a linear factor model. We give
theoretical guarantees on the performance of kernel PCA preimages of such
extremes by leveraging their asymptotic distribution together with Davis-Kahan
perturbation bounds. Our theoretical findings are complemented with numerical
experiments illustrating the finite-sample performance of our methods.
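The extraction of the extremal sample and the subsequent kernel PCA step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-factor Pareto model, the loading matrix, and the kernel bandwidth are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed sample from a linear factor model
# (the paper's setting): Pareto-distributed factors times a loading matrix.
factors = rng.pareto(2.0, size=(2000, 2)) + 1.0
A = np.array([[1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0]])          # illustrative loading matrix
X = factors @ A

# Extremal sample: angular parts of observations whose radius
# (Euclidean norm here) exceeds a high empirical threshold.
radius = np.linalg.norm(X, axis=1)
threshold = np.quantile(radius, 0.95)
angles = X[radius > threshold] / radius[radius > threshold, None]

# Kernel PCA on the angular sample; the preimages of the leading
# components can then be inspected for clusters.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0,
                 fit_inverse_transform=True)
scores = kpca.fit_transform(angles)
preimages = kpca.inverse_transform(scores)
print(scores.shape, preimages.shape)
```

The angular observations live on the unit sphere, which is why a nonlinear (kernel) method is natural here: linear PCA on the sphere conflates radial and angular variation.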
Lazy stochastic principal component analysis
Stochastic principal component analysis (SPCA) has become a popular
dimensionality reduction strategy for large, high-dimensional datasets. We
derive a simplified algorithm, called Lazy SPCA, which has reduced
computational complexity and is better suited for large-scale distributed
computation. We prove that SPCA and Lazy SPCA find the same approximations to
the principal subspace, and that the pairwise distances between samples in the
lower-dimensional space are invariant to whether SPCA is executed lazily or not.
Empirical studies find downstream predictive performance to be identical for
both methods, and superior to random projections, across a range of predictive
models (linear regression, logistic lasso, and random forests). In our largest
experiment with 4.6 million samples, Lazy SPCA reduced 43.7 hours of
computation to 9.9 hours. Overall, Lazy SPCA relies exclusively on matrix
multiplications, besides an operation on a small square matrix whose size
depends only on the target dimensionality.
Comment: To be published in the 2017 IEEE International Conference on Data Mining Workshops (ICDMW).
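The structural claim in the last sentence, a pipeline built only from matrix multiplications plus one cheap operation on a small square matrix, can be sketched with a generic randomized projection. This is not the Lazy SPCA algorithm itself, only a hedged illustration of that computational shape; the function name and parameters are assumptions.

```python
import numpy as np

def sketch_pca(X, k, rng):
    """Illustrative randomized PCA with the computational shape
    described above: large matrix multiplications, plus an
    eigendecomposition of a small k-by-k matrix, where k is
    the target dimensionality."""
    n, d = X.shape
    Omega = rng.standard_normal((d, k))   # random test matrix
    Y = X @ Omega                         # sketch: one big matmul
    G = Y.T @ Y                           # small k-by-k matrix
    _, V = np.linalg.eigh(G)              # cheap k x k eigenproblem
    return Y @ V[:, ::-1]                 # rotate sketch onto its PCs

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 50))
Z = sketch_pca(X, 5, rng)
print(Z.shape)
```

Because the final step is a rotation by an orthogonal matrix, pairwise distances between rows of the sketch are unchanged, echoing the invariance property the abstract proves for SPCA versus Lazy SPCA.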
Characterizing Pathological Deviations from Normality using Constrained Manifold-Learning
We propose a technique to represent a pathological pattern as a deviation from normality along a manifold structure. Each subject is represented by a map of local motion abnormalities, obtained from a statistical atlas of motion built from a healthy population. The algorithm learns a manifold from a set of patients with varying degrees of the same pathology. The approach extends recent manifold-learning techniques by constraining the manifold to pass through a physiologically meaningful origin representing a normal motion pattern. Individuals are compared to the manifold population through a distance that combines a mapping to the manifold and the path along the manifold to reach its origin. The method is applied in the context of cardiac resynchronization therapy (CRT), focusing on a specific motion pattern of intra-ventricular dyssynchrony called septal flash (SF). We estimate the manifold from 50 CRT candidates with SF and test it on 38 CRT candidates and 21 healthy volunteers. Experiments highlight the need for nonlinear techniques to learn the studied data, and the relevance of the computed distance for comparing individuals to a specific pathological pattern.
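The two-part distance described above (a mapping to the manifold plus the path along the manifold back to the origin) can be sketched with a k-NN graph and shortest paths. The synthetic curve, the neighborhood size, and the nearest-sample mapping are all simplifying assumptions, not the paper's method.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)

# Hypothetical stand-in for the learned manifold: samples along a
# 1-D curve in 3-D, with index 0 playing the role of the origin
# (the normal motion pattern the manifold is constrained to pass through).
t = np.sort(rng.uniform(0, 3, 100))
t[0] = 0.0
M = np.c_[t, np.sin(t), 0.1 * t**2]

# Geodesic distances along the manifold via a k-NN graph.
graph = kneighbors_graph(M, n_neighbors=6, mode="distance")
geo_to_origin = dijkstra(graph, directed=False, indices=0)

def distance_to_pattern(x):
    # (1) map the new subject to its nearest manifold sample,
    # (2) add the path along the manifold back to the origin.
    d = np.linalg.norm(M - x, axis=1)
    j = int(np.argmin(d))
    return d[j] + geo_to_origin[j]

print(distance_to_pattern(M[0]))    # the origin itself scores 0
print(distance_to_pattern(M[-1]))   # far along the manifold: large
```

The combined score is small only for subjects that both lie near the manifold and sit close to its normal-motion origin, which is the intuition behind comparing individuals to a specific pathological pattern.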
Sparse Model Selection using Information Complexity
This dissertation studies and uses the application of information complexity to statistical model selection through three different projects. Specifically, we design statistical models that incorporate sparsity features to make the models more explanatory and computationally efficient.
In the first project, we propose a Sparse Bridge Regression model for variable selection when the number of variables is much greater than the number of observations and the model may be misspecified. Numerical simulations and real-world data analyses demonstrate the model's excellent explanatory power in high-dimensional settings.
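The core of bridge regression is a penalized least-squares objective with penalty term lambda * sum(|beta_j|^q); taking 0 < q < 1 encourages sparsity more aggressively than the lasso (q = 1). A minimal sketch of that objective on toy data follows; the optimizer choice, lambda, and q are illustrative assumptions, not the dissertation's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Toy data: only the first two of five coefficients are nonzero.
n, p = 60, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def bridge_objective(beta, lam=1.0, q=0.5):
    # Residual sum of squares plus the bridge penalty |beta_j|^q.
    resid = y - X @ beta
    return resid @ resid + lam * np.sum(np.abs(beta) ** q)

# Derivative-free optimization, since the penalty is nonsmooth at 0.
fit = minimize(bridge_objective, x0=np.zeros(p), method="Powell")
print(np.round(fit.x, 2))
```

With q below 1 the objective is nonconvex, which is why practical bridge-regression solvers (and the sparse variant proposed here) need more care than this direct minimization.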
The second project proposes a novel hybrid modeling method that utilizes a mixture of sparse principal component regression (MIX-SPCR) to segment high-dimensional time series data. Using the MIX-SPCR model, we empirically analyze the S&P 500 index data (from 1999 to 2019) and identify two key change points.
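One building block of the hybrid model, sparse principal component regression on a single segment, can be sketched as follows: sparse PCA yields components supported on few variables, which then serve as regressors. The factor structure, penalty strength, and block loadings are illustrative assumptions; the mixture and change-point machinery of MIX-SPCR are not shown.

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Synthetic segment: 30 observed series driven by 3 latent factors,
# each factor loading on a distinct block of 10 variables.
n, p = 200, 30
F = rng.standard_normal((n, 3))
B = np.zeros((3, p))
B[0, :10] = 1.0
B[1, 10:20] = 1.0
B[2, 20:] = 1.0
X = F @ B + 0.1 * rng.standard_normal((n, p))
y = F[:, 0] - F[:, 1] + 0.1 * rng.standard_normal(n)

# Sparse PCA extracts components supported on few variables ...
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
Z = spca.fit_transform(X)

# ... which are then used as regressors for the response.
reg = LinearRegression().fit(Z, y)
print(round(reg.score(Z, y), 3))
```

In the mixture setting, one such sparse PCR model is fitted per regime, and the segmentation (the change points) is chosen by the information-complexity criterion the dissertation develops.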
The third project investigates the use of nonlinear features in the Sparse Kernel Factor Analysis (SKFA) method to derive the information criterion. Using a variety of wide datasets, we demonstrate the benefits of SKFA in the nonlinear representation and classification of data. The results show the flexibility and utility of information complexity in such data-modeling problems.