47 research outputs found

    Kernel PCA for multivariate extremes

    We propose kernel PCA as a method for analyzing the dependence structure of multivariate extremes and demonstrate that it can be a powerful tool for clustering and dimension reduction. Our work provides theoretical insight into the preimages obtained by kernel PCA, demonstrating that under certain conditions they can effectively identify clusters in the data. We build on these new insights to rigorously characterize the performance of kernel PCA based on an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. More specifically, we focus on the asymptotic dependence of multivariate extremes characterized by the angular or spectral measure in extreme value theory and provide a careful analysis in the case where the extremes are generated from a linear factor model. We give theoretical guarantees on the performance of kernel PCA preimages of such extremes by leveraging their asymptotic distribution together with Davis-Kahan perturbation bounds. Our theoretical findings are complemented with numerical experiments illustrating the finite-sample performance of our methods.
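
    The extremal sample described above can be assembled and analyzed with off-the-shelf tools. Below is a minimal sketch, assuming a 95% empirical radius quantile as the threshold and an RBF kernel; both choices, and the synthetic data, are illustrative rather than the paper's exact setup.

        import numpy as np
        from sklearn.decomposition import KernelPCA

        rng = np.random.default_rng(0)
        X = rng.standard_normal((5000, 5))            # placeholder data; the paper uses a linear factor model

        # Extremal sample: angular parts of observations whose radius
        # exceeds a high empirical quantile.
        radius = np.linalg.norm(X, axis=1)
        threshold = np.quantile(radius, 0.95)         # illustrative threshold level
        angles = X[radius > threshold] / radius[radius > threshold, None]

        # Kernel PCA on the angular sample; fit_inverse_transform=True makes
        # sklearn compute approximate preimages of the projected points.
        kpca = KernelPCA(n_components=2, kernel="rbf", fit_inverse_transform=True)
        scores = kpca.fit_transform(angles)
        preimages = kpca.inverse_transform(scores)    # the preimages the analysis studies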

    Lazy stochastic principal component analysis

    Stochastic principal component analysis (SPCA) has become a popular dimensionality reduction strategy for large, high-dimensional datasets. We derive a simplified algorithm, called Lazy SPCA, which has reduced computational complexity and is better suited for large-scale distributed computation. We prove that SPCA and Lazy SPCA find the same approximations to the principal subspace, and that the pairwise distances between samples in the lower-dimensional space are invariant to whether SPCA is executed lazily or not. Empirical studies find downstream predictive performance to be identical for both methods, and superior to random projections, across a range of predictive models (linear regression, logistic lasso, and random forests). In our largest experiment, with 4.6 million samples, Lazy SPCA reduced 43.7 hours of computation to 9.9 hours. Overall, Lazy SPCA relies exclusively on matrix multiplications, besides an operation on a small square matrix whose size depends only on the target dimensionality.
    Comment: To be published in: 2017 IEEE International Conference on Data Mining Workshops (ICDMW).
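
    To make the "matrix multiplications plus one small square matrix" structure concrete, here is a hedged sketch of a randomized PCA in that spirit. It is not the authors' exact Lazy SPCA algorithm: every step is a matrix product except eigendecompositions of k x k matrices.

        import numpy as np

        def sketch_pca(X, k, seed=0):
            """Randomized PCA using only matrix products and k x k eigendecompositions."""
            rng = np.random.default_rng(seed)
            Omega = rng.standard_normal((X.shape[1], k))   # random test matrix
            Y = X @ Omega                                  # sketch of the column space
            w, V = np.linalg.eigh(Y.T @ Y)                 # small k x k Gram matrix
            Q = Y @ (V / np.sqrt(np.maximum(w, 1e-12)))    # orthonormal basis, n x k
            B = Q.T @ X                                    # k x p projected data
            s2, U = np.linalg.eigh(B @ B.T)                # squared singular values
            order = np.argsort(s2)[::-1]
            scores = Q @ (U[:, order] * np.sqrt(np.maximum(s2[order], 0.0)))
            return scores                                  # n x k sample embedding

    The k x k eigendecompositions cost O(k^3) regardless of the number of samples, so the heavy work sits entirely in the products with X, which distribute naturally.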

    Characterizing Pathological Deviations from Normality using Constrained Manifold-Learning

    We propose a technique to represent a pathological pattern as a deviation from normality along a manifold structure. Each subject is represented by a map of local motion abnormalities, obtained from a statistical atlas of motion built from a healthy population. The algorithm learns a manifold from a set of patients with varying degrees of the same pathology. The approach extends recent manifold-learning techniques by constraining the manifold to pass through a physiologically meaningful origin representing a normal motion pattern. Individuals are compared to the manifold population through a distance that combines a mapping to the manifold and the path along the manifold to reach its origin. The method is applied in the context of cardiac resynchronization therapy (CRT), focusing on a specific motion pattern of intra-ventricular dyssynchrony called septal flash (SF). We estimate the manifold from 50 CRT candidates with SF and test it on 38 CRT candidates and 21 healthy volunteers. Experiments highlight the need for nonlinear techniques to learn the studied data, and the relevance of the computed distance for comparing individuals to a specific pathological pattern.
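
    The combined distance can be illustrated with standard graph tools. The sketch below is an assumption-laden stand-in for the paper's constrained method: it simply appends a "normal" origin point to the patient sample, builds a k-nearest-neighbor graph, and scores a new subject by its Euclidean mapping to the nearest manifold point plus the geodesic path from that point to the origin.

        import numpy as np
        from sklearn.neighbors import kneighbors_graph
        from scipy.sparse.csgraph import shortest_path

        def distance_to_pathology(manifold_pts, origin, subject, n_neighbors=5):
            # The origin (normal motion pattern) joins the neighborhood graph,
            # crudely mimicking the constraint that the manifold pass through it.
            pts = np.vstack([manifold_pts, origin])
            graph = kneighbors_graph(pts, n_neighbors, mode="distance")
            geo = shortest_path(graph, directed=False)      # geodesic distances
            d_map = np.linalg.norm(manifold_pts - subject, axis=1)
            nearest = int(np.argmin(d_map))                 # mapping onto the manifold
            return d_map[nearest] + geo[nearest, -1]        # plus the path to the origin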

    Sparse Model Selection using Information Complexity

    This dissertation studies and applies information complexity to statistical model selection through three projects. Specifically, we design statistical models that incorporate sparsity features to make them more explanatory and computationally efficient. In the first project, we propose a Sparse Bridge Regression model for variable selection when the number of variables is much greater than the number of observations and model misspecification may occur. The model is demonstrated to have excellent explanatory power in high-dimensional data analysis through numerical simulations and real-world data analysis. The second project proposes a novel hybrid modeling method that utilizes a mixture of sparse principal component regression (MIX-SPCR) to segment high-dimensional time series data. Using the MIX-SPCR model, we empirically analyze S&P 500 index data (from 1999 to 2019) and identify two key change points. The third project investigates the use of nonlinear features in the Sparse Kernel Factor Analysis (SKFA) method to derive the information criterion. Using a variety of wide datasets, we demonstrate the benefits of SKFA in the nonlinear representation and classification of data. The results show the flexibility and utility of information complexity in such data modeling problems.
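
    For readers unfamiliar with the penalty behind the first project, the following sketch shows the generic bridge estimator: least squares plus a lam * sum(|beta|^q) penalty with 0 < q <= 1. It illustrates the penalty family only, not the dissertation's Sparse Bridge Regression procedure or its information-complexity criterion.

        import numpy as np
        from scipy.optimize import minimize

        def bridge_regression(X, y, lam=1.0, q=0.5):
            # Bridge objective: ||y - X beta||^2 + lam * sum(|beta_j|^q).
            # For q < 1 the penalty is nonsmooth and nonconvex, so a
            # derivative-free method is used here for simplicity.
            def objective(beta):
                resid = y - X @ beta
                return resid @ resid + lam * np.sum(np.abs(beta) ** q)

            beta0 = np.linalg.lstsq(X, y, rcond=None)[0]    # min-norm least squares warm start
            return minimize(objective, beta0, method="Powell").x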

    Kernel Methods for Machine Learning with Life Science Applications
