59,201 research outputs found

    Randomized Robust Subspace Recovery for High Dimensional Data Matrices

    Full text link
    This paper explores and analyzes two randomized designs for robust Principal Component Analysis (PCA) employing low-dimensional data sketching. In one design, a data sketch is constructed using random column sampling followed by low dimensional embedding, while in the other, sketching is based on random column and row sampling. Both designs are shown to bring about substantial savings in complexity and memory requirements for robust subspace learning over conventional approaches that use the full scale data. A characterization of the sample and computational complexity of both designs is derived in the context of two distinct outlier models, namely, sparse and independent outlier models. The proposed randomized approach can provably recover the correct subspace with computational and sample complexity that are almost independent of the size of the data. The results of the mathematical analysis are confirmed through numerical simulations using both synthetic and real data

    Robust, Scalable, and Provable Approaches to High Dimensional Unsupervised Learning

    Get PDF
    This doctoral thesis focuses on three popular unsupervised learning problems: subspace clustering, robust PCA, and column sampling. For the subspace clustering problem, a new transformative idea is presented. The proposed approach, termed Innovation Pursuit, is a new geometrical solution to the subspace clustering problem whereby subspaces are identified based on their relative novelties. A detailed mathematical analysis is provided establishing sufficient conditions for the proposed method to correctly cluster the data points. The numerical simulations with both real and synthetic data demonstrate that Innovation Pursuit notably outperforms the state-of-the-art subspace clustering algorithms. For the robust PCA problem, we focus on both the outlier detection and the matrix decomposition problems. For the outlier detection problem, we present a new algorithm, termed Coherence Pursuit, in addition to two scalable randomized frameworks for the implementation of outlier detection algorithms. The Coherence Pursuit method is the first provable and non-iterative robust PCA method which is provably robust to both unstructured and structured outliers. Coherence Pursuit is remarkably simple and it notably outperforms the existing methods in dealing with structured outliers. In the proposed randomized designs, we leverage the low dimensional structure of the low rank component to apply the robust PCA algorithm to a random sketch of the data as opposed to the full scale data. Importantly, it is analytically shown that the presented randomized designs can make the computation or sample complexity of the low rank matrix recovery algorithm independent of the size of the data. At the end, we focus on the column sampling problem. A new sampling tool, dubbed Spatial Random Sampling, is presented which performs the random sampling in the spatial domain. The most compelling feature of Spatial Random Sampling is that it is the first unsupervised column sampling method which preserves the spatial distribution of the data

    Fast Robust PCA on Graphs

    Get PDF
    Mining useful clusters from high dimensional data has received significant attention of the computer vision and pattern recognition community in the recent years. Linear and non-linear dimensionality reduction has played an important role to overcome the curse of dimensionality. However, often such methods are accompanied with three different problems: high computational complexity (usually associated with the nuclear norm minimization), non-convexity (for matrix factorization methods) and susceptibility to gross corruptions in the data. In this paper we propose a principal component analysis (PCA) based solution that overcomes these three issues and approximates a low-rank recovery method for high dimensional datasets. We target the low-rank recovery by enforcing two types of graph smoothness assumptions, one on the data samples and the other on the features by designing a convex optimization problem. The resulting algorithm is fast, efficient and scalable for huge datasets with O(nlog(n)) computational complexity in the number of data samples. It is also robust to gross corruptions in the dataset as well as to the model parameters. Clustering experiments on 7 benchmark datasets with different types of corruptions and background separation experiments on 3 video datasets show that our proposed model outperforms 10 state-of-the-art dimensionality reduction models. Our theoretical analysis proves that the proposed model is able to recover approximate low-rank representations with a bounded error for clusterable data

    Robust principal component analysis in water quality index development

    Get PDF
    Some statistical procedures already available in literature are employed in developing the water quality index, WQI. The nature of complexity and interdependency that occur in physical and chemical processes of water could be easier explained if statistical approaches were applied to water quality indexing. The most popular statistical method used in developing WQI is the principal component analysis (PCA). In literature, the WQI development based on the classical PCA mostly used water quality data that have been transformed and normalized. Outliers may be considered in or eliminated from the analysis. However, the classical mean and sample covariance matrix used in classical PCA methodology is not reliable if the outliers exist in the data. Since the presence of outliers may affect the computation of the principal component, robust principal component analysis, RPCA should be used. Focusing in Langat River, the RPCA-WQI was introduced for the first time in this study to re-calculate the DOE-WQI. Results show that the RPCA-WQI is capable to capture similar distribution in the existing DOE-WQI
    corecore