
    Low Rank Approximation in the Presence of Outliers

    We consider the problem of principal component analysis (PCA) in the presence of outliers. Given a d × n matrix A and parameters k and m, the goal is to remove a set of at most m columns of A (the outliers) so as to minimize the rank-k approximation error of the remaining matrix (the inliers). While much of the work on this problem has focused on recovering the rank-k subspace under assumptions on the inliers and outliers, we focus on the approximation problem. Our main result shows that sampling-based methods developed in the outlier-free case give non-trivial guarantees even in the presence of outliers. Using this insight, we develop a simple algorithm with bi-criteria guarantees. Further, unlike similar formulations for clustering, we show that bi-criteria guarantees are unavoidable for the problem under appropriate complexity assumptions.
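
    To make the objective concrete: the sketch below fits a rank-k subspace to the current inliers, discards the m columns that fit it worst, and repeats. This is an illustrative alternating baseline in plain NumPy (function names are ours), not the paper's sampling-based algorithm.

```python
import numpy as np

def rank_k_error(A, k):
    """Squared Frobenius error of the best rank-k approximation of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s[k:] ** 2))

def remove_outliers_then_fit(A, k, m, iters=10):
    """Alternating heuristic for the objective above: fit a rank-k
    subspace to the current inliers, mark the m columns with the
    largest residual as outliers, and repeat until the inlier set
    stabilizes. Illustrative baseline, not the paper's algorithm."""
    d, n = A.shape
    inliers = np.arange(n)
    for _ in range(iters):
        U, _, _ = np.linalg.svd(A[:, inliers], full_matrices=False)
        Uk = U[:, :k]                                   # current subspace
        resid = np.sum((A - Uk @ (Uk.T @ A)) ** 2, 0)   # per-column error
        keep = np.sort(np.argsort(resid)[: n - m])      # n - m best columns
        if np.array_equal(keep, inliers):
            break
        inliers = keep
    return inliers, rank_k_error(A[:, inliers], k)
```

    A bi-criteria guarantee, in this setting, means the algorithm may be allowed to discard somewhat more than m columns or use rank somewhat larger than k while still competing with the optimal (m, k) solution.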

    Robust computation of linear models by convex relaxation

    Consider a dataset of vector-valued observations that consists of noisy inliers, which are explained well by a low-dimensional subspace, along with some number of outliers. This work describes a convex optimization problem, called REAPER, that can reliably fit a low-dimensional model to this type of data. The approach parameterizes linear subspaces using orthogonal projectors, and it relaxes the set of orthogonal projectors to reach the convex formulation. The paper provides an efficient algorithm for solving the REAPER problem, and it documents numerical experiments confirming that REAPER can dependably find linear structure in synthetic and natural data. In addition, when the inliers lie near a low-dimensional subspace, a rigorous theory describes when REAPER can approximate this subspace.
    Comment: Formerly titled "Robust computation of linear models, or How to find a needle in a haystack".
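
    As a rough sketch of the formulation: the relaxed set of orthogonal projectors is {P : 0 ⪯ P ⪯ I, tr P = d}, and the program minimizes the sum of orthogonal distances Σᵢ ‖(I − P)xᵢ‖₂ over it. The snippet below states this program directly with cvxpy under our reading of the abstract; the paper itself provides a more efficient dedicated solver.

```python
import numpy as np
import cvxpy as cp

def reaper(X, d):
    """Fit a d-dimensional subspace to the columns of X (D x n) by
    solving a REAPER-style convex program:
        minimize  sum_i ||(I - P) x_i||_2
        s.t.      0 <= P <= I (in the PSD order), trace(P) = d.
    Illustrative sketch only; see the paper for a faster algorithm."""
    D, _ = X.shape
    P = cp.Variable((D, D), symmetric=True)
    cost = cp.sum(cp.norm(X - P @ X, 2, axis=0))  # column-wise distances
    constraints = [P >> 0, np.eye(D) - P >> 0, cp.trace(P) == d]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    # Round the relaxed solution back to a genuine rank-d projector:
    # take the top-d eigenvectors of the optimal P as a subspace basis.
    w, V = np.linalg.eigh(P.value)
    return V[:, np.argsort(w)[-d:]]
```

    The ℓ₂ (rather than squared ℓ₂) column penalty is what gives the formulation its robustness: large outlier residuals are penalized linearly instead of quadratically.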

    Automatic and adaptive preprocessing for the development of predictive models

    In recent years, there has been increasing interest in extracting valuable information from large amounts of data, whether for making predictions about the future or for inferring unknown values. A multitude of predictive models exists for the most common tasks of classification and regression. However, researchers often assume that data is clean, and far too little attention has been paid to data preprocessing. Although a number of methods are available for individual preprocessing tasks (e.g. outlier detection or feature selection), comprehensive data preparation and cleaning can take between 60% and 80% of the total data mining process time. One of the goals of this research is to make this process faster and more efficient. To this end, an approach for automating the selection and optimisation of multiple preprocessing methods and predictors is proposed.

    A combination of multiple data mining methods forming a workflow is known as a Multi-Component Predictive System (MCPS). Several software platforms, such as Weka and RapidMiner, can create and run MCPSs with a large variety of preprocessing methods and predictors. There is, however, no common mathematical representation of MCPSs. An objective of this thesis is therefore to establish such a representation framework, which allows workflows to be validated before the implementation phase begins on any particular platform. Validation becomes even more relevant when MCPSs are generated automatically.

    To automate the composition and optimisation of MCPSs, a search space is defined consisting of a number of preprocessing methods, predictive models and their hyperparameters. The space is then explored using a Bayesian optimisation strategy within a given time or computational budget. The result is a parametrised sequence of methods which, after training, forms a complete predictive system (see the sketch below). The whole process is data-driven and requires no human intervention once started.

    The generated predictive system can then be used to make predictions in an online scenario. The nature of the input data may, however, change over time, so predictive models may need to be updated to capture the new characteristics of the data and limit the loss of predictive performance; preprocessing methods may have to be adapted as well. A novel hybrid strategy combining Bayesian optimisation with common adaptive techniques is proposed to adapt MCPSs automatically. This approach performs a global adaptation of the MCPS. In some situations, though, updating the whole predictive system is costly when only a small adjustment is needed, while the consequences of adapting a single component can still be significant. The thesis therefore also analyses the impact of adapting individual components in an MCPS and proposes an approach for propagating changes through the system.

    This thesis was initiated by a joint research project with a chemical production company, which provided several datasets exhibiting raw data issues common in the process industry. The final part of the thesis evaluates the feasibility of applying these automatic techniques to building and maintaining predictive models for real chemical production processes.
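
    As a toy illustration of the MCPS idea, the sketch below assembles preprocessing components and a predictor into a single scikit-learn Pipeline and searches their joint hyperparameter space. Random search stands in here for the Bayesian optimisation strategy described in the thesis, and all component and parameter choices are illustrative assumptions, not the thesis's actual search space.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# A toy MCPS: a chain of preprocessing components ending in a predictor.
pipeline = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("select", SelectKBest()),
    ("predict", RandomForestClassifier(random_state=0)),
])

# Joint search space over the hyperparameters of every component.
# Random search stands in for the Bayesian optimiser used in the thesis.
space = {
    "impute__strategy": ["mean", "median"],
    "select__k": [5, 10, 15],
    "predict__n_estimators": [50, 100, 200],
    "predict__max_depth": [3, 5, None],
}

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
search = RandomizedSearchCV(pipeline, space, n_iter=20, cv=3, random_state=0)
search.fit(X, y)  # composition + optimisation within a fixed budget
print(search.best_params_, search.best_score_)
```

    The returned best pipeline is itself a trained, deployable predictive system, which mirrors the thesis's point that the output of the search is a complete parametrised MCPS rather than a bare model.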