
    A Near-linear Time Approximation Algorithm for Angle-based Outlier Detection in High-dimensional Data

    Outlier mining in d-dimensional point sets is a fundamental and well-studied data mining task due to its variety of applications. Most such applications arise in high-dimensional domains. A bottleneck of existing approaches is that implicit or explicit assessments of concepts of distance or nearest neighbor deteriorate in high-dimensional data. Following up on the work of Kriegel et al. (KDD ’08), we investigate the use of the angle-based outlier factor in mining high-dimensional outliers. While their algorithm runs in cubic time (with a quadratic-time heuristic), we propose a novel random projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data. Our approach is also suitable for parallel environments, where it achieves a parallel speedup. We introduce a theoretical analysis of the quality of approximation to guarantee the reliability of our estimation algorithm. Empirical experiments on synthetic and real-world data sets demonstrate that our approach is efficient and scalable to very large high-dimensional data sets.
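To make the cubic-time baseline concrete, here is a minimal brute-force sketch of an angle-variance outlier score. It is not the random-projection estimator the abstract proposes, and it is a simplification of the factor of Kriegel et al., which additionally weights each term by the distances involved. The function name `angle_variance_scores` and the list-of-lists input format are assumptions made for illustration.

```python
import math
from itertools import combinations
from statistics import pvariance


def angle_variance_scores(points):
    """Brute-force angle-variance score for every point (cubic time).

    For each point p, compute the cosine of the angle between (a - p) and
    (b - p) over all pairs of other points a, b, and return the variance
    of those cosines. A small variance suggests that p is an outlier.
    Assumes at least three points and no duplicates, since a zero-length
    difference vector would cause a division by zero.
    """
    scores = []
    for i, p in enumerate(points):
        # Unit vectors pointing from p to every other point.
        units = []
        for j, q in enumerate(points):
            if j == i:
                continue
            diff = [qk - pk for qk, pk in zip(q, p)]
            norm = math.sqrt(sum(d * d for d in diff))
            units.append([d / norm for d in diff])
        # Cosine of the angle for each unordered pair of other points.
        cosines = [sum(u * v for u, v in zip(a, b))
                   for a, b in combinations(units, 2)]
        scores.append(pvariance(cosines))
    return scores
```

A point far from the rest of the data sees all other points within a narrow cone, so its cosines are nearly constant and its score is close to zero. The nested loops over all points and all pairs are what make this baseline cubic in the number of points.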

    Robust Subspace Learning: Robust PCA, Robust Subspace Tracking, and Robust Subspace Recovery

    PCA is one of the most widely used dimension reduction techniques. A related, easier problem is "subspace learning" or "subspace estimation". Given relatively clean data, both are easily solved via singular value decomposition (SVD). The problem of subspace learning or PCA in the presence of outliers is called robust subspace learning or robust PCA (RPCA). For long data sequences, if one tries to use a single lower-dimensional subspace to represent the data, the required subspace dimension may end up being quite large. For such data, a better model is to assume that it lies in a low-dimensional subspace that can change over time, albeit gradually. The problem of tracking such data (and the subspaces) while being robust to outliers is called robust subspace tracking (RST). This article provides a magazine-style overview of the entire field of robust subspace learning and tracking. In particular, solutions for three problems are discussed in detail: RPCA via sparse+low-rank matrix decomposition (S+LR), RST via S+LR, and "robust subspace recovery (RSR)". RSR assumes that an entire data vector is either an outlier or an inlier. The S+LR formulation instead assumes that outliers occur on only a few data vector indices and hence are well modeled as sparse corruptions. Comment: To appear, IEEE Signal Processing Magazine, July 201
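As an illustration of the S+LR idea, one widely used convex way to pose the decomposition is shown below. It is given as a sketch and not necessarily the formulation the article discusses. The symbols are notation assumed here: M is the observed data matrix, L the low-rank part, S the sparse outlier part, and λ > 0 a trade-off weight.

```latex
\min_{L,\,S}\ \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad M = L + S
```

Here \|L\|_{*} is the nuclear norm (sum of singular values), which encourages low rank, and \|S\|_{1} is the entrywise ℓ1 norm, which encourages sparsity.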

    Statistical innovations for estimating shape characteristics of biological macromolecules in solution using small-angle x-ray scattering data

    2016 Spring. Includes bibliographical references. Small-angle X-ray scattering (SAXS) is a technique that yields low-resolution images of biological macromolecules by exposing a solution containing the molecule to a powerful X-ray beam. The beam scatters when it interacts with the molecule. The intensity of the scattered beam is recorded on a detector plate at various scattering angles, and contains information on structural characteristics of the molecule in solution. In particular, the radius of gyration (Rg) for a molecule, which is a measure of the spread of its mass, can be estimated from the lowest scattering angles of SAXS data using a regression technique known as Guinier analysis. The analysis requires specification of a range or “window” of scattering angles over which the regression relationship holds. We have thus developed methodology and supporting asymptotic theory for selection of an optimal window, minimum mean square error estimation of the radius of gyration, and estimation of its variance. The theory and methodology are developed using a local polynomial model with autoregressive errors. Simulation studies confirm the quality of the asymptotic approximations and the superior performance of the proposed methodology relative to the accepted standard. We show that the algorithm is applicable to data acquired from proteins, nucleic acids and their complexes, and we demonstrate with examples that the algorithm improves the ability to test biological hypotheses. The radius of gyration is a normalized second moment of the pairwise distance distribution p(r), which describes the relative frequency of inter-atomic distances in the structure of the molecule. By extending the theory to fourth moments, we show that a new parameter ψ can be calculated theoretically from p(r) and estimated from experimental SAXS data, using a method that extends Guinier's Rg estimation procedure. This new parameter yields an enhanced ability to use intensity data to distinguish between two molecules with different but similar Rg values. Analysis of existing structures in the Protein Data Bank (PDB) shows that the theoretical ψ values relate closely to the aspect ratio of a molecular structure. The combined values for Rg and ψ acquired from experimental data provide estimates for the dimensions and associated uncertainties for a standard geometric shape, representing the particle in solution. We have chosen the cylinder as the standard shape and show that a simple, automated procedure gives a cylindrical estimate of a particle of interest. The cylindrical estimate in turn yields a good first approximation to the maximum inter-atomic distance in a molecule, Dmax, an important parameter in shape reconstruction. As with estimation of Rg, estimation of ψ requires specification of a window of angles over which to conduct the higher-order Guinier analysis. We again employ a local polynomial model with autoregressive errors to derive methodology and supporting asymptotic theory for selection of an optimal window, minimum mean square error estimation of the aspect ratio, and estimation of its variance. Recent advances in SAXS data collection and more comprehensive data comparisons have resulted in a great need for automated scripts that analyze SAXS data. Our procedures to estimate Rg and ψ can be automated easily and can thus be used for large suites of SAXS data under various experimental conditions, in an objective and reproducible manner. The new methods are applied to 357 SAXS intensity curves arising from a study on the wild-type nucleosome core particle and its mutants and their behavior under different experimental conditions. The resulting Rg² values constitute a dataset which is then analyzed to account for the complex dependence structure induced by the experimental protocols. The analysis yields powerful scientific inferences and insight into better design of SAXS experiments. Finally, we consider a measurement error problem relevant to the estimation of the radius of gyration. In a SAXS experiment, it is standard to obtain intensity curves at different concentrations of the molecule in solution. Concentration-by-angle interactions may be present in such data, and analysis is complicated by the fact that actual concentration levels are unknown, but are measured with some error. We therefore propose a model and estimation procedure that allows estimation of true concentration ratios and concentration-by-angle interactions, without requiring any information about concentration other than that contained in the SAXS data.
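The basic Guinier regression mentioned in the abstract can be sketched as follows. It relies on the standard Guinier approximation I(q) ≈ I0 · exp(−q² Rg² / 3), so that ln I is linear in q² with slope −Rg²/3. This is a plain ordinary least-squares fit over a fixed window, not the thesis's local polynomial model with autoregressive errors, and it does not attempt the optimal window selection the thesis develops. The function name `guinier_fit` and the index-pair `window` argument are assumptions made for illustration.

```python
import math


def guinier_fit(q, intensity, window):
    """Estimate (Rg, I0) by ordinary least squares on ln I versus q^2.

    window is a hypothetical (start, stop) index pair selecting the
    low-angle points used in the fit. Choosing it well is the problem the
    abstract addresses; here it is simply taken as given.
    """
    start, stop = window
    x = [qi * qi for qi in q[start:stop]]
    y = [math.log(ii) for ii in intensity[start:stop]]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        raise ValueError("non-negative slope: no Guinier region in this window")
    # slope = -Rg^2 / 3 and intercept = ln I0 under the Guinier approximation.
    return math.sqrt(-3.0 * slope), math.exp(intercept)
```

On noise-free synthetic data generated from the Guinier approximation itself, the fit recovers the radius of gyration and I0 up to floating-point error. With real data the estimate depends on the chosen window, which is why window selection matters.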