
    Expectile Matrix Factorization for Skewed Data Analysis

    Matrix factorization is a popular approach to solving matrix estimation problems based on partial observations. Existing matrix factorization is based on least squares and aims to yield a low-rank matrix that interprets the conditional sample means given the observations. However, in many real applications with skewed and extreme data, least squares cannot capture the central tendency or the tail distributions of the data, yielding undesired estimates. In this paper, we propose \emph{expectile matrix factorization} by introducing asymmetric least squares, a key concept in expectile regression analysis, into the matrix factorization framework. We propose an efficient algorithm to solve the new problem based on alternating minimization and quadratic programming. We prove that our algorithm converges to a global optimum and exactly recovers the true underlying low-rank matrices when the noise is zero. On synthetic data with skewed noise and on a real-world dataset of web service response times, the proposed scheme achieves lower recovery errors than the existing least-squares matrix factorization method across a wide range of settings.
    Comment: 8-page main text with 5-page supplementary document, published in AAAI 201
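The asymmetric least-squares idea at the heart of this abstract can be sketched in a few lines. This is a minimal NumPy illustration using plain gradient descent rather than the paper's alternating-minimization/quadratic-programming algorithm; `expectile_loss`, `expectile_mf`, and all parameter choices are hypothetical names for this sketch only.

```python
import numpy as np

def expectile_loss(residuals, tau=0.5):
    # Asymmetric least squares: residuals above zero get weight tau,
    # residuals below zero get weight 1 - tau; tau = 0.5 is ordinary OLS.
    w = np.where(residuals >= 0, tau, 1.0 - tau)
    return np.sum(w * residuals ** 2)

def expectile_mf(M, mask, rank=2, tau=0.5, lr=0.02, iters=5000, seed=0):
    # Gradient-descent sketch of expectile matrix factorization:
    # minimize the asymmetric squared loss over the observed entries of M.
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.5 * rng.standard_normal((n, rank))
    V = 0.5 * rng.standard_normal((m, rank))
    for _ in range(iters):
        R = (M - U @ V.T) * mask                   # residuals on observed entries
        G = np.where(R >= 0, tau, 1.0 - tau) * R   # asymmetric weighting
        U, V = U + lr * (G @ V), V + lr * (G.T @ U)
    return U, V
```

With `tau = 0.5` this reduces to ordinary least-squares factorization; `tau > 0.5` penalizes under-estimation more heavily, so the fit tracks an upper expectile of the data rather than the conditional mean.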

    Visualization of Skewed Data: A Tool in R

    In this work we present a visualization tool specifically tailored to skewed data. The technique combines two types of notched boxplots (the usual one, and one tuned for the skewness of the data), a violin plot, a histogram, and a nonparametric density estimate. All plots are drawn along the same value axis, so they are directly comparable. We show that a good deal of information can be extracted from inspecting this tool; in particular, we apply the technique to analyze data from synthetic aperture radar images. We provide the implementation in R.
    Comment: Submitted to the Revista Colombiana de Estadística
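As an illustration of the kind of multi-view panel the abstract describes (not the authors' R tool, which additionally includes a skew-tuned notched boxplot and a density overlay), a rough matplotlib analogue might look like this; `skew_panel` is a made-up name for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def skew_panel(data):
    """Notched boxplot, violin plot, and histogram of one sample,
    sharing a single value axis so the three views are comparable."""
    fig, axes = plt.subplots(1, 3, sharey=True, figsize=(9, 3))
    axes[0].boxplot(data, notch=True)
    axes[0].set_title("notched boxplot")
    axes[1].violinplot(data)
    axes[1].set_title("violin plot")
    axes[2].hist(data, bins=30, orientation="horizontal", density=True)
    axes[2].set_title("histogram")
    return fig, axes

rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.0, scale=1.0, size=500)  # right-skewed toy data
fig, axes = skew_panel(sample)
```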

    Data Mining with Skewed Data


    Set Similarity Search for Skewed Data

    Set similarity join, as well as the corresponding indexing problem set similarity search, are fundamental primitives for managing noisy or uncertain data. For example, these primitives can be used in data cleaning to identify different representations of the same object. In many cases one can represent an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries in such a vector. A set similarity join can then be used to identify those pairs that have an exceptionally large dot product (or intersection, when viewed as sets). We choose to focus on identifying vectors with large Pearson correlation, but results extend to other similarity measures. In particular, we consider the indexing problem of identifying correlated vectors in a set S of vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in (0,1), we need to search for an alpha-correlated vector x in a data structure representing the vectors of S. This kind of similarity search has been intensely studied in worst-case (non-random data) settings. Existing theoretically well-founded methods for set similarity search are often inferior to heuristics that take advantage of skew in the data distribution, i.e., widely differing frequencies of 1s across the d dimensions. The main contribution of this paper is to analyze the set similarity problem under a random data model that reflects the kind of skewed data distributions seen in practice, allowing theoretical results much stronger than what is possible in worst-case settings. Our indexing data structure is a recursive, data-dependent partitioning of vectors inspired by recent advances in set similarity search. Previous data-dependent methods do not seem to allow us to exploit skew in item frequencies, so we believe that our work sheds further light on the power of data dependence.
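The correlation measure used above has a simple closed form for 0-1 vectors: with p and q the fractions of 1s in x and y, the Pearson correlation is (|x ∩ y|/d - pq) / sqrt(p(1-p)q(1-q)). A naive brute-force sketch (with none of the paper's data-dependent indexing) might look like this; `pearson_binary` and `alpha_search` are illustrative names.

```python
import numpy as np

def pearson_binary(x, y):
    # Pearson correlation of two 0-1 vectors expressed through set sizes:
    # r = (|x intersect y| / d - p*q) / sqrt(p(1-p) q(1-q)).
    d = len(x)
    p, q = x.mean(), y.mean()
    overlap = float(np.dot(x, y)) / d   # |x intersect y| / d
    return (overlap - p * q) / np.sqrt(p * (1 - p) * q * (1 - q))

def alpha_search(S, y, alpha):
    # Brute-force baseline for the indexing problem: return the first
    # vector in S whose correlation with the query y reaches alpha.
    for x in S:
        if pearson_binary(x, y) >= alpha:
            return x
    return None
```

The point of the paper is precisely to beat this linear scan on skewed random data; the closed form above is only the similarity measure being indexed.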

    Lambert W random variables - a new family of generalized skewed distributions with applications to risk estimation

    Originating from system theory and an input/output point of view, I introduce a new class of generalized distributions. A parametric nonlinear transformation converts a random variable X into a so-called Lambert W random variable Y, which allows a very flexible approach to modeling skewed data. Its shape depends on the shape of X and a skewness parameter γ. In particular, for symmetric X and nonzero γ the output Y is skewed. Its distribution and density function are particular variants of their input counterparts. Maximum likelihood and method-of-moments estimators are presented, and simulations show that in the symmetric case the additional estimation of γ does not affect the quality of the other parameter estimates. Applications in finance and biomedicine show the relevance of this class of distributions, which is particularly useful for slightly skewed data. A practical by-product of the Lambert W framework is that data can be "unskewed." The R package LambertW (http://cran.r-project.org/web/packages/LambertW) developed by the author is publicly available on CRAN (http://cran.r-project.org).
    Comment: Published at http://dx.doi.org/10.1214/11-AOAS457 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
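The forward and inverse maps can be written down directly: for a standardized input u, the forward transform is z = u·exp(γu), and since γz = (γu)·exp(γu), the principal branch of the Lambert W function recovers u = W(γz)/γ. A minimal sketch using `scipy.special.lambertw`; location/scale handling and the estimation of γ, which the LambertW package provides, are omitted here.

```python
import numpy as np
from scipy.special import lambertw

def lambert_skew(u, gamma):
    # Forward transform of the Lambert W framework: z = u * exp(gamma * u).
    # For symmetric u and gamma > 0 the output is right-skewed.
    return u * np.exp(gamma * u)

def unskew(z, gamma):
    # Inverse via the principal branch of the Lambert W function:
    # z = u exp(gamma u)  =>  gamma z = (gamma u) exp(gamma u)
    #                     =>  u = W(gamma z) / gamma.
    return np.real(lambertw(gamma * z)) / gamma
```

The round trip is exact on the principal branch (i.e., while γu > -1), which covers the "slightly skewed" regime the abstract highlights.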

    PSO-based method for SVM classification on skewed data-sets

    Support Vector Machines (SVM) have shown excellent generalization power in classification problems. However, on skewed data-sets, SVM learns a biased model that hurts classifier performance, and the damage becomes severe when the imbalance ratio is very large. In this paper, a new external balancing method for applying SVM to skewed data-sets is developed. In the first phase of the method, the separating hyperplane is computed. The support vectors are then used to generate the initial population of a PSO algorithm, which improves the population of artificial instances and eliminates noisy instances. Experimental results demonstrate the ability of the proposed method to improve the performance of SVM on imbalanced data-sets.
    Proyecto UAEM 3771/2014/CI