Expectile Matrix Factorization for Skewed Data Analysis
Matrix factorization is a popular approach to solving matrix estimation
problems based on partial observations. Existing matrix factorization methods
are based on least squares and aim to yield a low-rank matrix that interprets
the conditional sample means given the observations. However, in many real
applications with skewed and extreme data, least squares cannot explain their
central tendency or tail distributions, yielding undesired estimates. In this
paper, we propose \emph{expectile matrix factorization} by introducing
asymmetric least squares, a key concept in expectile regression analysis, into
the matrix factorization framework. We propose an efficient algorithm to solve
the new problem based on alternating minimization and quadratic programming. We
prove that our algorithm converges to a global optimum and exactly recovers the
true underlying low-rank matrices when noise is zero. For synthetic data with
skewed noise and a real-world dataset containing web service response times,
the proposed scheme achieves lower recovery errors than the existing matrix
factorization method based on least squares in a wide range of settings.
Comment: 8-page main text with a 5-page supplementary document, published in
AAAI 201
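The asymmetric least-squares (expectile) loss at the heart of this formulation can be sketched as follows; the function name and the weight parameter tau are illustrative, not taken from the paper:

```python
import numpy as np

def expectile_loss(residuals, tau):
    # Asymmetric least squares: squared residuals are weighted by tau
    # when nonnegative and by (1 - tau) when negative; tau = 0.5 recovers
    # ordinary least squares up to a constant factor.
    weights = np.where(residuals >= 0, tau, 1.0 - tau)
    return float(np.sum(weights * residuals ** 2))
```

Minimizing such a loss over a low-rank factorization (e.g. by alternating minimization, as the paper proposes) targets conditional expectiles rather than conditional means, which is what makes the approach robust on skewed data.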
Visualization of Skewed Data: A Tool in R
In this work we present a visualization tool specifically tailored to deal
with skewed data. The technique is based upon the use of two types of notched
boxplots (the usual one, and one which is tuned for the skewness of the data),
the violin plot, the histogram and a nonparametric estimate of the density. The
data are assumed to lie on the same line, so the plots are directly comparable.
that a good deal of information can be extracted from the inspection of this
tool; in particular, we apply the technique to analyze data from synthetic
aperture radar images. We provide the implementation in R.
Comment: Submitted to the Revista Colombiana de Estadística
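One building block of such a display, the nonparametric density estimate, can be sketched with a Gaussian kernel; the function name and the bandwidth handling are illustrative (the actual tool is implemented in R):

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth):
    # Nonparametric density estimate: the average of Gaussian kernels
    # centered at each observation, evaluated at every grid point.
    diffs = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth
```

Overlaying this curve on a histogram, as the tool does, makes the skewness of the data visible without committing to a parametric family.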
Set Similarity Search for Skewed Data
Set similarity join, as well as the corresponding indexing problem set
similarity search, are fundamental primitives for managing noisy or uncertain
data. For example, these primitives can be used in data cleaning to identify
different representations of the same object. In many cases one can represent
an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries
in such a vector. A set similarity join can then be used to identify those
pairs that have an exceptionally large dot product (or intersection, when
viewed as sets). We choose to focus on identifying vectors with large Pearson
correlation, but results extend to other similarity measures. In particular, we
consider the indexing problem of identifying correlated vectors in a set S of
vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in
(0,1), we need to search for an alpha-correlated vector x in a data structure
representing the vectors of S. This kind of similarity search has been
intensely studied in worst-case (non-random data) settings.
Existing theoretically well-founded methods for set similarity search are
often inferior to heuristics that take advantage of skew in the data
distribution, i.e., widely differing frequencies of 1s across the d dimensions.
The main contribution of this paper is to analyze the set similarity problem
under a random data model that reflects the kind of skewed data distributions
seen in practice, allowing theoretical results much stronger than what is
possible in worst-case settings. Our indexing data structure is a recursive,
data-dependent partitioning of vectors inspired by recent advances in set
similarity search. Previous data-dependent methods do not seem to allow us to
exploit skew in item frequencies, so we believe that our work sheds further
light on the power of data dependence.
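For intuition, the Pearson correlation of two 0-1 vectors depends only on the dimension, the two set sizes, and the intersection size (the dot product); a sketch with illustrative names:

```python
import numpy as np

def pearson_01(x, y):
    # Pearson correlation of two 0-1 vectors of dimension d, written in
    # terms of the set sizes |x|, |y| and the intersection size x . y.
    d = x.size
    nx, ny, nxy = int(x.sum()), int(y.sum()), int(x @ y)
    num = d * nxy - nx * ny
    den = np.sqrt(float(nx * (d - nx)) * float(ny * (d - ny)))
    return num / den
```

This is why set similarity search over dot products transfers to the correlation setting considered here.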
Lambert W random variables - a new family of generalized skewed distributions with applications to risk estimation
Originating from system theory and an input/output point of view, I
introduce a new class of generalized distributions. A parametric nonlinear
transformation converts a random variable X into a so-called Lambert W
random variable Y, which allows a very flexible approach to model skewed
data. Its shape depends on the shape of X and a skewness parameter γ.
In particular, for symmetric X and nonzero γ the output Y is skewed.
Its distribution and density function are particular variants of their input
counterparts. Maximum likelihood and method of moments estimators are
presented, and simulations show that in the symmetric case additional
estimation of γ does not affect the quality of other parameter
estimates. Applications in finance and biomedicine show the relevance of this
class of distributions, which is particularly useful for slightly skewed data.
A practical by-product of the Lambert W framework: data can be "unskewed."
The R package LambertW (http://cran.r-project.org/web/packages/LambertW)
developed by the author is publicly available on CRAN
(http://cran.r-project.org).
Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org); DOI: http://dx.doi.org/10.1214/11-AOAS457
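The skewing transform can be sketched for the symmetric-input case; this follows the Lambert W construction in spirit, but the standardization and parameter handling here are simplified and illustrative, not the package's implementation:

```python
import numpy as np

def lambert_w_skew(x, gamma):
    # Standardize the input, then apply u -> u * exp(gamma * u):
    # gamma = 0 leaves the data unchanged; gamma != 0 skews it
    # (to the right for gamma > 0, to the left for gamma < 0).
    mu, sigma = x.mean(), x.std()
    u = (x - mu) / sigma
    return u * np.exp(gamma * u) * sigma + mu
```

Inverting this map ("unskewing") requires the Lambert W function itself; the author's R package LambertW provides that inverse.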
PSO-Based Method for SVM Classification on Skewed Data Sets
Support Vector Machines (SVM) have shown excellent generalization power in classification problems. However, on skewed data sets, SVM learns a biased model that degrades classifier performance, and the damage is severe when the imbalance ratio is very large. In this paper, a new external balancing method for applying SVM to skewed data sets is developed. In the first phase of the method, the separating hyperplane is computed. The support vectors are then used to generate the initial population of a PSO algorithm, which is used to improve the population of artificial instances and to eliminate noisy instances. Experimental results demonstrate the ability of the proposed method to improve the performance of SVM on imbalanced data sets.
Proyecto UAEM 3771/2014/CI
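The PSO component is not specified in detail above; the canonical particle update that such a method builds on can be sketched as follows (all names and parameter values are illustrative, not from the paper):

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    # Canonical PSO update: an inertia term plus a random "cognitive"
    # pull toward the particle's own best position and a "social" pull
    # toward the swarm's global best position.
    r1 = rng.random(pos.shape)
    r2 = rng.random(pos.shape)
    new_vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + new_vel, new_vel
```

In the setting described here, each particle would plausibly encode a candidate artificial minority-class instance seeded from the support vectors.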