Expectile Matrix Factorization for Skewed Data Analysis
Matrix factorization is a popular approach to solving matrix estimation
problems based on partial observations. Existing matrix factorization methods
are based on least squares and aim to yield a low-rank matrix that interprets
the conditional sample means given the observations. However, in many real
applications with skewed and extreme data, least squares cannot explain their
central tendency or tail distributions, yielding undesired estimates. In this
paper, we propose \emph{expectile matrix factorization} by introducing
asymmetric least squares, a key concept in expectile regression analysis, into
the matrix factorization framework. We propose an efficient algorithm to solve
the new problem based on alternating minimization and quadratic programming. We
prove that our algorithm converges to a global optimum and exactly recovers the
true underlying low-rank matrices when noise is zero. For synthetic data with
skewed noise and a real-world dataset containing web service response times,
the proposed scheme achieves lower recovery errors than the existing matrix
factorization method based on least squares in a wide range of settings.
Comment: 8-page main text with a 5-page supplementary document, published in
AAAI 201
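The asymmetric least-squares (expectile) loss at the heart of this formulation can be sketched as follows; the function name and the weight parameter tau are illustrative, not taken from the paper:

```python
import numpy as np

def expectile_loss(residuals, tau):
    # Asymmetric least squares: squared residuals are weighted by tau
    # when nonnegative and by (1 - tau) when negative; tau = 0.5 recovers
    # ordinary least squares up to a constant factor.
    weights = np.where(residuals >= 0, tau, 1.0 - tau)
    return float(np.sum(weights * residuals ** 2))
```

Minimizing such a loss over a low-rank factorization (e.g. by alternating minimization, as the paper proposes) targets conditional expectiles rather than conditional means, which is what makes the approach robust on skewed data.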
Visualization of Skewed Data: A Tool in R
In this work we present a visualization tool specifically tailored to deal
with skewed data. The technique is based upon the use of two types of notched
boxplots (the usual one, and one which is tuned for the skewness of the data),
the violin plot, the histogram and a nonparametric estimate of the density. The
data are assumed to lie on the same line, so the plots are directly comparable.
that a good deal of information can be extracted from the inspection of this
tool; in particular, we apply the technique to analyze data from synthetic
aperture radar images. We provide the implementation in R.
Comment: Submitted to the Revista Colombiana de Estadística
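One building block of such a display, the nonparametric density estimate, can be sketched with a Gaussian kernel; the function name and the bandwidth handling are illustrative (the actual tool is implemented in R):

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth):
    # Nonparametric density estimate: the average of Gaussian kernels
    # centered at each observation, evaluated at every grid point.
    diffs = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth
```

Overlaying this curve on a histogram, as the tool does, makes the skewness of the data visible without committing to a parametric family.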
Set Similarity Search for Skewed Data
Set similarity join, as well as the corresponding indexing problem set
similarity search, are fundamental primitives for managing noisy or uncertain
data. For example, these primitives can be used in data cleaning to identify
different representations of the same object. In many cases one can represent
an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries
in such a vector. A set similarity join can then be used to identify those
pairs that have an exceptionally large dot product (or intersection, when
viewed as sets). We choose to focus on identifying vectors with large Pearson
correlation, but results extend to other similarity measures. In particular, we
consider the indexing problem of identifying correlated vectors in a set S of
vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in
(0,1), we need to search for an alpha-correlated vector x in a data structure
representing the vectors of S. This kind of similarity search has been
intensely studied in worst-case (non-random data) settings.
Existing theoretically well-founded methods for set similarity search are
often inferior to heuristics that take advantage of skew in the data
distribution, i.e., widely differing frequencies of 1s across the d dimensions.
The main contribution of this paper is to analyze the set similarity problem
under a random data model that reflects the kind of skewed data distributions
seen in practice, allowing theoretical results much stronger than what is
possible in worst-case settings. Our indexing data structure is a recursive,
data-dependent partitioning of vectors inspired by recent advances in set
similarity search. Previous data-dependent methods do not seem to allow us to
exploit skew in item frequencies, so we believe that our work sheds further
light on the power of data dependence.
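For intuition, the Pearson correlation of two 0-1 vectors depends only on the dimension, the two set sizes, and the intersection size (the dot product); a sketch with illustrative names:

```python
import numpy as np

def pearson_01(x, y):
    # Pearson correlation of two 0-1 vectors of dimension d, written in
    # terms of the set sizes |x|, |y| and the intersection size x . y.
    d = x.size
    nx, ny, nxy = int(x.sum()), int(y.sum()), int(x @ y)
    num = d * nxy - nx * ny
    den = np.sqrt(float(nx * (d - nx)) * float(ny * (d - ny)))
    return num / den
```

This is why set similarity search over dot products transfers to the correlation setting considered here.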
Lambert W random variables - a new family of generalized skewed distributions with applications to risk estimation
Originating from system theory and an input/output point of view, I
introduce a new class of generalized distributions. A parametric nonlinear
transformation converts a random variable X into a so-called Lambert W
random variable Y, which allows a very flexible approach to model skewed
data. Its shape depends on the shape of X and a skewness parameter γ.
In particular, for symmetric X and nonzero γ the output Y is skewed.
Its distribution and density function are particular variants of their input
counterparts. Maximum likelihood and method of moments estimators are
presented, and simulations show that in the symmetric case additional
estimation of γ does not affect the quality of other parameter
estimates. Applications in finance and biomedicine show the relevance of this
class of distributions, which is particularly useful for slightly skewed data.
A practical by-product of the Lambert W framework: data can be "unskewed."
The R package LambertW (http://cran.r-project.org/web/packages/LambertW)
developed by the author is publicly available on CRAN
(http://cran.r-project.org).
Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org); DOI: http://dx.doi.org/10.1214/11-AOAS457
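The skewing transform can be sketched for the symmetric-input case; this follows the Lambert W construction in spirit, but the standardization and parameter handling here are simplified and illustrative, not the package's implementation:

```python
import numpy as np

def lambert_w_skew(x, gamma):
    # Standardize the input, then apply u -> u * exp(gamma * u):
    # gamma = 0 leaves the data unchanged; gamma != 0 skews it
    # (to the right for gamma > 0, to the left for gamma < 0).
    mu, sigma = x.mean(), x.std()
    u = (x - mu) / sigma
    return u * np.exp(gamma * u) * sigma + mu
```

Inverting this map ("unskewing") requires the Lambert W function itself; the author's R package LambertW provides that inverse.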
PSO-Based Method for SVM Classification on Skewed Data Sets
Support Vector Machines (SVM) have shown excellent generalization power in classification problems. However, on skewed data sets, SVM learns a biased model that degrades classifier performance, and the damage is severe when the imbalance ratio is very large. In this paper, a new external balancing method for applying SVM to skewed data sets is developed. In the first phase of the method, the separating hyperplane is computed. The support vectors are then used to generate the initial population of a PSO algorithm, which is used to improve the population of artificial instances and to eliminate noisy instances. Experimental results demonstrate the ability of the proposed method to improve the performance of SVM on imbalanced data sets.
Proyecto UAEM 3771/2014/CI
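The PSO component is not specified in detail above; the canonical particle update that such a method builds on can be sketched as follows (all names and parameter values are illustrative, not from the paper):

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    # Canonical PSO update: an inertia term plus a random "cognitive"
    # pull toward the particle's own best position and a "social" pull
    # toward the swarm's global best position.
    r1 = rng.random(pos.shape)
    r2 = rng.random(pos.shape)
    new_vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + new_vel, new_vel
```

In the setting described here, each particle would plausibly encode a candidate artificial minority-class instance seeded from the support vectors.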