Low-Rank Boolean Matrix Approximation by Integer Programming
Low-rank approximations of data matrices are an important dimensionality
reduction tool in machine learning and regression analysis. We consider the
case of categorical variables, where it can be formulated as the problem of
finding low-rank approximations to Boolean matrices. In this paper we give
what is, to the best of our knowledge, the first integer programming
formulation that relies on only polynomially many variables and constraints.
We discuss how to solve it computationally and report numerical tests on
synthetic and real-world data.
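The object being approximated here is the Boolean matrix product, in which the usual sum of products is replaced by a logical OR of ANDs. A minimal sketch in NumPy is below; the function names are illustrative, and this shows only the product and the mismatch count being minimized, not the paper's integer programming formulation.

```python
import numpy as np

def boolean_product(U, V):
    """Boolean matrix product: entry (i, j) is the OR over k of U[i, k] AND V[k, j]."""
    return (U.astype(int) @ V.astype(int)) > 0

def boolean_error(X, U, V):
    """Number of entries where X disagrees with the Boolean product of U and V."""
    return int(np.sum(X != boolean_product(U, V)))

# A Boolean matrix of Boolean rank 2, built from two rank-1 "tiles"
U = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=bool)
V = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=bool)
X = boolean_product(U, V)
print(boolean_error(X, U, V))  # → 0: the factorization reconstructs X exactly
```

Note that row 1 of X receives contributions from both tiles; under Boolean arithmetic the overlap is absorbed by the OR rather than double-counted, which is what makes the problem different from standard low-rank approximation.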
Algorithms for Approximate Subtropical Matrix Factorization
Matrix factorization methods are important tools in data mining and analysis.
They can be used for many tasks, ranging from dimensionality reduction to
visualization. In this paper we concentrate on the use of matrix factorizations
for finding patterns from the data. Rather than using the standard algebra --
and the summation of the rank-1 components to build the approximation of the
original matrix -- we use the subtropical algebra, which is an algebra over the
nonnegative real values with the summation replaced by the maximum operator.
Subtropical matrix factorizations allow "winner-takes-all" interpretations
of the rank-1 components, revealing different structure than the normal
(nonnegative) factorizations. We study the complexity and sparsity of the
factorizations, and present a framework for finding low-rank subtropical
factorizations. We present two specific algorithms, called Capricorn and
Cancer, that are part of our framework. They can be used with data that has
been corrupted with different types of noise, and with different error metrics,
including the sum-of-absolute differences, Frobenius norm, and Jensen--Shannon
divergence. Our experiments show that the algorithms perform well on data that
has subtropical structure, and that they can find factorizations that are both
sparse and easy to interpret.
Comment: 40 pages, 9 figures. For the associated source code, see
http://people.mpi-inf.mpg.de/~pmiettin/tropical
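The subtropical (max-times) product the abstract describes can be sketched directly: the summation over rank-1 components is replaced by a maximum, so each entry of the approximation is "claimed" by the single strongest component. A minimal illustration, with an assumed helper name:

```python
import numpy as np

def maxtimes(B, C):
    """Subtropical (max-times) product: result[i, j] = max over k of B[i, k] * C[k, j]."""
    return np.max(B[:, :, None] * C[None, :, :], axis=1)

# Two nonnegative rank-1 components; under max-times, each entry of the
# result comes entirely from whichever component dominates it
B = np.array([[2.0, 0.1],
              [1.0, 3.0]])
C = np.array([[1.0, 0.5],
              [0.2, 2.0]])
A = maxtimes(B, C)
print(A)  # → [[2. 1.] [1. 6.]]
```

This winner-takes-all behaviour is what lets the rank-1 components be read as separate, non-overlapping patterns, in contrast to (nonnegative) sums where components blend.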
Data-driven pattern identification and outlier detection in time series
We address the problem of data-driven pattern identification and outlier
detection in time series. To this end, we use singular value decomposition
(SVD) which is a well-known technique to compute a low-rank approximation for
an arbitrary matrix. By recasting the time series as a matrix it becomes
possible to use SVD to highlight the underlying patterns and periodicities.
This is done without the need for specifying user-defined parameters. From a
data mining perspective, this opens up new ways of analyzing time series in a
data-driven, bottom-up fashion. However, in order to get correct results, it is
important to understand how the SVD-spectrum of a time series is influenced by
various characteristics of the underlying signal and noise. In this paper, we
extend earlier work by initiating a more systematic analysis of these
effects. We then illustrate our findings on real-life data.
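The recasting of a time series as a matrix can be sketched as stacking overlapping windows (a trajectory matrix, as in singular spectrum analysis); a strongly periodic signal then concentrates its energy in a few leading singular values. A minimal sketch, with illustrative names and a synthetic signal standing in for the paper's data:

```python
import numpy as np

def trajectory_matrix(x, window):
    """Recast a 1-D series as a matrix of overlapping windows (one window per row)."""
    n = len(x) - window + 1
    return np.array([x[i:i + window] for i in range(n)])

rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 25) + 0.1 * rng.standard_normal(len(t))

M = trajectory_matrix(series, window=50)
s = np.linalg.svd(M, compute_uv=False)
# A single sinusoid yields a trajectory matrix of numerical rank 2, so the
# two leading singular values carry almost all of the spectrum's energy
energy_ratio = (s[:2] ** 2).sum() / (s ** 2).sum()
```

Points with windows that project poorly onto the leading singular vectors are then natural outlier candidates, which is the connection to outlier detection the abstract draws.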
Data Cube Approximation and Mining using Probabilistic Modeling
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data.
Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique, we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions, and discover possible outliers in data cells. A real-life example will be used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches.
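The first technique can be illustrated on a two-way slice of a cube. The sketch below uses plain Lee-Seung multiplicative-update NMF as a two-way stand-in for the paper's multi-way array factorization; the data and function name are hypothetical.

```python
import numpy as np

def nmf(X, rank, iters=500, eps=1e-9):
    """Nonnegative factorization via multiplicative updates (Frobenius objective)."""
    rng = np.random.default_rng(1)
    W = rng.random((X.shape[0], rank))
    H = rng.random((rank, X.shape[1]))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # update H with W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # update W with H fixed
    return W, H

# Hypothetical aggregate counts, e.g. sales totals by region x product line
X = np.array([[30.0,  2.0,  1.0],
              [28.0,  3.0,  0.0],
              [ 1.0, 20.0, 22.0],
              [ 0.0, 18.0, 25.0]])
W, H = nmf(X, rank=2)
# The superposition of the nonnegative components approximates the cells,
# giving a compressed, approximate answer to queries over the cube
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Because the factors stay nonnegative, each component can be read as an additive block of cells, which is what makes the superposition interpretable for cube exploration.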