Data Cube Approximation and Mining using Probabilistic Modeling
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and at different abstraction levels of a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data.
Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fits the initial data set and whose superposition coincides with the original data; with the second, we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions and discover possible outliers in data cells. A real-life example is used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered approximately, and (iii) illustrate the strengths and limitations of these modeling approaches.
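As a rough illustration of the first technique, the following sketch (assuming NumPy and a small dense 3-way cube; the rank, update rule and iteration count are illustrative choices, not the authors' exact procedure) fits a non-negative CP factorization whose superposition of rank-one components approximates the cube:

    import numpy as np

    def khatri_rao(B, C):
        """Column-wise Kronecker product of B (J x R) and C (K x R)."""
        J, R = B.shape
        K, _ = C.shape
        return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

    def nn_cp(X, rank, n_iter=200, eps=1e-9):
        """Non-negative CP factorization of a 3-way array X (I x J x K)
        via multiplicative updates; returns factors A, B, C such that
        X[i, j, k] is approximated by sum_r A[i, r] * B[j, r] * C[k, r]."""
        I, J, K = X.shape
        rng = np.random.default_rng(0)
        A, B, C = (rng.random((n, rank)) for n in (I, J, K))
        X1 = X.reshape(I, J * K)                      # mode-1 unfolding
        X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)   # mode-2 unfolding
        X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)   # mode-3 unfolding
        for _ in range(n_iter):
            A *= (X1 @ khatri_rao(B, C)) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
            B *= (X2 @ khatri_rao(A, C)) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
            C *= (X3 @ khatri_rao(A, B)) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
        return A, B, C

Each returned component is a rank-one "pattern" over the three dimensions; np.einsum('ir,jr,kr->ijk', A, B, C) rebuilds the compressed approximation of the cube, from which approximate OLAP answers can be read off.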
Diamond Dicing
In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least 100 000 dollars, or on shops having sales greater than 400 000 dollars. However, current systems do not allow both thresholds to be applied simultaneously, selecting only those products and shops that satisfy both. For such purposes, we introduce the diamond cube operator, filling a gap among existing data warehouse operations.
Because of the interaction between dimensions, the computation of diamond cubes is challenging. We compare and test various algorithms on large data sets of more than 100 million facts. We find that while it is possible to implement diamonds in SQL, it is inefficient. Indeed, our custom implementation can be a hundred times faster than popular database engines (including a row-store and a column-store).
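A minimal sketch of the pruning idea behind diamond dicing, in Python (the fact layout, names and thresholds are illustrative, and real implementations work over far larger fact tables): repeatedly delete attribute values whose aggregate falls below the per-dimension threshold, until nothing changes.

    from collections import defaultdict

    def diamond(facts, k_product, k_shop):
        """facts: iterable of (product, shop, revenue) triples. Returns the
        facts whose remaining products each sum to >= k_product and whose
        remaining shops each sum to >= k_shop."""
        facts = list(facts)
        while True:
            by_product, by_shop = defaultdict(float), defaultdict(float)
            for p, s, v in facts:
                by_product[p] += v
                by_shop[s] += v
            kept = [(p, s, v) for p, s, v in facts
                    if by_product[p] >= k_product and by_shop[s] >= k_shop]
            if len(kept) == len(facts):   # fixpoint: nothing pruned this pass
                return kept
            facts = kept

    sales = [("shirt", "oslo", 120_000), ("shoe", "oslo", 450_000),
             ("shoe", "lyon", 80_000), ("hat", "lyon", 30_000)]
    core = diamond(sales, k_product=100_000, k_shop=400_000)
    # keeps only the oslo facts: lyon's total (110 000) misses its threshold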
Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability
Understanding the shopping motivations behind market baskets has high
commercial value in the grocery retail industry. Analyzing shopping
transactions demands techniques that can cope with the volume and
dimensionality of grocery transactional data while keeping interpretable
outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to
process grocery transactions and to discover a broad representation of
customers' shopping motivations. However, summarizing the posterior
distribution of an LDA model is challenging, while individual LDA draws may not
be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures, which may not adequately capture qualitative aspects such as the interpretability and stability of topics.
In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarize the entire posterior distribution and to identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, as well as associated measures of uncertainty. Furthermore, we
establish a more holistic definition for model evaluation, which assesses topic
models based not only on their likelihood but also on their coherence,
distinctiveness and stability. By means of a survey, we set thresholds for the
interpretation of topic coherence and topic similarity in the domain of grocery
retail data. We demonstrate that the selection of recurrent topics through our clustering methodology not only improves model likelihood but also improves qualitative aspects of LDA such as interpretability and stability. We illustrate our methods with an example from a large UK supermarket chain.
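One generic way to realize this kind of posterior summary (a sketch only; the paper's actual clustering methodology, distance threshold and recurrence rule may differ) is to pool the topic-word vectors from all posterior draws, cluster them by cosine distance, and keep clusters that recur across most draws:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    def recurrent_topics(draws, max_cos_dist=0.4, min_recurrence=0.8):
        """draws: list of (n_topics x vocab_size) topic-word arrays, one per
        posterior LDA draw. Returns mean and spread of topic-word vectors for
        clusters appearing in at least `min_recurrence` of the draws."""
        topics = np.vstack(draws)                        # pool all topics
        draw_id = np.repeat(np.arange(len(draws)),
                            [d.shape[0] for d in draws]) # draw of each row
        Z = linkage(pdist(topics, metric="cosine"), method="average")
        labels = fcluster(Z, t=max_cos_dist, criterion="distance")
        means, spreads = [], []
        for c in np.unique(labels):
            members = labels == c
            support = len(np.unique(draw_id[members])) / len(draws)
            if support >= min_recurrence:                # a recurrent topic
                means.append(topics[members].mean(axis=0))
                spreads.append(topics[members].std(axis=0))  # uncertainty
        return np.array(means), np.array(spreads)

The within-cluster mean gives a single posterior summary topic, and the within-cluster spread gives an associated measure of uncertainty.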
Dwarf: A Complete System for Analyzing High-Dimensional Data Sets
The need for data analysis by different industries, including
telecommunications, retail, manufacturing and financial services, has
generated a flurry of research, highly sophisticated methods and
commercial products. However, all current attempts are haunted by the so-called "high-dimensionality curse": space and time complexity increases exponentially with the number of analysis "dimensions". This means that existing approaches are limited to coarse levels of analysis and/or to approximate answers with reduced precision. As the need for detailed analysis keeps increasing, along with the volume and the detail of the data being stored, these approaches are very quickly rendered unusable. I have developed a unique method for efficiently performing analysis that is not affected by the high dimensionality of the data and that scales only polynomially, and almost linearly, with the number of dimensions, without sacrificing any accuracy in the returned results. I have implemented a complete system (called "Dwarf") and performed an extensive experimental evaluation that demonstrated tremendous improvements over existing methods in all aspects of analysis: initial computation, storage, querying and updating.
I have extended my research to the "data-streaming" model, where updates arrive on-line, which complicates any concurrent analysis but has a very high impact on applications like security, network management/monitoring, router traffic control and sensor networks. I have devised streaming algorithms that provide complex statistics within user-specified relative-error bounds over a data stream. I introduced the class of "distinct implicated statistics", which is much more general than the established class of "distinct count" statistics. The latter has proved invaluable in applications such as analyzing and monitoring the distinct count of species in a population, or even in query optimization. The "distinct implicated statistics" class provides invaluable information about the correlations in the stream and is necessary for applications such as security. My algorithms are designed to use bounded amounts of memory and processing, so that they can even be implemented in hardware for resource-limited environments such as network routers or sensors, and also to work in "noisy" environments, where some data may be flawed, either implicitly due to the extraction process or explicitly.
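For the established "distinct count" class that these statistics generalize, a standard bounded-memory technique is a K-Minimum-Values estimator; the sketch below (a textbook baseline, not the author's "distinct implicated statistics" algorithms; the hash choice and k are illustrative) keeps only the k smallest hash values, so memory stays fixed no matter how long the stream runs.

    import hashlib
    import heapq

    class KMVDistinctCount:
        """Estimate the number of distinct items in a stream from the
        k smallest hash values (hashes mapped uniformly into [0, 1))."""

        def __init__(self, k=256):
            self.k = k
            self.heap = []      # max-heap via negation: the k smallest hashes
            self.kept = set()   # hash values currently retained

        def add(self, item):
            digest = hashlib.sha1(str(item).encode()).digest()[:8]
            h = int.from_bytes(digest, "big") / 2**64   # hash into [0, 1)
            if h in self.kept:
                return
            if len(self.heap) < self.k:
                heapq.heappush(self.heap, -h)
                self.kept.add(h)
            elif h < -self.heap[0]:                     # beats current k-th minimum
                evicted = -heapq.heapreplace(self.heap, -h)
                self.kept.discard(evicted)
                self.kept.add(h)

        def estimate(self):
            if len(self.heap) < self.k:                 # fewer than k distinct seen
                return len(self.heap)
            return (self.k - 1) / -self.heap[0]         # (k-1) / k-th smallest hash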
The Dwarf Data Cube Eliminates the High Dimensionality Curse
The data cube operator encapsulates all possible groupings of a data set and has proved to be an invaluable tool in analyzing vast amounts of data. However, its apparent exponential complexity has significantly limited its applicability to low-dimensional datasets. Recently, the dwarf data cube model was introduced and shown to yield high-dimensional "dwarf data cubes" that are orders of magnitude smaller than the original data cubes, even when they calculate and store every possible aggregation with 100% precision.
In this paper we present a surprising analytical result proving that the size of dwarf cubes grows polynomially with the dimensionality of the data set and that, therefore, a full data cube at 100% precision is not inherently cursed by high dimensionality. This striking result of polynomial complexity reformulates the context of cube management and redefines most of the problems associated with data warehousing and On-Line Analytical Processing. We also develop an efficient algorithm for estimating the size of dwarf data cubes before actually computing them. Finally, we complement our analytical approach with an experimental evaluation using real and synthetic data sets, and demonstrate our results.
UMIACS-TR-2003-12
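A toy sketch of the suffix-coalescing idea at the heart of the dwarf model (it illustrates only the hash-consing of structurally identical sub-tries; a real dwarf cube also stores ALL cells and aggregate values):

    def coalesced_node_count(facts):
        """facts: list of equal-length tuples of dimension values. Builds a
        trie over the facts while hash-consing structurally identical
        sub-tries, and returns how many distinct nodes are actually stored."""
        interned = {}                       # canonical sub-trie -> node id

        def build(suffixes):
            if not suffixes or not suffixes[0]:
                key = ()                    # leaf level
            else:
                children = {}
                for s in suffixes:
                    children.setdefault(s[0], []).append(s[1:])
                key = tuple(sorted((value, build(rest))
                                   for value, rest in children.items()))
            # identical sub-tries map to the same key, hence the same node
            return interned.setdefault(key, len(interned))

        build([tuple(f) for f in facts])
        return len(interned)

    facts = [("s1", "c1", "p1"), ("s1", "c1", "p2"),
             ("s2", "c1", "p1"), ("s2", "c1", "p2")]
    print(coalesced_node_count(facts))  # identical c1 sub-tries are stored once

The more redundancy the dimension values exhibit, the more sub-tries coalesce, which is the intuition behind the sub-exponential growth analyzed in the paper.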
Entropies from coarse-graining: convex polytopes vs. ellipsoids
We examine the Boltzmann/Gibbs/Shannon entropy $S_{BGS}$, the non-additive Havrda-Charvát/Daróczy/Cressie-Read/Tsallis entropy $S_q$ and the Kaniadakis $\kappa$-entropy $S_\kappa$ from the viewpoint of coarse-graining, symplectic capacities and convexity. We
argue that the functional form of such entropies can be ascribed to a
discordance in phase-space coarse-graining between two generally different
approaches: the Euclidean/Riemannian metric one that reflects independence and
picks cubes as the fundamental cells and the symplectic/canonical one that
picks spheres/ellipsoids for this role. Our discussion is motivated by and
confined to the behaviour of Hamiltonian systems of many degrees of freedom. We
see that Dvoretzky's theorem provides asymptotic estimates for the minimal
dimension beyond which these two approaches are close to each other. We state
and speculate about the role that dualities may play in this viewpoint.
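For reference, the standard forms of the three entropies named above, in LaTeX (constants and sign conventions vary across the literature):

    S_{BGS}[p] = -k_B \sum_i p_i \ln p_i
    \qquad
    S_q[p] = k_B \, \frac{1 - \sum_i p_i^q}{q - 1}
    \qquad
    S_\kappa[p] = -k_B \sum_i p_i \ln_\kappa p_i ,
    \quad \text{with } \ln_\kappa x = \frac{x^\kappa - x^{-\kappa}}{2\kappa}

In the limits $q \to 1$ and $\kappa \to 0$, $S_q$ and $S_\kappa$ both reduce to $S_{BGS}$.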