Maximum Entropy Based Significance of Itemsets
We consider the problem of defining the significance of an itemset. We say
that the itemset is significant if we are surprised by its frequency when
compared to the frequencies of its sub-itemsets. In other words, we estimate
the frequency of the itemset from the frequencies of its sub-itemsets and
compute the deviation between the real value and the estimate. For the
estimation we use Maximum Entropy and for measuring the deviation we use
Kullback-Leibler divergence.
A major advantage compared to the previous methods is that we are able to use
richer models whereas the previous approaches only measure the deviation from
the independence model.
We show that our measure of significance goes to zero for derivable itemsets
and that we can use the rank as a statistical test. Our empirical results
demonstrate that for our real datasets the independence assumption is too
strong, but applying more flexible models leads to good results.
Comment: Journal version. The previous version is the conference paper.
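As a rough illustration of the deviation measure, the hedged Python sketch below scores an itemset by the Kullback-Leibler divergence between its observed frequency and a baseline estimate. For brevity it uses the independence model as the baseline, whereas the abstract's point is to replace it with a richer maximum-entropy estimate; the binary 0/1 data layout and the scaling by the number of rows are illustrative assumptions.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def itemset_significance(data, itemset):
    """Deviation of the observed itemset frequency from a baseline built from
    sub-itemset frequencies.  Here the baseline is the independence model
    (product of singleton frequencies); the paper replaces it with a richer
    maximum-entropy estimate."""
    n = data.shape[0]
    observed = np.mean(np.all(data[:, list(itemset)] == 1, axis=1))
    estimate = np.prod(np.mean(data[:, list(itemset)] == 1, axis=0))
    # scaling by n turns the divergence into a likelihood-ratio style score
    return n * kl_bernoulli(observed, estimate)

# toy usage: attributes 0 and 1 are dependent, 2 and 3 are not
rng = np.random.default_rng(0)
data = (rng.random((1000, 4)) < 0.3).astype(int)
data[:, 1] = data[:, 0]
print(itemset_significance(data, [0, 1]))   # large: surprising under independence
print(itemset_significance(data, [2, 3]))   # near zero: well explained
```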
Ranking Episodes using a Partition Model
One of the biggest setbacks in traditional frequent pattern mining is that
overwhelmingly many of the discovered patterns are redundant. A prototypical
example of such redundancy is a freerider pattern where the pattern contains a
true pattern and some additional noise events. A technique for filtering
freerider patterns that has proved to be efficient in ranking itemsets is to
use a partition model where a pattern is divided into two subpatterns and the
observed support is compared to the expected support under the assumption that
these two subpatterns occur independently.
In this paper we develop a partition model for episodes, patterns discovered
from sequential data. An episode is essentially a set of events, with possible
restrictions on the order of events. Unlike with itemset mining, computing the
expected support of an episode requires surprisingly sophisticated methods. In
order to construct the model, we partition the episode into two subepisodes. We
then model how likely the events in each subepisode occur close to each other.
If this probability is high---which is often the case if the subepisode has a
high support---then we can expect that when one event from a subepisode occurs,
then the remaining events occur also close by. This approach increases the
expected support of the episode, and if this increase explains the observed
support, then we can deem the episode uninteresting. We demonstrate in our
experiments that using the partition model can effectively and efficiently
reduce the redundancy in episodes.
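The partition-model idea for itemsets that the abstract builds on can be sketched in a few lines: split the pattern into two disjoint sub-patterns, compute the expected support under the assumption that they occur independently, and check how well the best split explains the observed support. The sketch below (plain NumPy over a binary matrix, names hypothetical) shows only this itemset-level baseline; the episode version in the paper additionally has to model how close in time the events of each sub-episode occur.

```python
import numpy as np
from itertools import combinations

def support(data, items):
    """Number of rows of a binary 0/1 matrix containing every item in `items`."""
    return int(np.all(data[:, list(items)] == 1, axis=1).sum())

def best_partition_ratio(data, itemset):
    """Smallest ratio of observed to expected support over all two-way splits
    of the itemset, where the expected support assumes the two sub-itemsets
    occur independently.  A ratio close to 1 means some split already explains
    the support, suggesting a freerider pattern."""
    n = data.shape[0]
    observed = support(data, itemset)
    items = list(itemset)
    best = float("inf")
    for r in range(1, len(items)):
        for left in combinations(items, r):
            right = [i for i in items if i not in left]
            expected = support(data, left) * support(data, right) / n
            if expected > 0:
                best = min(best, observed / expected)
    return best

# toy usage: item 2 is a freerider riding on the dependent pair (0, 1)
rng = np.random.default_rng(0)
data = (rng.random((2000, 3)) < 0.4).astype(int)
data[:, 1] = data[:, 0]
print(best_partition_ratio(data, [0, 1]))      # well above 1: genuinely dependent
print(best_partition_ratio(data, [0, 1, 2]))   # near 1: split {0,1} | {2} explains it
```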
Safe projections of binary data sets
Selectivity estimation of a boolean query based on frequent itemsets can be
solved by describing the problem by a linear program. However, the number of
variables in the equations is exponential, rendering the approach tractable
only for small-dimensional cases. One natural approach would be to project the
data to the variables occurring in the query. This can, however, change the
outcome of the linear program.
We introduce the concept of safe sets: projecting the data to a safe set does
not change the outcome of the linear program. We characterise safe sets using
graph theoretic concepts and give an algorithm for finding minimal safe sets
containing given attributes. We describe a heuristic algorithm for finding
almost-safe sets given a size restriction, and show empirically that these sets
outperform the trivial projection.
We also show a connection between safe sets and Markov Random Fields and use
it to further reduce the number of variables in the linear program, given some
regularity assumptions on the frequent itemsets.
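As a small, hedged sketch of the underlying selectivity-estimation problem (not of the safe-set machinery itself), the snippet below bounds the selectivity of a conjunctive query with a linear program whose variables are the probabilities of all 2^d binary rows; the exponential number of variables is exactly what makes projections necessary. Function names and the toy frequencies are illustrative, and scipy is assumed available.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def selectivity_bounds(d, itemset_freqs, query):
    """Lower and upper bounds on the selectivity of a conjunctive query over d
    binary attributes, given the frequencies of some itemsets.  The LP has one
    variable per binary vector (2^d of them), so it is feasible only for small
    d -- the motivation for projecting to a smaller, safe attribute set."""
    vectors = list(product([0, 1], repeat=d))
    # equality constraints: total probability mass is 1, plus one per itemset
    A_eq = [np.ones(len(vectors))]
    b_eq = [1.0]
    for items, freq in itemset_freqs.items():
        A_eq.append([float(all(v[i] == 1 for i in items)) for v in vectors])
        b_eq.append(freq)
    # objective: total mass of vectors satisfying the query
    c = np.array([float(all(v[i] == 1 for i in query)) for v in vectors])
    lo = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
    hi = linprog(-c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
    return lo.fun, -hi.fun

# toy usage: three attributes with known singleton and pair frequencies
freqs = {(0,): 0.6, (1,): 0.5, (2,): 0.4, (0, 1): 0.4}
print(selectivity_bounds(3, freqs, query=(0, 1, 2)))
```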
Discovering bursts revisited: guaranteed optimization of the model parameters
One of the classic data mining tasks is to discover bursts, time intervals,
where events occur at abnormally high rate. In this paper we revisit
Kleinberg's seminal work, where bursts are discovered by using exponential
distribution with a varying rate parameter: the regions where it is more
advantageous to set the rate higher are deemed bursty. The model depends on two
parameters, the initial rate and the change rate. The initial rate, that is,
the rate used when there is no burstiness, is set to the average rate over the
whole sequence. The change rate is provided by the user.
We argue that these choices are suboptimal: they lead to a worse likelihood and
may cause some existing bursts to be missed. We propose an alternative problem
setting, where the model parameters are selected by optimizing the likelihood
of the model. While this tweak is trivial from the problem definition point of
view, this changes the optimization problem greatly. To solve the problem in
practice, we propose efficient approximation schemes with provable guarantees.
Finally, we demonstrate empirically that with this setting we are able to
discover bursts that would otherwise have gone undetected.
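For concreteness, the sketch below implements the fixed-parameter two-state baseline that the abstract revisits: the base rate is the average rate, the burst rate is s times higher, and entering the burst state costs gamma * ln(n). The paper's contribution is precisely to replace these fixed choices with likelihood-optimized parameters, which this sketch does not do; the parameter values and the toy gaps are illustrative.

```python
import numpy as np

def kleinberg_two_state(gaps, s=2.0, gamma=1.0):
    """Label inter-event gaps as bursty (1) or baseline (0) using a two-state
    version of Kleinberg's exponential model, solved with a Viterbi-style
    dynamic program over the state sequence."""
    gaps = np.asarray(gaps, dtype=float)
    n = len(gaps)
    base = n / gaps.sum()
    rates = np.array([base, s * base])      # baseline rate and burst rate
    enter_cost = gamma * np.log(n)          # cost of switching into the burst state
    # -log density of an exponential gap x under rate lam:  lam * x - log(lam)
    emit = gaps[:, None] * rates[None, :] - np.log(rates)[None, :]

    cost = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    cost[0] = emit[0] + np.array([0.0, enter_cost])
    for t in range(1, n):
        for j in (0, 1):
            step = [cost[t - 1, i] + (enter_cost if j > i else 0.0) for i in (0, 1)]
            back[t, j] = int(np.argmin(step))
            cost[t, j] = emit[t, j] + min(step)

    # backtrack the cheapest state sequence
    states = np.zeros(n, dtype=int)
    states[-1] = int(np.argmin(cost[-1]))
    for t in range(n - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states

# toy usage: sparse events with a dense run of short gaps in the middle
gaps = [5, 6, 5, 1, 1, 1, 1, 6, 5, 6]
print(kleinberg_two_state(gaps, s=3.0, gamma=0.5))   # expect 1s around the short gaps
```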
Itemsets for Real-valued Datasets
Pattern mining is one of the most well-studied subfields in exploratory data
analysis. While there is a significant amount of literature on how to discover
and rank itemsets efficiently from binary data, there is surprisingly little
research done in mining patterns from real-valued data. In this paper we
propose a family of quality scores for real-valued itemsets. We approach the
problem by casting the dataset into binary data and computing the support from
this binarised data. This naive approach requires us to select thresholds.
To remedy this, instead of selecting one set of thresholds, we treat thresholds
as random variables and compute the average support. We show that we can
compute this support efficiently, and we also introduce two normalisations,
namely comparing the support against the independence assumption and, more
generally, against the partition assumption. Our experimental evaluation
demonstrates that we can discover statistically significant patterns
efficiently.
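A minimal sketch of the threshold-averaging idea, under the simplifying assumption that every attribute is scaled to [0, 1] and its threshold is drawn independently and uniformly from [0, 1]: then P(x >= t) = x, so the average support of an itemset is the mean over rows of the product of the attribute values, and no thresholds need to be enumerated. The paper's actual score and its partition-based normalisation are more general; the comparison against independence below is the simpler of the two normalisations mentioned.

```python
import numpy as np

def average_support(data, itemset):
    """Average support over random thresholds, assuming each attribute lies in
    [0, 1] and its threshold is uniform on [0, 1], independent of the others.
    Under that assumption P(x >= t) = x, so the expected indicator that a row
    supports the itemset is simply the product of its values."""
    return float(np.mean(np.prod(data[:, list(itemset)], axis=1)))

def lift_against_independence(data, itemset):
    """Normalise the average support by the independence assumption."""
    expected = float(np.prod([np.mean(data[:, i]) for i in itemset]))
    return average_support(data, itemset) / expected

# toy usage: attributes 0 and 1 are correlated, attribute 2 is independent
rng = np.random.default_rng(1)
x = rng.random((1000, 1))
data = np.hstack([x,
                  np.clip(x + 0.1 * rng.standard_normal((1000, 1)), 0, 1),
                  rng.random((1000, 1))])
print(lift_against_independence(data, (0, 1)))   # clearly above 1
print(lift_against_independence(data, (0, 2)))   # close to 1
```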
Are your Items in Order?
Items in many datasets can be arranged into a natural order. Such orders are
useful since they can provide new knowledge about the data and may ease further
data exploration and visualization. Our goal in this paper is to define a
statistically well-founded and an objective score measuring the quality of an
order. Such a measure can be used for determining whether the current order
carries any valuable information or whether it can be discarded.
Intuitively, we say that the order is good if dependent attributes are close
to each other. To define the order score we fit an order-sensitive model to the
dataset. Our model resembles a Markov chain model, that is, each attribute
depends only on its immediate neighbors. The score of the order is the BIC score
of the best model. For computing the measure we introduce a fast dynamic
program. The score is then compared against random orders: if it is better than
the scores of the random orders, we say that the order is good. We also show
the asymptotic connection between the score function and the number of free
parameters of the model. In addition, we introduce a simple greedy approach for
finding an order with a good score. We evaluate the score for synthetic and
real datasets using different spectral orders and the orders obtained with the
greedy method.
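The sketch below is a simplified stand-in for the model in the abstract: it scores an attribute order on binary data with a chain model in which each attribute depends only on its immediate predecessor, and then compares the BIC score of the given order against random permutations. The paper's model, dynamic program and asymptotic analysis are richer; all names here are hypothetical.

```python
import numpy as np

def chain_bic(data, order, eps=1e-9):
    """BIC score of a chain model over binary attributes in the given order:
    each attribute depends only on its immediate predecessor."""
    n = data.shape[0]
    cols = data[:, list(order)]
    # first attribute: one Bernoulli parameter
    p = cols[:, 0].mean()
    loglik = np.sum(np.log(np.where(cols[:, 0] == 1, p, 1 - p) + eps))
    params = 1
    # remaining attributes: two conditional Bernoulli parameters each
    for i in range(1, cols.shape[1]):
        for parent_value in (0, 1):
            mask = cols[:, i - 1] == parent_value
            if mask.sum() == 0:
                continue
            q = cols[mask, i].mean()
            loglik += np.sum(np.log(np.where(cols[mask, i] == 1, q, 1 - q) + eps))
        params += 2
    return loglik - 0.5 * params * np.log(n)

def order_is_good(data, order, trials=100, seed=0):
    """Score of `order` and the fraction of random orders it beats."""
    rng = np.random.default_rng(seed)
    score = chain_bic(data, order)
    random_scores = [chain_bic(data, rng.permutation(len(order))) for _ in range(trials)]
    return score, float(np.mean(score > np.array(random_scores)))

# toy usage: a chain-structured binary data set favours its generating order
rng = np.random.default_rng(3)
d, n = 6, 2000
data = np.zeros((n, d), dtype=int)
data[:, 0] = rng.random(n) < 0.5
for i in range(1, d):
    flip = rng.random(n) < 0.1
    data[:, i] = np.where(flip, 1 - data[:, i - 1], data[:, i - 1])
print(order_is_good(data, list(range(d))))       # beats almost all random orders
print(order_is_good(data, [0, 3, 1, 5, 2, 4]))   # a scrambled order scores lower
```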
Density-friendly Graph Decomposition
Decomposing a graph into a hierarchical structure via k-core analysis is a
standard operation in any modern graph-mining toolkit. k-core decomposition
is a simple and efficient method that allows us to analyze a graph beyond its
mere degree distribution. More specifically, it is used to identify areas in
the graph of increasing centrality and connectedness, and it allows us to
reveal the structural organization of the graph.
Despite the fact that k-core analysis relies on vertex degrees, k-cores
do not satisfy a certain, rather natural, density property. Simply put, the
most central k-core is not necessarily the densest subgraph. This
inconsistency between k-cores and graph density provides the basis of our
study.
We start by defining what it means for a subgraph to be locally-dense, and we
show that our definition entails a nested chain decomposition of the graph,
similar to the one given by k-cores, but in this case the components are
arranged in order of increasing density. We show that such a locally-dense
decomposition for a graph can be computed in polynomial time. The exact
decomposition algorithm has a polynomial worst-case running time but is
significantly faster in practice. In addition, we develop a linear-time
algorithm that provides a factor-2 approximation to the optimal locally-dense
decomposition. Furthermore, we show that the k-core decomposition is also a
factor-2 approximation; however, as demonstrated by our experimental
evaluation, in practice k-cores have a different structure than locally-dense
subgraphs, and, as predicted by the theory, k-cores are not always
well-aligned with graph density.
Comment: Journal version of the conference version.
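To make the density mismatch concrete, the hedged sketch below (using networkx, with illustrative names and a toy graph) computes the density of the innermost k-core and compares it with the value found by classic minimum-degree peeling, which is a factor-2 approximation of the densest subgraph. This illustrates the phenomenon the abstract describes, not the paper's exact locally-dense decomposition algorithm.

```python
import networkx as nx

def density(G):
    """Average-degree style density |E| / |V| used in densest-subgraph work."""
    return G.number_of_edges() / max(G.number_of_nodes(), 1)

def innermost_core_density(G):
    """Core index and density of the maximum k-core."""
    k_max = max(nx.core_number(G).values())
    return k_max, density(nx.k_core(G, k_max))

def greedy_densest(G):
    """Peeling baseline: repeatedly remove a minimum-degree vertex and keep the
    densest intermediate subgraph (a factor-2 approximation of the densest
    subgraph)."""
    H = G.copy()
    best = density(H)
    while H.number_of_nodes() > 1:
        v = min(H.nodes, key=H.degree)
        H.remove_node(v)
        best = max(best, density(H))
    return best

# toy usage: a dense clique attached to a sparse ring
G = nx.compose(nx.complete_graph(6), nx.cycle_graph(range(6, 30)))
G.add_edge(0, 6)
print(innermost_core_density(G))   # the innermost core is the clique
print(greedy_densest(G))           # at least half of the optimal density
```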
Discovering Bands from Graphs
Discovering the underlying structure of a given graph is one of the
fundamental goals in graph mining. Given a graph, we can often order vertices
in a way that neighboring vertices have a higher probability of being connected
to each other. This implies that the edges form a band around the diagonal in
the adjacency matrix. Such structure may rise for example if the graph was
created over time: each vertex had an active time interval during which the
vertex was connected with other active vertices.
The goal of this paper is to model this phenomenon. To this end, we formulate
an optimization problem: given a graph and an integer k, we want to order the
graph vertices and partition the ordered adjacency matrix into k bands such
that bands closer to the diagonal are denser. We measure the goodness of a
segmentation using the log-likelihood of a log-linear model, a flexible family
of distributions containing many standard distributions. We divide the problem
into two subproblems: finding the order and finding the bands. We show that
discovering bands can be done in polynomial time with isotonic regression, and
we also introduce a heuristic iterative approach. For discovering the order we
use the Fiedler order accompanied by a simple combinatorial refinement. We
demonstrate empirically that our heuristic works well in practice.
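The spectral starting point mentioned in the abstract can be sketched briefly: order the vertices by the Fiedler vector of the graph Laplacian and then inspect how edge density decays with distance from the diagonal for some hypothetical band boundaries. The band boundaries, helper names and toy graph below are illustrative; the paper's segmentation uses a log-linear model with isotonic regression and a combinatorial refinement of the order.

```python
import numpy as np
import networkx as nx

def fiedler_order(G):
    """Order vertices by the Fiedler vector, i.e. the eigenvector of the second
    smallest eigenvalue of the graph Laplacian (assumes G is connected)."""
    nodes = list(G.nodes)
    L = nx.laplacian_matrix(G, nodelist=nodes).toarray().astype(float)
    _, eigvecs = np.linalg.eigh(L)
    return [nodes[i] for i in np.argsort(eigvecs[:, 1])]

def band_densities(G, order, boundaries):
    """Fraction of vertex pairs that are edges, grouped into bands by the
    distance of the pair from the diagonal of the ordered adjacency matrix.
    `boundaries` are hypothetical upper limits on that distance."""
    pos = {v: i for i, v in enumerate(order)}
    counts = np.zeros(len(boundaries))
    totals = np.zeros(len(boundaries))
    for i, u in enumerate(order):
        for v in order[i + 1:]:
            band = int(np.searchsorted(boundaries, abs(pos[u] - pos[v])))
            if band < len(boundaries):
                totals[band] += 1
                counts[band] += G.has_edge(u, v)
    return counts / np.maximum(totals, 1)

# toy usage: a small-world graph whose edges concentrate near the diagonal
G = nx.connected_watts_strogatz_graph(60, 6, 0.05, seed=0)
order = fiedler_order(G)
print(band_densities(G, order, boundaries=[5, 15, 60]))   # should roughly decrease
```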
Distances between Data Sets Based on Summary Statistics
The concepts of similarity and distance are crucial in data mining. We
consider the problem of defining the distance between two data sets by
comparing summary statistics computed from the data sets. The initial
definition of our distance is based on geometrical notions of certain sets of
distributions. We show that this distance can be computed in cubic time and
that it has several intuitive properties. We also show that this distance is
the unique Mahalanobis distance satisfying certain assumptions. We further
demonstrate that if we are dealing with binary data sets, then the distance can
be represented naturally by certain parity functions, and that it can be
evaluated in linear time. Our empirical tests with real-world data show that
the distance works well.
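A rough sketch of the general shape of such a distance: summarise each data set by the means of a fixed set of feature functions and compare the two mean vectors with a Mahalanobis distance. The covariance used below is simply estimated from the pooled data, and the binary features are just attribute means and one conjunction; the paper derives the specific (and unique) Mahalanobis form from its assumptions and uses parity functions for the linear-time binary case.

```python
import numpy as np

def summary_distance(X, Y, features):
    """Mahalanobis-style distance between two data sets, computed only from
    the means of a fixed set of feature functions; the covariance is estimated
    from the pooled data."""
    FX = np.array([[f(row) for f in features] for row in X], dtype=float)
    FY = np.array([[f(row) for f in features] for row in Y], dtype=float)
    diff = FX.mean(axis=0) - FY.mean(axis=0)
    cov = np.cov(np.vstack([FX, FY]).T) + 1e-9 * np.eye(len(features))
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# toy usage on binary data: attribute means and one pairwise conjunction
rng = np.random.default_rng(2)
X = (rng.random((500, 3)) < 0.5).astype(int)
Y = (rng.random((500, 3)) < 0.7).astype(int)
features = [lambda r: r[0], lambda r: r[1], lambda r: r[2],
            lambda r: r[0] * r[1]]
print(summary_distance(X, X[:250], features))   # small: same distribution
print(summary_distance(X, Y, features))         # larger: different distributions
```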
Mining Closed Strict Episodes
Discovering patterns in a sequence is an important aspect of data mining. One
popular choice of such patterns is episodes: patterns in sequential data
describing events that often occur in the vicinity of each other. Episodes can
also constrain the order in which the events are allowed to occur.
In this work we introduce a technique for discovering closed episodes.
Adapting existing approaches for discovering traditional patterns, such as
closed itemsets, to episodes is not straightforward. First of all, we cannot
define a unique closure based on frequency because an episode may have several
closed superepisodes. Moreover, to define a closedness concept for episodes we
need a subset relationship between episodes, which is not trivial to define.
We approach these problems by introducing strict episodes. We argue that this
class is general enough, and at the same time we are able to define a natural
subset relationship within it and use it efficiently. In order to mine closed
episodes we define an auxiliary closure operator. We show that this closure
satisfies the needed properties so that we can use the existing framework for
mining closed patterns. Discovering the true closed episodes can be done as a
post-processing step. We combine these observations into an efficient mining
algorithm and demonstrate empirically its performance in practice.
Comment: Journal version. The previous version is the conference version.
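For contrast with the episode setting, the abstract's reference point, the unique frequency-based closure of an itemset, can be sketched in a few lines: the closure is the set of items present in every transaction that contains the itemset. The sketch below (binary matrix, illustrative names) shows why the itemset case is easy, which is precisely the property that fails to carry over to episodes.

```python
import numpy as np

def itemset_closure(data, itemset):
    """Classical closure of an itemset over a binary 0/1 matrix: all items that
    occur in every transaction containing the itemset.  For itemsets this
    closure is unique."""
    rows = np.all(data[:, list(itemset)] == 1, axis=1)
    if not rows.any():
        return set(itemset)
    return {int(i) for i in np.flatnonzero(np.all(data[rows] == 1, axis=0))}

# toy usage: item 2 appears in every transaction that contains item 0
data = np.array([[1, 0, 1],
                 [1, 1, 1],
                 [0, 1, 0],
                 [1, 0, 1]])
print(itemset_closure(data, [0]))   # {0, 2}
```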