Maximum Entropy Based Significance of Itemsets
We consider the problem of defining the significance of an itemset. We say
that the itemset is significant if we are surprised by its frequency when
compared to the frequencies of its sub-itemsets. In other words, we estimate
the frequency of the itemset from the frequencies of its sub-itemsets and
compute the deviation between the real value and the estimate. For the
estimation we use Maximum Entropy and for measuring the deviation we use
Kullback-Leibler divergence.
A major advantage over previous methods is that we are able to use richer
models, whereas earlier approaches only measure the deviation from the
independence model.
We show that our measure of significance goes to zero for derivable itemsets
and that we can use the rank as a statistical test. Our empirical results
demonstrate that for our real datasets the independence assumption is too
strong, but applying more flexible models leads to good results. Comment: Journal version; the previous version is the conference paper.
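The deviation measure described above can be illustrated with a minimal sketch: the Kullback-Leibler divergence between the observed frequency of an itemset and an estimate derived from its sub-itemsets. For simplicity the sketch uses the independence estimate, the baseline the abstract contrasts against, rather than a full maximum entropy model; all frequencies are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence between two Bernoulli distributions
    with success probabilities p (observed) and q (estimated)."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

# Hypothetical frequencies of the itemset {a, b} and its sub-itemsets.
fr_ab, fr_a, fr_b = 0.30, 0.50, 0.40

# Independence estimate (the simplest baseline; a maximum entropy
# estimate would generalize this to richer sub-itemset families).
estimate = fr_a * fr_b  # 0.2

significance = kl_divergence(fr_ab, estimate)
print(significance)  # positive: the observed frequency is surprising
```

When the observed frequency matches the estimate exactly, the divergence is zero, matching the abstract's claim that the measure vanishes for derivable itemsets.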
Distances between Data Sets Based on Summary Statistics
The concepts of similarity and distance are crucial in data mining. We
consider the problem of defining the distance between two data sets by
comparing summary statistics computed from the data sets. The initial
definition of our distance is based on geometrical notions of certain sets of
distributions. We show that this distance can be computed in cubic time and
that it has several intuitive properties. We also show that this distance is
the unique Mahalanobis distance satisfying certain assumptions. We also
demonstrate that if we are dealing with binary data sets, then the distance can
be represented naturally by certain parity functions, and that it can be
evaluated in linear time. Our empirical tests with real world data show that
the distance works well.
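A minimal sketch of the Mahalanobis form this distance takes: two data sets are compared through low-dimensional summary-statistic vectors under an assumed covariance. The statistics and covariance below are hypothetical, and the 2x2 case is hand-expanded; this is an illustration of the distance family, not the paper's derivation.

```python
import math

def mahalanobis_2d(s, t, cov):
    """Mahalanobis distance between two 2-dimensional summary-statistic
    vectors s and t under a 2x2 covariance matrix cov."""
    dx, dy = s[0] - t[0], s[1] - t[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Quadratic form diff^T cov^{-1} diff, with the 2x2 inverse expanded.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# Hypothetical summary statistics (e.g. attribute means) computed from
# two data sets, with an assumed diagonal covariance.
stats_1 = (0.4, 0.7)
stats_2 = (0.5, 0.6)
cov = ((0.04, 0.0), (0.0, 0.01))

print(mahalanobis_2d(stats_1, stats_2, cov))
```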
On Rigid, Hard and Soft Problems and Results in Arithmetic Geometry
Rigid, hard and soft problems and results in arithmetic geometry are
presented. "Soft" and "hard" in our paper are limited to the framework of
solutions of quadratic forms over rings of integers of local and global fields,
and to the Hardy-Littlewood-Kloosterman method. Next we consider the notion of
rigidity. Within this framework we give a review of some novel results in the area. Comment: 6 pages
Discovering bursts revisited: guaranteed optimization of the model parameters
One of the classic data mining tasks is to discover bursts: time intervals
where events occur at an abnormally high rate. In this paper we revisit
Kleinberg's seminal work, where bursts are discovered by using exponential
distribution with a varying rate parameter: the regions where it is more
advantageous to set the rate higher are deemed bursty. The model depends on two
parameters, the initial rate and the change rate. The initial rate, that is,
the rate used when there is no burstiness, was set to the average rate
over the whole sequence. The change rate is provided by the user.
We argue that these choices are suboptimal: they lead to a worse likelihood and
may lead to missing some existing bursts. We propose an alternative problem
setting, where the model parameters are selected by optimizing the likelihood
of the model. While this tweak is trivial from the problem definition point of
view, this changes the optimization problem greatly. To solve the problem in
practice, we propose efficient approximation schemes. Finally,
we demonstrate empirically that with this setting we are able to discover
bursts that would otherwise have gone undetected.
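The core argument can be sketched numerically: under an exponential model for inter-event gaps, fitting a maximum-likelihood rate per segment can never yield a lower likelihood than the single average rate, because the flat model is a special case of the segmented one. The gap sequence below is hypothetical.

```python
import math

def exp_log_likelihood(gaps, rate):
    """Log-likelihood of inter-event gaps under an exponential model
    with the given rate parameter."""
    return sum(math.log(rate) - rate * g for g in gaps)

# Hypothetical inter-arrival gaps: a quiet stretch, a burst, then quiet.
gaps = [1.0, 1.2, 0.9, 0.1, 0.1, 0.2, 1.1]

# Baseline: a single rate fixed to the average over the whole sequence.
avg_rate = len(gaps) / sum(gaps)
flat_ll = exp_log_likelihood(gaps, avg_rate)

# Fitting a separate maximum-likelihood rate per segment can only
# increase the total likelihood.
quiet, burst = gaps[:3] + gaps[6:], gaps[3:6]
split_ll = (exp_log_likelihood(quiet, len(quiet) / sum(quiet))
            + exp_log_likelihood(burst, len(burst) / sum(burst)))
print(split_ll >= flat_ll)  # True
```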
Ranking Episodes using a Partition Model
One of the biggest setbacks in traditional frequent pattern mining is that
overwhelmingly many of the discovered patterns are redundant. A prototypical
example of such redundancy is a freerider pattern where the pattern contains a
true pattern and some additional noise events. A technique for filtering
freerider patterns that has proved to be efficient in ranking itemsets is to
use a partition model where a pattern is divided into two subpatterns and the
observed support is compared to the expected support under the assumption that
these two subpatterns occur independently.
In this paper we develop a partition model for episodes, patterns discovered
from sequential data. An episode is essentially a set of events, with possible
restrictions on the order of events. Unlike with itemset mining, computing the
expected support of an episode requires surprisingly sophisticated methods. In
order to construct the model, we partition the episode into two subepisodes. We
then model how likely the events in each subepisode occur close to each other.
If this probability is high---which is often the case if the subepisode has a
high support---then we can expect that when one event from a subepisode occurs,
the remaining events also occur close by. This approach increases the
expected support of the episode, and if this increase explains the observed
support, then we can deem the episode uninteresting. We demonstrate in our
experiments that using the partition model can effectively and efficiently
reduce the redundancy in episodes.
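The partition-model idea is easiest to see in its itemset form, which the abstract cites as the starting point: split a pattern into two subpatterns and compare the observed support against the support expected if the two occurred independently. The numbers below are hypothetical; the episode version, as the abstract notes, requires considerably more machinery.

```python
def expected_support(support_x, support_y, n):
    """Expected support of the combined pattern under the assumption
    that subpatterns X and Y occur independently."""
    return support_x * support_y / n

# Hypothetical supports in a database of n transactions.
n = 1000
support_xy = 80                      # observed support of the pattern
support_x, support_y = 300, 250      # supports of the two subpatterns

expected = expected_support(support_x, support_y, n)  # 75.0
lift = support_xy / expected
print(lift)  # just above 1: the partition model nearly explains the support
```

A lift close to 1 marks a likely freerider pattern: the observed support is explained by the subpatterns alone.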
On Two Moduli Problems Concerning Number of Points and Equidistribution over Prime Finite Fields
Problems of (i) precise (exact) bound for families of hyperelliptic curves
over prime finite fields and (ii) equidistribution of angles of Kloosterman
sums are discussed. Comment: 5 pages. Extended version of Proc. from Int. Conf. on Discrete Models
in Control System Theory, Krasnovidovo (June 22-27, 1998), Moscow: MG
Density-friendly Graph Decomposition
Decomposing a graph into a hierarchical structure via k-core analysis is a
standard operation in any modern graph-mining toolkit. The k-core decomposition
is a simple and efficient method that makes it possible to analyze a graph
beyond its mere degree distribution. More specifically, it is used to identify
areas in the graph of increasing centrality and connectedness, and it reveals
the structural organization of the graph.
Despite the fact that k-core analysis relies on vertex degrees, k-cores
do not satisfy a certain, rather natural, density property. Simply put, the
most central k-core is not necessarily the densest subgraph. This
inconsistency between k-cores and graph density provides the basis of our
study.
We start by defining what it means for a subgraph to be locally-dense, and we
show that our definition entails a nested chain decomposition of the graph,
similar to the one given by k-cores, but in this case the components are
arranged in order of increasing density. We show that such a locally-dense
decomposition for a graph can be computed in polynomial time, and the exact
decomposition algorithm is significantly faster in practice than its
worst-case bound suggests. In addition, we develop a linear-time
algorithm that provides a factor-2 approximation to the optimal locally-dense
decomposition. Furthermore, we show that the k-core decomposition is also a
factor-2 approximation; however, as demonstrated by our experimental
evaluation, in practice k-cores have a different structure than locally-dense
subgraphs, and, as predicted by the theory, k-cores are not always
well-aligned with graph density. Comment: Journal version of the conference version.
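The k-core decomposition that this abstract builds on can be sketched with the standard peeling procedure: repeatedly remove a minimum-degree vertex, recording the largest degree seen so far as each vertex's core number. This is a plain pure-Python sketch of the classical baseline, not the authors' locally-dense algorithm; the edge list is hypothetical.

```python
from collections import defaultdict

def core_numbers(edges):
    """Core number of every vertex via min-degree peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = {v: len(nbrs) for v, nbrs in adj.items()}
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=remaining.get)
        k = max(k, remaining[v])       # core numbers never decrease
        core[v] = k
        del remaining[v]
        for u in adj[v]:
            if u in remaining:
                remaining[u] -= 1      # peel v away from its neighbours
    return core

# A triangle (vertices 0, 1, 2) with a pendant vertex 3: the triangle
# forms the 2-core, while the pendant vertex only reaches the 1-core.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(sorted(core_numbers(edges).items()))  # [(0, 2), (1, 2), (2, 2), (3, 1)]
```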
Discovering Bands from Graphs
Discovering the underlying structure of a given graph is one of the
fundamental goals in graph mining. Given a graph, we can often order vertices
in a way that neighboring vertices have a higher probability of being connected
to each other. This implies that the edges form a band around the diagonal in
the adjacency matrix. Such a structure may arise, for example, if the graph was
created over time: each vertex had an active time interval during which the
vertex was connected with other active vertices.
The goal of this paper is to model this phenomenon. To this end, we formulate
an optimization problem: given a graph and an integer K, we want to order the
graph vertices and partition the ordered adjacency matrix into K bands such
that bands closer to the diagonal are denser. We measure the goodness of a
segmentation using the log-likelihood of a log-linear model, a flexible family
of distributions containing many standard distributions. We divide the problem
into two subproblems: finding the order and finding the bands. We show that
discovering bands can be done in polynomial time with isotonic regression, and
we also introduce a heuristic iterative approach. For discovering the order we
use the Fiedler order accompanied by a simple combinatorial refinement. We
demonstrate empirically that our heuristic works well in practice.
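The isotonic-regression step the abstract relies on can be sketched with the textbook pool-adjacent-violators algorithm: given raw band densities ordered by distance from the diagonal, fit the closest non-increasing sequence. This is a generic PAV sketch, not the paper's exact routine, and the densities are hypothetical.

```python
def isotonic_decreasing(values):
    """Pool-adjacent-violators fit of a non-increasing sequence to
    `values` (band densities must not increase away from the diagonal)."""
    blocks = []  # each block: [mean, count]
    for v in values:
        blocks.append([v, 1])
        # merge adjacent blocks while the ordering constraint is violated
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    fit = []
    for m, c in blocks:
        fit.extend([m] * c)
    return fit

# Hypothetical raw band densities, ordered by distance from the diagonal.
densities = [0.9, 0.5, 0.6, 0.2]
print(isotonic_decreasing(densities))  # [0.9, 0.55, 0.55, 0.2]
```

The out-of-order pair (0.5, 0.6) is pooled into its average, restoring the requirement that density decreases away from the diagonal.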
Safe projections of binary data sets
Selectivity estimation of a boolean query based on frequent itemsets can be
solved by formulating the problem as a linear program. However, the number of
variables in the equations is exponential, rendering the approach tractable
only for small-dimensional cases. One natural approach would be to project the
data to the variables occurring in the query. This can, however, change the
outcome of the linear program.
We introduce the concept of safe sets: projecting the data to a safe set does
not change the outcome of the linear program. We characterise safe sets using
graph theoretic concepts and give an algorithm for finding minimal safe sets
containing given attributes. We describe a heuristic algorithm for finding
almost-safe sets given a size restriction, and show empirically that these sets
outperform the trivial projection.
We also show a connection between safe sets and Markov Random Fields and use
it to further reduce the number of variables in the linear program, given some
regularity assumptions on the frequent itemsets.
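In its smallest instance, two attributes with known marginal frequencies, the selectivity linear program reduces to the classical Fréchet bounds on the conjunction's frequency. The sketch below shows only this trivial case as an intuition pump, not the paper's safe-set machinery; the marginals are hypothetical.

```python
def frechet_bounds(fr_a, fr_b):
    """Tight bounds on fr(a AND b) given only the marginal frequencies:
    the two-attribute case of the selectivity linear program."""
    lower = max(0.0, fr_a + fr_b - 1.0)
    upper = min(fr_a, fr_b)
    return lower, upper

# Hypothetical marginal frequencies of attributes a and b.
print(frechet_bounds(0.75, 0.5))  # (0.25, 0.5)
```

With more attributes the feasible region is no longer described by such closed-form bounds, which is why the full problem needs a linear program with exponentially many variables.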
Itemsets for Real-valued Datasets
Pattern mining is one of the most well-studied subfields in exploratory data
analysis. While there is a significant amount of literature on how to discover
and rank itemsets efficiently from binary data, there is surprisingly little
research done in mining patterns from real-valued data. In this paper we
propose a family of quality scores for real-valued itemsets. We approach the
problem by casting the dataset into binary data and computing the
support from this data. This naive approach requires us to select thresholds.
To remedy this, instead of selecting one set of thresholds, we treat thresholds
as random variables and compute the average support. We show that we can
compute this support efficiently, and we also introduce two normalisations,
namely comparing the support against the independence assumption and, more
generally, against the partition assumption. Our experimental evaluation
demonstrates that we can discover statistically significant patterns
efficiently.
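The thresholds-as-random-variables idea above can be sketched as follows: if each attribute's threshold is drawn uniformly from a fixed range, a row supports the itemset with probability equal to the product of its per-attribute exceedance probabilities, and the average support is the mean of these products. The data and range are hypothetical, and independent uniform thresholds are an assumption of this sketch.

```python
def average_support(rows, lo, hi):
    """Average support of a real-valued itemset when each attribute's
    threshold is drawn uniformly and independently from [lo, hi]."""
    total = 0.0
    for row in rows:
        p = 1.0
        for x in row:
            # probability that x exceeds a uniform threshold on [lo, hi]
            p *= min(max((x - lo) / (hi - lo), 0.0), 1.0)
        total += p
    return total / len(rows)

# Hypothetical two-attribute real-valued data with values in [0, 1].
rows = [(0.9, 0.8), (0.2, 0.4), (0.7, 0.7)]
print(average_support(rows, 0.0, 1.0))  # 0.43
```

No single threshold pair has to be chosen: the score integrates the naive binarized support over all of them.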