
    Maximum Entropy Based Significance of Itemsets

    We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback-Leibler divergence. A major advantage compared to previous methods is that we are able to use richer models, whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong, but applying more flexible models leads to good results. Comment: Journal version. The previous version is the conference paper.
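
    The abstract contrasts the proposed maximum-entropy estimate with the simpler independence baseline. As a rough illustration of the general idea (and only of the baseline: the paper's richer model is fitted from all sub-itemset frequencies, which needs iterative scaling rather than a simple product), one could score an itemset by the Kullback-Leibler divergence between its observed frequency and an estimated frequency; the data layout and function names below are assumptions of the sketch.

        import math
        import numpy as np

        def bernoulli_kl(p, q, eps=1e-12):
            # KL divergence between Bernoulli(p) and Bernoulli(q)
            p = min(max(p, eps), 1 - eps)
            q = min(max(q, eps), 1 - eps)
            return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

        def independence_significance(data, itemset):
            # data: boolean matrix (rows x items), itemset: list of column indices.
            # Deviation of the observed support from the independence estimate.
            observed = data[:, itemset].all(axis=1).mean()
            expected = data[:, itemset].mean(axis=0).prod()
            return bernoulli_kl(observed, expected)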

    Ranking Episodes using a Partition Model

    One of the biggest setbacks in traditional frequent pattern mining is that overwhelmingly many of the discovered patterns are redundant. A prototypical example of such redundancy is a freerider pattern, where the pattern contains a true pattern and some additional noise events. A technique for filtering freerider patterns that has proved to be efficient in ranking itemsets is to use a partition model, where a pattern is divided into two subpatterns and the observed support is compared to the expected support under the assumption that these two subpatterns occur independently. In this paper we develop a partition model for episodes, patterns discovered from sequential data. An episode is essentially a set of events, with possible restrictions on the order of events. Unlike with itemset mining, computing the expected support of an episode requires surprisingly sophisticated methods. In order to construct the model, we partition the episode into two subepisodes. We then model how likely the events in each subepisode are to occur close to each other. If this probability is high---which is often the case if the subepisode has a high support---then we can expect that when one event from a subepisode occurs, the remaining events also occur close by. This approach increases the expected support of the episode, and if this increase explains the observed support, then we can deem the episode uninteresting. We demonstrate in our experiments that using the partition model can effectively and efficiently reduce the redundancy in episodes.
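
    As a point of reference for the partition model the abstract builds on, here is a minimal sketch of the itemset version: split the pattern into two parts and compare the observed support of the whole pattern to the product of the parts' supports. The episode case requires modelling how close together the events of each subepisode occur, which this sketch does not attempt; the data layout is an assumption.

        import numpy as np

        def partition_score(data, part_a, part_b):
            # data: boolean matrix (rows x items); part_a, part_b: disjoint lists of
            # column indices whose union is the pattern. Ratio of observed support to
            # the expected support under independence of the two parts.
            support = lambda cols: data[:, cols].all(axis=1).mean()
            observed = support(list(part_a) + list(part_b))
            expected = support(list(part_a)) * support(list(part_b))
            return observed / expected

    In practice one would evaluate the pattern against its best-explaining partition, for example by taking the minimum of this ratio over all two-way splits.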

    Discovering bursts revisited: guaranteed optimization of the model parameters

    One of the classic data mining tasks is to discover bursts: time intervals where events occur at an abnormally high rate. In this paper we revisit Kleinberg's seminal work, where bursts are discovered by using an exponential distribution with a varying rate parameter: the regions where it is more advantageous to set the rate higher are deemed bursty. The model depends on two parameters, the initial rate and the change rate. The initial rate, that is, the rate used when there is no burstiness, was set to the average rate over the whole sequence. The change rate is provided by the user. We argue that these choices are suboptimal: they lead to a worse likelihood and may cause some existing bursts to be missed. We propose an alternative problem setting, where the model parameters are selected by optimizing the likelihood of the model. While this tweak is trivial from the problem-definition point of view, it changes the optimization problem greatly. To solve the problem in practice, we propose efficient (1 + ε)-approximation schemes. Finally, we demonstrate empirically that with this setting we are able to discover bursts that would otherwise have remained undetected.
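
    For orientation, the following is a minimal sketch of the fixed-parameter, two-state model the abstract starts from: the base rate and the change rate (here scale) are given, the burst state pays an entry cost gamma, and a Viterbi-style dynamic program labels each inter-arrival gap. The paper's contribution is precisely to optimize these parameters by likelihood, which the sketch does not do; all names and defaults are assumptions.

        import math

        def burst_states(gaps, base_rate, scale=2.0, gamma=1.0):
            # gaps: inter-arrival times; state 0 uses base_rate, state 1 uses
            # scale * base_rate; returns the cheapest state sequence (1 = bursty).
            rates = (base_rate, scale * base_rate)
            nll = lambda gap, lam: lam * gap - math.log(lam)  # -log exp. density
            cost, back = [0.0, gamma], []
            for gap in gaps:
                step, new_cost = [], []
                for j, lam in enumerate(rates):
                    trans = [cost[i] + (gamma if j > i else 0.0) for i in range(2)]
                    best = min(range(2), key=lambda i: trans[i])
                    step.append(best)
                    new_cost.append(trans[best] + nll(gap, lam))
                back.append(step)
                cost = new_cost
            state = min(range(2), key=lambda i: cost[i])
            path = []
            for step in reversed(back):
                path.append(state)
                state = step[state]
            return path[::-1]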

    Safe projections of binary data sets

    Selectivity estimation of a boolean query based on frequent itemsets can be solved by formulating the problem as a linear program. However, the number of variables in the equations is exponential, rendering the approach tractable only for small-dimensional cases. One natural approach would be to project the data to the variables occurring in the query. This can, however, change the outcome of the linear program. We introduce the concept of safe sets: projecting the data to a safe set does not change the outcome of the linear program. We characterise safe sets using graph-theoretic concepts and give an algorithm for finding minimal safe sets containing given attributes. We describe a heuristic algorithm for finding almost-safe sets given a size restriction, and show empirically that these sets outperform the trivial projection. We also show a connection between safe sets and Markov Random Fields and use it to further reduce the number of variables in the linear program, given some regularity assumptions on the frequent itemsets.
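
    To make the linear-programming formulation concrete, the sketch below bounds the selectivity of a conjunctive query from known itemset frequencies by optimizing over the joint distribution of all binary rows. The number of variables is 2^d in the number of attributes d, which is exactly the blow-up that motivates projecting to a (safe) subset of attributes; the interface is an illustrative assumption.

        import itertools
        import numpy as np
        from scipy.optimize import linprog

        def query_bounds(itemset_freqs, query, n_attrs):
            # itemset_freqs: dict mapping frozensets of attribute indices to their
            # frequencies; query: attribute indices whose conjunction we bound.
            rows = list(itertools.product([0, 1], repeat=n_attrs))
            covers = lambda row, attrs: all(row[a] == 1 for a in attrs)
            c = np.array([1.0 if covers(r, query) else 0.0 for r in rows])
            A_eq = [np.ones(len(rows))]   # the distribution sums to one
            b_eq = [1.0]
            for attrs, freq in itemset_freqs.items():
                A_eq.append(np.array([1.0 if covers(r, attrs) else 0.0 for r in rows]))
                b_eq.append(freq)
            lo = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
            hi = linprog(-c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
            return lo.fun, -hi.fun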

    Itemsets for Real-valued Datasets

    Pattern mining is one of the most well-studied subfields in exploratory data analysis. While there is a significant amount of literature on how to discover and rank itemsets efficiently from binary data, there is surprisingly little research done on mining patterns from real-valued data. In this paper we propose a family of quality scores for real-valued itemsets. We approach the problem by casting the dataset into binary data and computing the support from this data. This naive approach requires us to select thresholds. To remedy this, instead of selecting one set of thresholds, we treat the thresholds as random variables and compute the average support. We show that we can compute this support efficiently, and we also introduce two normalisations, namely comparing the support against the independence assumption and, more generally, against the partition assumption. Our experimental evaluation demonstrates that we can discover statistically significant patterns efficiently.
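
    The averaging over thresholds can be pictured with a small Monte Carlo stand-in: draw one threshold per attribute, binarize, compute the support, and average. The paper computes this expectation exactly and efficiently; the uniform threshold distribution and the interface below are assumptions made only for illustration.

        import numpy as np

        def average_support(data, itemset, n_samples=1000, seed=0):
            # data: real-valued matrix (rows x attributes); itemset: column indices.
            # Thresholds are drawn uniformly between each attribute's min and max.
            rng = np.random.default_rng(seed)
            cols = data[:, itemset]
            lo, hi = cols.min(axis=0), cols.max(axis=0)
            total = 0.0
            for _ in range(n_samples):
                thresholds = rng.uniform(lo, hi)
                total += (cols >= thresholds).all(axis=1).mean()
            return total / n_samples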

    Are your Items in Order?

    Items in many datasets can be arranged into a natural order. Such orders are useful since they can provide new knowledge about the data and may ease further data exploration and visualization. Our goal in this paper is to define a statistically well-founded and objective score measuring the quality of an order. Such a measure can be used for determining whether the current order carries any valuable information or whether it can be discarded. Intuitively, we say that the order is good if dependent attributes are close to each other. To define the order score we fit an order-sensitive model to the dataset. Our model resembles a Markov chain model, that is, the attributes depend only on their immediate neighbors. The score of the order is the BIC score of the best model. For computing the measure we introduce a fast dynamic program. The score is then compared against random orders: if it is better than the scores of the random orders, we say that the order is good. We also show the asymptotic connection between the score function and the number of free parameters of the model. In addition, we introduce a simple greedy approach for finding an order with a good score. We evaluate the score for synthetic and real datasets using different spectral orders and the orders obtained with the greedy method.
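
    The scoring idea can be illustrated with a stripped-down chain model for binary attributes: each attribute depends only on its left neighbour in the candidate order, the fit is scored with BIC, and the result is compared against random permutations. The paper's order-sensitive model and its dynamic program are richer than this; the Laplace smoothing and interface are assumptions of the sketch.

        import numpy as np

        def chain_bic(data, order, alpha=0.5):
            # data: boolean matrix (rows x attributes); order: permutation of columns.
            n = data.shape[0]
            X = data[:, order].astype(int)
            ones = X[:, 0].sum()
            p1 = (ones + alpha) / (n + 2 * alpha)
            ll = ones * np.log(p1) + (n - ones) * np.log(1 - p1)
            k = 1  # number of free parameters
            for i in range(1, X.shape[1]):
                for prev in (0, 1):
                    mask = X[:, i - 1] == prev
                    m, o = mask.sum(), X[mask, i].sum()
                    p = (o + alpha) / (m + 2 * alpha)
                    ll += o * np.log(p) + (m - o) * np.log(1 - p)
                    k += 1
            return ll - 0.5 * k * np.log(n)

    A candidate order would then be deemed good if its chain_bic value exceeds the scores obtained for a batch of random permutations of the columns.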

    Discovering Bands from Graphs

    Discovering the underlying structure of a given graph is one of the fundamental goals in graph mining. Given a graph, we can often order vertices in a way that neighboring vertices have a higher probability of being connected to each other. This implies that the edges form a band around the diagonal in the adjacency matrix. Such structure may arise, for example, if the graph was created over time: each vertex had an active time interval during which the vertex was connected with other active vertices. The goal of this paper is to model this phenomenon. To this end, we formulate an optimization problem: given a graph and an integer K, we want to order graph vertices and partition the ordered adjacency matrix into K bands such that bands closer to the diagonal are denser. We measure the goodness of a segmentation using the log-likelihood of a log-linear model, a flexible family of distributions containing many standard distributions. We divide the problem into two subproblems: finding the order and finding the bands. We show that discovering bands can be done in polynomial time with isotonic regression, and we also introduce a heuristic iterative approach. For discovering the order we use the Fiedler order accompanied by a simple combinatorial refinement. We demonstrate empirically that our heuristic works well in practice.
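
    The ordering step named in the abstract, the Fiedler order, is easy to sketch on its own: take the eigenvector of the second-smallest eigenvalue of the graph Laplacian and sort the vertices by its entries. The band segmentation via isotonic regression and the combinatorial refinement are separate steps not shown here; a dense adjacency matrix is assumed for brevity.

        import numpy as np

        def fiedler_order(adj):
            # adj: symmetric 0/1 adjacency matrix (dense numpy array).
            adj = np.asarray(adj, dtype=float)
            laplacian = np.diag(adj.sum(axis=1)) - adj
            eigvals, eigvecs = np.linalg.eigh(laplacian)
            fiedler = eigvecs[:, np.argsort(eigvals)[1]]
            return np.argsort(fiedler)  # vertex order; adj[order][:, order] is banded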

    Density-friendly Graph Decomposition

    Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows one to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it allows one to reveal the structural organization of the graph. Despite the fact that k-core analysis relies on vertex degrees, k-cores do not satisfy a certain, rather natural, density property. Simply put, the most central k-core is not necessarily the densest subgraph. This inconsistency between k-cores and graph density provides the basis of our study. We start by defining what it means for a subgraph to be locally dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by k-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph G = (V, E) can be computed in polynomial time. The running time of the exact decomposition algorithm is O(|V|^2 |E|), but it is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the k-core decomposition is also a factor-2 approximation; however, as demonstrated by our experimental evaluation, in practice k-cores have a different structure than locally-dense subgraphs, and, as predicted by the theory, k-cores are not always well-aligned with graph density. Comment: Journal version of the conference version.
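
    For reference, the k-core decomposition that the paper compares against can be computed with the standard peeling algorithm: repeatedly remove a minimum-degree vertex, and record for each vertex the largest minimum degree seen up to its removal. The locally-dense decomposition itself is more involved; the adjacency-list interface below is an assumption.

        import heapq

        def core_numbers(adj):
            # adj: dict mapping each vertex to a set of its neighbours.
            degree = {v: len(ns) for v, ns in adj.items()}
            heap = [(d, v) for v, d in degree.items()]
            heapq.heapify(heap)
            removed, core, current = set(), {}, 0
            while heap:
                d, v = heapq.heappop(heap)
                if v in removed or d != degree[v]:
                    continue  # stale heap entry
                current = max(current, d)
                core[v] = current
                removed.add(v)
                for u in adj[v]:
                    if u not in removed:
                        degree[u] -= 1
                        heapq.heappush(heap, (degree[u], u))
            return core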

    Distances between Data Sets Based on Summary Statistics

    The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from the data sets. The initial definition of our distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and that it has several intuitive properties. We also show that this distance is the unique Mahalanobis distance satisfying certain assumptions. We also demonstrate that if we are dealing with binary data sets, then the distance can be represented naturally by certain parity functions, and that it can be evaluated in linear time. Our empirical tests with real-world data show that the distance works well.
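
    The general shape of such a distance is easy to picture: collect a vector of summary statistics from each data set and compare the vectors with a Mahalanobis-type form. The sketch below only shows this generic form with a user-supplied covariance; the paper derives the specific, provably unique choice of that form from geometric assumptions.

        import numpy as np

        def summary_distance(stats_a, stats_b, cov):
            # stats_a, stats_b: summary-statistic vectors (e.g. itemset frequencies)
            # computed from the two data sets; cov: covariance estimate for the
            # statistics. Returns sqrt((a - b)^T cov^{-1} (a - b)).
            diff = np.asarray(stats_a, dtype=float) - np.asarray(stats_b, dtype=float)
            return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))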

    Mining Closed Strict Episodes

    Discovering patterns in a sequence is an important aspect of data mining. One popular choice of such patterns are episodes, patterns in sequential data describing events that often occur in the vicinity of each other. Episodes also enforce in which order the events are allowed to occur. In this work we introduce a technique for discovering closed episodes. Adapting existing approaches for discovering traditional patterns, such as closed itemsets, to episodes is not straightforward. First of all, we cannot define a unique closure based on frequency because an episode may have several closed superepisodes. Moreover, to define a closedness concept for episodes we need a subset relationship between episodes, which is not trivial to define. We approach these problems by introducing strict episodes. We argue that this class is general enough, and at the same time we are able to define a natural subset relationship within it and use it efficiently. In order to mine closed episodes we define an auxiliary closure operator. We show that this closure satisfies the needed properties, so that we can use the existing framework for mining closed patterns. Discovering the true closed episodes can be done as a post-processing step. We combine these observations into an efficient mining algorithm and demonstrate empirically its performance in practice. Comment: Journal version. The previous version is the conference version.
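
    Closedness itself is easiest to see in the itemset setting the abstract refers to: a frequent pattern is closed if no proper superpattern has the same support, and non-closed patterns can be filtered out in a post-processing pass. The sketch below does exactly that for itemsets; episodes need the strict-episode subset relation and the auxiliary closure operator developed in the paper.

        def closed_itemsets(frequent):
            # frequent: dict mapping frozenset itemsets to their supports.
            closed = {}
            for itemset, support in frequent.items():
                if not any(itemset < other and support == frequent[other]
                           for other in frequent):
                    closed[itemset] = support
            return closed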