102 research outputs found
Finding Associations and Computing Similarity via Biased Pair Sampling
This version is ***superseded*** by a full version that can be found at
http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger
theoretical results and fixes a mistake in the reporting of experiments.
Abstract: Sampling-based methods have previously been proposed for the
problem of finding interesting associations in data, even for low-support
items. While these methods do not guarantee precise results, they can be vastly
more efficient than approaches that rely on exact counting. However, for many
similarity measures no such methods have been known. In this paper we show how
a wide variety of measures can be supported by a simple biased sampling method.
The method also extends to find high-confidence association rules. We
demonstrate theoretically that our method is superior to exact methods when the
threshold for "interesting similarity/confidence" is above the average pairwise
similarity/confidence, and the average support is not too low. Our method is
particularly good when transactions contain many items. We confirm in
experiments on standard association mining benchmarks that this gives a
significant speedup on real data sets (sometimes much larger than the
theoretical guarantees). Reductions in computation time of over an order of
magnitude, and significant savings in space, are observed.Comment: This is an extended version of a paper that appeared at the IEEE
International Conference on Data Mining, 2009. The conference version is (c)
2009 IEE
A multithreaded hybrid framework for mining frequent itemsets
Mining frequent itemsets is an area of data mining that has beguiled several researchers in recent years. Varied data structures such as Nodesets, DiffNodesets, NegNodesets, N-lists, and Diffsets are among a few that were employed to extract frequent items. However, most of these approaches fell short either in respect of run time or memory. Hybrid frameworks were formulated to repress these issues that encompass the deployment of two or more data structures to facilitate effective mining of frequent itemsets. Such an approach aims to exploit the advantages of either of the data structures while mitigating the problems of relying on either of them alone. However, limited efforts have been made to reinforce the efficiency of such frameworks. To address these issues this paper proposes a novel multithreaded hybrid framework comprising of NegNodesets and N-list structure that uses the multicore feature of today’s processors. While NegNodesets offer a concise representation of itemsets, N-lists rely on List intersection thereby speeding up the mining process. To optimize the extraction of frequent items a hash-based algorithm has been designed here to extract the resultant set of frequent items which further enhances the novelty of the framework
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-) Frequent Itemsets (FI's) and Association
Rules (AR's) are fundamental primitives in data mining and database
applications. Exact algorithms for these problems exist and are widely used,
but their running time is hindered by the need of scanning the entire dataset,
possibly multiple times. High quality approximations of FI's and AR's are
sufficient for most practical uses, and a number of recent works explored the
application of sampling for fast discovery of approximate solutions to the
problems. However, these works do not provide satisfactory performance
guarantees on the quality of the approximation, due to the difficulty of
bounding the probability of under- or over-sampling any one of an unknown
number of frequent itemsets. In this work we circumvent this issue by applying
the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop
a novel technique for providing tight bounds on the sample size that guarantees
approximation within user-specified parameters. Our technique applies both to
absolute and to relative approximations of (top-) FI's and AR's. The
resulting sample size is linearly dependent on the VC-dimension of a range
space associated with the dataset to be mined. The main theoretical
contribution of this work is a proof that the VC-dimension of this range space
is upper bounded by an easy-to-compute characteristic quantity of the dataset
which we call \emph{d-index}, and is the maximum integer such that the
dataset contains at least transactions of length at least such that no
one of them is a superset of or equal to another. We show that this bound is
strict for a large class of datasets.Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the
proceedings of ECML PKDD 201
- …