1,297 research outputs found
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-) Frequent Itemsets (FI's) and Association
Rules (AR's) are fundamental primitives in data mining and database
applications. Exact algorithms for these problems exist and are widely used,
but their running time is hindered by the need of scanning the entire dataset,
possibly multiple times. High quality approximations of FI's and AR's are
sufficient for most practical uses, and a number of recent works explored the
application of sampling for fast discovery of approximate solutions to the
problems. However, these works do not provide satisfactory performance
guarantees on the quality of the approximation, due to the difficulty of
bounding the probability of under- or over-sampling any one of an unknown
number of frequent itemsets. In this work we circumvent this issue by applying
the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop
a novel technique for providing tight bounds on the sample size that guarantees
approximation within user-specified parameters. Our technique applies both to
absolute and to relative approximations of (top-) FI's and AR's. The
resulting sample size is linearly dependent on the VC-dimension of a range
space associated with the dataset to be mined. The main theoretical
contribution of this work is a proof that the VC-dimension of this range space
is upper bounded by an easy-to-compute characteristic quantity of the dataset
which we call \emph{d-index}, and is the maximum integer such that the
dataset contains at least transactions of length at least such that no
one of them is a superset of or equal to another. We show that this bound is
strict for a large class of datasets.Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the
proceedings of ECML PKDD 201
Mining Top-K Frequent Itemsets Through Progressive Sampling
We study the use of sampling for efficiently mining the top-K frequent
itemsets of cardinality at most w. To this purpose, we define an approximation
to the top-K frequent itemsets to be a family of itemsets which includes
(resp., excludes) all very frequent (resp., very infrequent) itemsets, together
with an estimate of these itemsets' frequencies with a bounded error. Our first
result is an upper bound on the sample size which guarantees that the top-K
frequent itemsets mined from a random sample of that size approximate the
actual top-K frequent itemsets, with probability larger than a specified value.
We show that the upper bound is asymptotically tight when w is constant. Our
main algorithmic contribution is a progressive sampling approach, combined with
suitable stopping conditions, which on appropriate inputs is able to extract
approximate top-K frequent itemsets from samples whose sizes are smaller than
the general upper bound. In order to test the stopping conditions, this
approach maintains the frequency of all itemsets encountered, which is
practical only for small w. However, we show how this problem can be mitigated
by using a variation of Bloom filters. A number of experiments conducted on
both synthetic and real bench- mark datasets show that using samples
substantially smaller than the original dataset (i.e., of size defined by the
upper bound or reached through the progressive sampling approach) enable to
approximate the actual top-K frequent itemsets with accuracy much higher than
what analytically proved.Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and
publication in the ECML PKDD 2010 special issue of the Data Mining and
Knowledge Discovery journa
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track
Interactive Constrained Association Rule Mining
We investigate ways to support interactive mining sessions, in the setting of
association rule mining. In such sessions, users specify conditions (queries)
on the associations to be generated. Our approach is a combination of the
integration of querying conditions inside the mining phase, and the incremental
querying of already generated associations. We present several concrete
algorithms and compare their performance.Comment: A preliminary report on this work was presented at the Second
International Conference on Knowledge Discovery and Data Mining (DaWaK 2000
- …