An efficient closed frequent itemset miner for the MOA stream mining system

Author: Bifet Figuerol Albert Carles
Gavaldà Mestre Ricard
Quadrana Massimo
Publication date: 01/01/2013

Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.

Postprint (published version)
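
As an illustration of the "frequent closed itemset" notion this abstract relies on, here is a minimal brute-force sketch in Python. It is not IncMine and not the MOA implementation; it is a batch toy on a hand-made dataset, and the function names (`support_count`, `frequent_closed_itemsets`) are hypothetical. An itemset is treated as closed when no proper superset has the same support.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_closed_itemsets(transactions, min_support):
    """Brute-force enumeration; only suitable for tiny illustrative datasets."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    # Step 1: collect every itemset whose relative support reaches the threshold.
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            count = support_count(candidate, transactions)
            if count >= min_support * n:
                frequent[candidate] = count

    # Step 2: keep an itemset only if no proper superset has the same count.
    # A superset with equal support is itself frequent, so checking the
    # frequent itemsets is sufficient.
    return {
        x: c
        for x, c in frequent.items()
        if not any(x < y and c == cy for y, cy in frequent.items())
    }


transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
closed = frequent_closed_itemsets(transactions, min_support=0.5)
for itemset, count in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

In this toy dataset `{b}` is frequent but not closed, because `{a, b}` occurs in exactly the same transactions; reporting only closed itemsets therefore loses no support information while shrinking the output.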

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

A Model-Based Frequency Constraint for Mining Associations from Transaction Data

Author: Hahsler Michael
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single user-specified support threshold is used to decide if associations should be further investigated. Support has some known problems with rare items, favors shorter itemsets and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier to set and interpret by the user.

arXiv.org e-Print Archive

CiteSeerX

Interactive Constrained Association Rule Mining

Author: Bussche Jan Van den
Goethals Bart
Publication date: 01/01/2003

We investigate ways to support interactive mining sessions, in the setting of association rule mining. In such sessions, users specify conditions (queries) on the associations to be generated. Our approach is a combination of the integration of querying conditions inside the mining phase, and the incremental querying of already generated associations. We present several concrete algorithms and compare their performance.

Comment: A preliminary report on this work was presented at the Second International Conference on Knowledge Discovery and Data Mining (DaWaK 2000)

arXiv.org e-Print Archive

CiteSeerX

Finding the True Frequent Itemsets

Author: Riondato Matteo
Vandin Fabio
Publication date: 01/01/2013

Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires to identify all itemsets appearing in at least a fraction $\theta$ of a transactional dataset $\mathcal{D}$. Often though, the ultimate goal of mining $\mathcal{D}$ is not an analysis of the dataset \emph{per se}, but the understanding of the underlying process that generated it. Specifically, in many applications $\mathcal{D}$ is a collection of samples obtained from an unknown probability distribution $\pi$ on transactions, and by extracting the FIs in $\mathcal{D}$ one attempts to infer itemsets that are frequently (i.e., with probability at least $\theta$) generated by $\pi$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of \emph{false positives}, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold $\hat{\theta}$ such that the collection of itemsets with frequency at least $\hat{\theta}$ in $\mathcal{D}$ contains only TFIs with probability at least $1-\delta$, for some user-specified $\delta$. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of $\mathcal{D}$ at frequency $\theta$ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.

Comment: 13 pages, Extended version of work appeared in SIAM International Conference on Data Mining, 201
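
To make the idea of a raised threshold $\hat{\theta}$ concrete, the following Python sketch computes one using a Hoeffding bound combined with a union bound over all candidate itemsets. This is only a simple stand-in for the kind of standard-bound baseline the abstract mentions; it is not the VC-dimension method of the paper, and the function name and the example numbers are assumptions chosen for illustration. With $n$ sampled transactions and $m$ candidate itemsets, taking $\varepsilon = \sqrt{\ln(m/\delta)/(2n)}$ ensures that, with probability at least $1-\delta$, no itemset has an empirical frequency exceeding its true frequency by $\varepsilon$ or more, so itemsets with frequency at least $\theta + \varepsilon$ in the sample are all TFIs.

```python
import math


def hoeffding_union_threshold(theta, n, num_itemsets, delta):
    """Return theta + eps, where eps comes from Hoeffding plus a union bound.

    theta        -- target true-frequency threshold
    n            -- number of transactions in the sample
    num_itemsets -- number of candidate itemsets covered by the union bound
    delta        -- allowed probability of reporting any false positive
    """
    eps = math.sqrt(math.log(num_itemsets / delta) / (2 * n))
    return theta + eps


# Hypothetical setting: 20 distinct items, so 2**20 - 1 non-empty itemsets.
threshold = hoeffding_union_threshold(
    theta=0.05, n=100_000, num_itemsets=2**20 - 1, delta=0.1
)
print(threshold)
```

Because the union bound ranges over every possible itemset, the margin grows with the number of items, which is one reason tighter, data-dependent bounds are attractive.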

arXiv.org e-Print Archive

CiteSeerX

Crossref

University of Southern Denmark Research Output

Archivio istituzionale della ricerca - Università di Padova

A Framework for High-Accuracy Privacy-Preserving Mining

Author: Agrawal Shipra
Haritsa Jayant R.
Publication date: 01/01/2004

To preserve client privacy in the data mining process, a variety of techniques based on random perturbation of data records have been proposed recently. In this paper, we present a generalized matrix-theoretic model of random perturbation, which facilitates a systematic approach to the design of perturbation mechanisms for privacy-preserving mining. Specifically, we demonstrate that (a) the prior techniques differ only in their settings for the model parameters, and (b) through appropriate choice of parameter settings, we can derive new perturbation techniques that provide highly accurate mining results even under strict privacy guarantees. We also propose a novel perturbation mechanism wherein the model parameters are themselves characterized as random variables, and demonstrate that this feature provides significant improvements in privacy at a very marginal cost in accuracy. While our model is valid for random-perturbation-based privacy-preserving mining in general, we specifically evaluate its utility here with regard to frequent-itemset mining on a variety of real datasets. The experimental results indicate that our mechanisms incur substantially lower identity and support errors as compared to the prior techniques.
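
A toy sketch of the matrix view of random perturbation may help here. It assumes a single binary attribute (item present or absent) and a symmetric "keep with probability p" mechanism, which is a textbook randomized-response setup rather than the specific mechanisms proposed in the paper; all names are hypothetical. The perturbation matrix `A` maps the true distribution to the expected perturbed distribution, and the miner recovers an estimate by applying the inverse of `A`.

```python
def perturbation_matrix(p):
    """A[v][u] = probability that original value u is reported as value v."""
    return [[p, 1 - p], [1 - p, p]]


def mat_vec(A, x):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]


def invert_2x2(A):
    """Closed-form inverse of a 2x2 matrix; fails when p == 0.5 (singular)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("perturbation matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


true_dist = [0.3, 0.7]               # P(item present), P(item absent)
A = perturbation_matrix(p=0.8)       # keep the true value 80% of the time
observed = mat_vec(A, true_dist)     # expected distribution after perturbation
recovered = mat_vec(invert_2x2(A), observed)
print(observed, recovered)
```

The example works with exact expected distributions; with real perturbed records the observed vector is an empirical estimate, so the recovered distribution carries sampling error that grows as `p` approaches 0.5.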

arXiv.org e-Print Archive

CiteSeerX

Open Access Repository of IISc Research Publications