Towards Rare Itemset Mining
Conference website: http://ictai07.ceid.upatras.gr/
We describe here a general approach for rare itemset mining. While the mining literature has focused almost exclusively on frequent itemsets, in many practical situations rare ones are of higher interest (e.g., in medical databases, rare combinations of symptoms might provide useful insights for physicians). Based on an examination of the relevant substructures of the mining space, our approach splits the rare itemset mining task into two steps: traversal of the frequent itemset part and listing of the rare itemsets. We propose two algorithms for step one, a naive one and an optimized one, and a further algorithm for step two. We also provide some empirical evidence of the performance gains due to the optimized traversal.
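The two-step idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a plain Apriori-style level-wise traversal of the frequent part, and it collects the candidates that fail the support test. Because a candidate is only generated when all of its immediate subsets are frequent, each failing candidate is a minimal rare itemset. The function name `minimal_rare_itemsets` and the parameter `min_count` are hypothetical.

```python
from itertools import combinations


def minimal_rare_itemsets(transactions, min_count):
    """Level-wise traversal returning (frequent itemsets, minimal rare itemsets).

    An itemset is frequent if it occurs in at least `min_count` transactions.
    """
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # candidates of size 1
    size = 1
    frequent, minimal_rare = set(), set()

    while level:
        current_frequent = set()
        for cand in level:
            if count(cand) >= min_count:
                current_frequent.add(cand)
            else:
                # All proper subsets are frequent, so cand is minimal rare.
                minimal_rare.add(cand)
        frequent |= current_frequent

        # Build candidates one item larger, keeping only those whose
        # immediate subsets are all frequent (Apriori pruning).
        size += 1
        next_level = set()
        for a in current_frequent:
            for b in current_frequent:
                union = a | b
                if len(union) == size and all(
                    frozenset(s) in current_frequent
                    for s in combinations(union, size - 1)
                ):
                    next_level.add(union)
        level = list(next_level)

    return frequent, minimal_rare
```

In this sketch the remaining rare itemsets are exactly the supersets of the minimal rare ones, so a second pass could list them from that border without recounting the frequent part.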
A Fast Minimal Infrequent Itemset Mining Algorithm
A novel fast algorithm for finding quasi identifiers in large datasets is
presented. Performance measurements on a broad range of datasets demonstrate
substantial reductions in run-time relative to the state of the art and the
scalability of the algorithm to realistically sized datasets of up to several
million records.
A Model-Based Frequency Constraint for Mining Associations from Transaction Data
Mining frequent itemsets is a popular method for finding associated items in
databases. For this method, support, the co-occurrence frequency of the items
which form an association, is used as the primary indicator of the
association's significance. A single user-specified support threshold is used
to decide whether associations should be further investigated. Support has some
known problems with rare items, favors shorter itemsets and sometimes produces
misleading associations.
In this paper we develop a novel model-based frequency constraint as an
alternative to a single, user-specified minimum support. The constraint
utilizes knowledge of the process generating transaction data by applying a
simple stochastic mixture model (the NB model) which allows for transaction
data's typically highly skewed item frequency distribution. A user-specified
precision threshold is used together with the model to find local frequency
thresholds for groups of itemsets. Based on the constraint we develop the
notion of NB-frequent itemsets and adapt a mining algorithm to find all
NB-frequent itemsets in a database. In experiments with publicly available
transaction databases we show that the new constraint provides improvements
over a single minimum support threshold and that the precision threshold is
more robust and easier for the user to set and interpret.
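The rare-item problem with a single support threshold can be seen in a small sketch. It only computes plain relative support and does not implement the NB model or the NB-frequent constraint described above. The helper name `support` and the toy data are assumptions made for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


# Hypothetical toy data: two common items and two rare items that
# always occur together.
transactions = (
    [{"bread", "milk"}] * 6
    + [{"bread"}] * 2
    + [{"caviar", "vodka"}] * 2
)

min_support = 0.3  # a single, user-specified threshold
common_pair = support({"bread", "milk"}, transactions)
rare_pair = support({"caviar", "vodka"}, transactions)
```

Here the pair of rare items falls below the threshold even though the two items never occur apart, which is the kind of association a single global minimum support tends to miss.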
The Coron System
Coron is a domain- and platform-independent, multi-purpose data mining
toolkit, which not only incorporates a rich collection of data mining
algorithms but also supports a number of auxiliary operations. To the best of
our knowledge, no other data mining toolkit is designed specifically for
itemset extraction and association rule generation as Coron is.
Coron also provides support for preparing and filtering data, and for
interpreting the extracted units of knowledge.
Implications of probabilistic data modeling for rule mining
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this paper we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore how well confidence and lift, two popular interest measures used for rule mining, filter noise. Based on the framework we develop the measure hyperlift and we compare this new measure to lift using simulated data and a real-world grocery database.
Series: Research Report Series / Department of Statistics and Mathematic
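For reference, the two standard measures named in the abstract can be computed as in the sketch below, which uses the usual textbook definitions: confidence(X → Y) = supp(X ∪ Y) / supp(X) and lift(X → Y) = supp(X ∪ Y) / (supp(X) · supp(Y)). Hyperlift is not implemented here. The function name `rule_measures` is hypothetical, and the code assumes both sides of the rule occur in at least one transaction.

```python
def rule_measures(lhs, rhs, transactions):
    """Return (confidence, lift) of the rule lhs -> rhs.

    Assumes lhs and rhs each occur in at least one transaction,
    otherwise the divisions below are undefined.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    # Absolute co-occurrence counts.
    c_lhs = sum(1 for t in transactions if lhs <= t)
    c_rhs = sum(1 for t in transactions if rhs <= t)
    c_both = sum(1 for t in transactions if (lhs | rhs) <= t)

    confidence = c_both / c_lhs
    # Lift compares the observed co-occurrence with the count expected
    # if lhs and rhs were independent: c_lhs * c_rhs / n.
    lift = (c_both * n) / (c_lhs * c_rhs)
    return confidence, lift
```

A lift near 1 is what independence would predict, so on data simulated without associations, values far from 1 are attributable to noise.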
A Constraint Programming Approach for Mining Sequential Patterns in a Sequence Database
Constraint-based pattern discovery is at the core of numerous data mining
tasks. Patterns are extracted with respect to a given set of constraints
(frequency, closedness, size, etc). In the context of sequential pattern
mining, a large number of devoted techniques have been developed for solving
particular classes of constraints. The aim of this paper is to investigate the
use of Constraint Programming (CP) to model and mine sequential patterns in a
sequence database. Our CP approach offers a natural way to combine, within a
single framework, a large set of constraints from various origins.
Experiments show the feasibility and the interest of our approach.