Controlling False Positives in Association Rule Mining
Association rule mining is an important problem in the data mining area. It
enumerates and tests a large number of rules on a dataset and outputs rules
that satisfy user-specified constraints. Due to the large number of rules being
tested, rules that do not represent real systematic effect in the data can
satisfy the given constraints purely by random chance. Hence association rule
mining often suffers from a high risk of false positive errors. There is a lack
of comprehensive study on controlling false positives in association rule
mining. In this paper, we adopt three multiple testing correction
approaches---the direct adjustment approach, the permutation-based approach and
the holdout approach---to control false positives in association rule mining,
and conduct extensive experiments to study their performance. Our results show
that (1) Numerous spurious rules are generated if no correction is made. (2)
The three approaches can control false positives effectively. Among the three
approaches, the permutation-based approach has the highest power of detecting
real association rules, but it is very computationally expensive. We employ
several techniques to reduce its cost effectively.
Comment: VLDB201
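As a rough illustration of the direct adjustment idea, the sketch below scores each candidate rule X -> Y with a one-sided Fisher exact test and keeps only rules whose p-value passes a Bonferroni-adjusted threshold. The abstract does not name the test statistic or the specific adjustment, so both choices, along with the function names and the toy counts, are assumptions made for this example.

```python
from math import comb

def fisher_right_tail(n, n_x, n_y, n_xy):
    """One-sided Fisher exact p-value for a rule X -> Y.

    n    : total number of transactions
    n_x  : transactions containing X
    n_y  : transactions containing Y
    n_xy : transactions containing both
    Returns P(overlap >= n_xy) if X and Y were independent.
    """
    upper = min(n_x, n_y)
    total = comb(n, n_x)
    # Hypergeometric tail; math.comb returns 0 for impossible overlaps.
    tail = sum(comb(n_y, k) * comb(n - n_y, n_x - k)
               for k in range(n_xy, upper + 1))
    return tail / total

def bonferroni_filter(rules, n, alpha=0.05):
    """Keep rules with p <= alpha / (number of rules tested).

    Assumption: Bonferroni stands in for the 'direct adjustment' approach.
    """
    threshold = alpha / len(rules)
    kept = []
    for name, n_x, n_y, n_xy in rules:
        p = fisher_right_tail(n, n_x, n_y, n_xy)
        if p <= threshold:
            kept.append((name, p))
    return kept

# Hypothetical counts: (rule name, n_x, n_y, n_xy) over 100 transactions.
candidate_rules = [("a -> b", 50, 50, 45), ("c -> d", 50, 50, 27)]
significant = bonferroni_filter(candidate_rules, n=100)
print([name for name, _ in significant])
```

With more rules tested, the threshold `alpha / len(rules)` shrinks, which is how the correction keeps chance associations from passing.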
Association Rules Mining Based Clinical Observations
Healthcare institutes continually add to their repositories of patients'
disease-related information, which could be made more useful by carrying out
relational analysis. Data mining algorithms have proven useful for exploring
correlations in large data repositories. In this paper we implement a novel
idea, based on association rule mining, for finding co-occurrences of diseases
carried by a patient using the healthcare repository. We have developed a
system prototype for Clinical State Correlation Prediction (CSCP) which
extracts data from the patients' healthcare database and transforms the OLTP
data into a data warehouse by generating association rules.
The CSCP system helps reveal relations among the diseases. The CSCP system
predicts the correlation(s) between the primary disease (the disease for which
the patient visits the doctor) and secondary disease(s) (other associated
diseases carried by the same patient who has the primary disease).
Comment: 5 pages, MEDINFO 2010, C. Safran et al. (Eds.), IOS Pres
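To make the primary/secondary disease relation concrete, here is a minimal sketch that derives rules of the form primary -> secondary with their confidence. The record layout, the function name `disease_rules`, and the sample visits are hypothetical; the abstract does not describe the CSCP data model or its rule measures.

```python
from collections import Counter

def disease_rules(visits, min_confidence=0.5):
    """Derive primary -> secondary rules from visit records.

    visits: list of (primary_disease, set_of_secondary_diseases) pairs.
    This layout is an assumption made for the example.
    Confidence = visits with both diseases / visits with the primary disease.
    """
    primary_counts = Counter(primary for primary, _ in visits)
    pair_counts = Counter((primary, secondary)
                          for primary, secondaries in visits
                          for secondary in secondaries)
    rules = []
    for (primary, secondary), count in pair_counts.items():
        confidence = count / primary_counts[primary]
        if confidence >= min_confidence:
            rules.append((primary, secondary, confidence))
    return sorted(rules)

# Invented sample records, for illustration only.
sample_visits = [
    ("diabetes", {"hypertension"}),
    ("diabetes", {"hypertension", "neuropathy"}),
    ("diabetes", set()),
    ("asthma", {"allergy"}),
]
for primary, secondary, confidence in disease_rules(sample_visits):
    print(f"{primary} -> {secondary}: confidence {confidence:.2f}")
```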
A Model-Based Frequency Constraint for Mining Associations from Transaction Data
Mining frequent itemsets is a popular method for finding associated items in
databases. For this method, support, the co-occurrence frequency of the items
which form an association, is used as the primary indicator of the
association's significance. A single user-specified support threshold is used
to decide if associations should be further investigated. Support has some
known problems with rare items, favors shorter itemsets and sometimes produces
misleading associations.
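The rare-item problem with a single minimum support can be seen in a small sketch. The function below counts itemsets of up to two items and keeps those meeting one global threshold; the function name, the size cap, and the toy transactions are assumptions for illustration, and this is plain support counting, not the NB-based constraint the paper proposes.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets meeting one global threshold.

    Support is the fraction of transactions that contain the itemset.
    """
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: count / n
            for itemset, count in counts.items()
            if count / n >= min_support}

# Invented transactions: two rare items that always occur together.
baskets = [
    ["bread", "milk"],
    ["bread", "milk"],
    ["bread"],
    ["milk"],
    ["caviar", "champagne"],
]
print(frequent_itemsets(baskets, min_support=0.4))
```

In this toy data the rare pair always co-occurs, yet its support of 0.2 falls below the threshold and the association is discarded, while the weaker but more common pair survives.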
In this paper we develop a novel model-based frequency constraint as an
alternative to a single, user-specified minimum support. The constraint
utilizes knowledge of the process generating transaction data by applying a
simple stochastic mixture model (the NB model) which allows for transaction
data's typically highly skewed item frequency distribution. A user-specified
precision threshold is used together with the model to find local frequency
thresholds for groups of itemsets. Based on the constraint we develop the
notion of NB-frequent itemsets and adapt a mining algorithm to find all
NB-frequent itemsets in a database. In experiments with publicly available
transaction databases we show that the new constraint provides improvements
over a single minimum support threshold and that the precision threshold is
more robust and easier to set and interpret by the user.