87,264 research outputs found
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-) Frequent Itemsets (FI's) and Association
Rules (AR's) are fundamental primitives in data mining and database
applications. Exact algorithms for these problems exist and are widely used,
but their running time is hindered by the need of scanning the entire dataset,
possibly multiple times. High quality approximations of FI's and AR's are
sufficient for most practical uses, and a number of recent works explored the
application of sampling for fast discovery of approximate solutions to the
problems. However, these works do not provide satisfactory performance
guarantees on the quality of the approximation, due to the difficulty of
bounding the probability of under- or over-sampling any one of an unknown
number of frequent itemsets. In this work we circumvent this issue by applying
the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop
a novel technique for providing tight bounds on the sample size that guarantees
approximation within user-specified parameters. Our technique applies both to
absolute and to relative approximations of (top-) FI's and AR's. The
resulting sample size is linearly dependent on the VC-dimension of a range
space associated with the dataset to be mined. The main theoretical
contribution of this work is a proof that the VC-dimension of this range space
is upper bounded by an easy-to-compute characteristic quantity of the dataset
which we call \emph{d-index}, and is the maximum integer such that the
dataset contains at least transactions of length at least such that no
one of them is a superset of or equal to another. We show that this bound is
strict for a large class of datasets.Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the
proceedings of ECML PKDD 201
Data mining in medical records for the enhancement of strategic decisions: a case study
The impact and popularity of competition concept has been increasing in the last decades and this concept has escalated the importance of giving right decision for organizations. Decision makers have encountered the fact of using proper scientific methods instead of using intuitive and emotional choices in decision making process. In this context, many decision support models and relevant systems are still being developed in order to assist the strategic management mechanisms. There is also a critical need for automated approaches for effective and efficient utilization of massive amount of data to support corporate and individuals in strategic planning and decision-making. Data mining techniques have been used to uncover hidden patterns and relations, to summarize the data in novel ways that are both understandable and useful to the executives and also to predict future trends and behaviors in business. There has been a large body of research and practice focusing on different data mining techniques and methodologies. In this study, a large volume of record set extracted from an outpatient clinic’s medical database is used to apply data mining techniques. In the first phase of the study, the raw data in the record set are collected, preprocessed, cleaned up and eventually transformed into a suitable format for data mining. In the second phase, some of the association rule algorithms are applied to the data set in order to uncover rules for quantifying the relationship between some of the attributes in the medical records. The results are observed and comparative analysis of the observed results among different association algorithms is made. The results showed us that some critical and reasonable relations exist in the outpatient clinic operations of the hospital which could aid the hospital management to change and improve their managerial strategies regarding the quality of services given to outpatients.Decision Making, Medical Records, Data Mining, Association Rules, Outpatient Clinic.
- …