An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
As advances in technology allow for the collection, storage, and analysis of
vast amounts of data, the task of screening and assessing the significance of
discovered patterns is becoming a major challenge in data mining applications.
In this work, we address significance in the context of frequent itemset
mining. Specifically, we develop a novel methodology to identify a meaningful
support threshold s* for a dataset, such that the number of itemsets with
support at least s* represents a substantial deviation from what would be
expected in a random dataset with the same number of transactions and the same
individual item frequencies. These itemsets can then be flagged as
statistically significant with a small false discovery rate. We present
extensive experimental results to substantiate the effectiveness of our
methodology.

Comment: A preliminary version of this work was presented at ACM PODS 2009. 20 pages, 0 figures.
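The abstract's core idea, comparing the observed number of frequent itemsets against random datasets with the same number of transactions and the same individual item frequencies, can be sketched as follows. This is a naive illustration restricted to 2-itemsets and a crude "observed exceeds random average" test, not the paper's method; all function names and parameters are invented for the example.

```python
import random
from itertools import combinations

def support_counts(transactions, s):
    """Count 2-itemsets whose support (fraction of transactions
    containing both items) is at least s."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    count = 0
    for a, b in combinations(items, 2):
        supp = sum(1 for t in transactions if a in t and b in t) / n
        if supp >= s:
            count += 1
    return count

def random_dataset(transactions, rng):
    """Random dataset with the same number of transactions and the same
    individual item frequencies, items drawn independently."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {i: sum(1 for t in transactions if i in t) / n for i in items}
    return [{i for i in items if rng.random() < freq[i]} for _ in range(n)]

def significant_threshold(transactions, candidates, trials=20, seed=0):
    """Smallest threshold s among `candidates` at which the observed number
    of frequent 2-itemsets exceeds the average count over random datasets."""
    rng = random.Random(seed)
    randoms = [random_dataset(transactions, rng) for _ in range(trials)]
    for s in sorted(candidates):
        observed = support_counts(transactions, s)
        expected = sum(support_counts(r, s) for r in randoms) / trials
        if observed > expected:
            return s
    return None
```

For example, in a dataset where two items co-occur far more often than their individual frequencies predict, the pair stands out against the random baseline at a threshold where random datasets rarely produce any frequent pair.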
Finding the True Frequent Itemsets
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It
requires identifying all itemsets appearing in at least a fraction θ of
a transactional dataset D. Often, though, the ultimate goal of
mining D is not an analysis of the dataset per se, but an
understanding of the underlying process that generated it. Specifically, in
many applications D is a collection of samples obtained from an
unknown probability distribution π on transactions, and by extracting the
FIs in D one attempts to infer the itemsets that are frequently (i.e.,
with probability at least θ) generated by π, which we call the True
Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the
generative process, the set of FIs is only a rough approximation of the set of
TFIs, as it often contains a huge number of false positives, i.e.,
spurious itemsets that are not among the TFIs. In this work we design and
analyze an algorithm to identify a threshold θ̂ such that the
collection of itemsets with frequency at least θ̂ in D
contains only TFIs with probability at least 1 − δ, for some
user-specified δ. Our method uses results from statistical learning
theory involving the (empirical) VC-dimension of the problem at hand. This
allows us to identify almost all the TFIs without including any false positive.
We also experimentally compare our method with the direct mining of D
at frequency θ and with techniques based on widely used
standard bounds (i.e., the Chernoff bounds) on the binomial distribution, and
show that our algorithm outperforms these methods and achieves even better
results than what is guaranteed by the theoretical analysis.

Comment: 13 pages. Extended version of work that appeared in the SIAM International
Conference on Data Mining, 201
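The standard-bounds baseline the authors compare against can be sketched as follows: a Hoeffding/Chernoff tail bound plus a union bound over the m candidate itemsets inflates the frequency threshold so that, with probability at least 1 − δ, no itemset with true frequency below θ clears it. This is the baseline technique, not the paper's VC-dimension-based method, and the function names are illustrative.

```python
import math

def hoeffding_threshold(theta, n, m, delta):
    """Corrected threshold: an itemset with true frequency below `theta`
    has empirical frequency >= theta + eps on an n-transaction sample with
    probability at most exp(-2*n*eps**2) (Hoeffding); a union bound over
    the m candidates makes the chance of any false positive at most delta."""
    eps = math.sqrt(math.log(m / delta) / (2 * n))
    return theta + eps

def report_tfis(frequencies, theta, n, m, delta):
    """Keep only itemsets whose empirical frequency clears the corrected
    threshold. `frequencies` maps itemset -> empirical frequency."""
    t = hoeffding_threshold(theta, n, m, delta)
    return {s for s, f in frequencies.items() if f >= t}
```

Note the trade-off the abstract alludes to: the union bound over all m candidates makes eps conservative, which is why sharper, VC-dimension-based corrections can recover more true frequent itemsets.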
Evaluation and optimization of frequent association rule based classification
Deriving useful and interesting rules from a data mining system is an essential and important task. Problems
such as the discovery of random and coincidental patterns, or of patterns with no significant value, and the
generation of a large volume of rules from a database commonly occur. Work on sustaining the interestingness
of rules generated by data mining algorithms is actively and constantly being examined and developed. In this
paper, we present a systematic way to evaluate the association rules discovered by frequent itemset mining algorithms,
combining common data mining and statistical interestingness measures, and we outline an appropriate sequence of usage. The experiments are performed on a number of real-world datasets that represent diverse characteristics of data and items, and a detailed evaluation of the rule sets is provided. Empirical results show that, with a proper combination of data mining and statistical analysis, the framework is capable of eliminating a large number of non-significant, redundant and contradictory rules while preserving relatively valuable high-accuracy and high-coverage rules when used in the classification problem. Moreover, the results reveal important characteristics of mining frequent itemsets and the impact of the confidence measure on the classification task.
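A minimal sketch of combining data mining measures (support, confidence) with statistical ones (lift, a 2x2 chi-square test) when screening a single rule A -> B. The thresholds and the filtering order below are illustrative assumptions, not the paper's exact sequence.

```python
def rule_measures(n, n_a, n_b, n_ab):
    """Support, confidence, lift and the chi-square statistic for a rule
    A -> B, from the transaction count n and the counts of A, B and A&B."""
    support = n_ab / n
    confidence = n_ab / n_a
    expected = n_a * n_b / n          # co-occurrences expected under independence
    lift = n_ab / expected
    # 2x2 contingency-table chi-square with 1 degree of freedom
    chi2 = 0.0
    for a_val, b_val, obs in [
        (1, 1, n_ab), (1, 0, n_a - n_ab),
        (0, 1, n_b - n_ab), (0, 0, n - n_a - n_b + n_ab),
    ]:
        ea = n_a if a_val else n - n_a
        eb = n_b if b_val else n - n_b
        exp = ea * eb / n
        chi2 += (obs - exp) ** 2 / exp
    return {"support": support, "confidence": confidence,
            "lift": lift, "chi2": chi2}

def keep_rule(m, min_sup=0.1, min_conf=0.6, min_chi2=3.84):
    """One plausible sequence of usage: prune by support and confidence
    first, then require significance at the 5% level (chi2 > 3.84)."""
    return (m["support"] >= min_sup and m["confidence"] >= min_conf
            and m["chi2"] >= min_chi2)
```

For instance, a rule seen in 240 of 1000 transactions, where the antecedent occurs 300 times and the consequent 400 times, has confidence 0.8 and lift 2.0 and easily passes the chi-square cutoff.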
Prediction of Metabolic Pathways Involvement in Prokaryotic UniProtKB Data by Association Rule Mining
The widening gap between known proteins and their functions has encouraged
the development of methods to automatically infer annotations. Automatic
functional annotation of proteins is expected to meet the conflicting
requirements of maximizing annotation coverage, while minimizing erroneous
functional assignments. This trade-off imposes a great challenge in designing
intelligent systems to tackle the problem of automatic protein annotation. In
this work, we present a system that utilizes rule mining techniques to predict
metabolic pathways in prokaryotes. The resulting knowledge represents
predictive models that assign pathway involvement to UniProtKB entries. We
carried out an evaluation study of our system's performance using a
cross-validation technique. We found that it achieved very promising results in
pathway identification, with an F1-measure of 0.982 and an AUC of 0.987. Our
prediction models were then successfully applied to 6.2 million
UniProtKB/TrEMBL reference proteome entries of prokaryotes. As a result,
663,724 entries were covered, 436,510 of which lacked any previous pathway
annotation.
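For reference, the reported F1-measure is the harmonic mean of precision and recall. A minimal stdlib computation for binary labels (illustrative only, not the system's actual evaluation code):

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```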
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining
We present MCRapper, an algorithm for efficient computation of Monte-Carlo
Empirical Rademacher Averages (MCERA) for families of functions exhibiting
poset (e.g., lattice) structure, such as those that arise in many pattern
mining tasks. The MCERA allows us to compute upper bounds to the maximum
deviation of sample means from their expectations, thus it can be used to find
both statistically-significant functions (i.e., patterns) when the available
data is seen as a sample from an unknown distribution, and approximations of
collections of high-expectation functions (e.g., frequent patterns) when the
available data is a small sample from a large dataset. This feature is a strong
improvement over previously proposed solutions that could only achieve one of
the two. MCRapper uses upper bounds to the discrepancy of the functions to
efficiently explore and prune the search space, a technique borrowed from
pattern mining itself. To show the practical use of MCRapper, we employ it to
develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining.
TFP-R gives guarantees on the probability of including any false positives
(precision) and exhibits higher statistical power (recall) than existing
methods offering the same guarantees. We evaluate MCRapper and TFP-R and show
that they outperform the state of the art for their respective tasks.
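A bare-bones illustration of the MCERA for a family of itemsets, each viewed as the indicator function of its containment in a transaction, together with the classical symmetrization bound on the maximum deviation. MCRapper's actual contribution, exploiting the poset structure to compute this efficiently with pruning, is not shown: the brute-force sup over the family below is exponential in general. All names are illustrative.

```python
import math
import random

def mcera(family, sample, trials=100, seed=0):
    """Monte-Carlo Empirical Rademacher Average: average, over random sign
    vectors sigma, of sup over f in the family of (1/n) * sum_i sigma_i * f(x_i).
    Here each f is a frozenset (an itemset) and f(x) = 1 iff f is contained
    in transaction x."""
    rng = random.Random(seed)
    n = len(sample)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(
            sum(s for s, t in zip(sigma, sample) if f <= t) / n
            for f in family
        )
    return total / trials

def deviation_bound(family, sample, trials=100, seed=0, delta=0.05):
    """Classical bound (a simplification of the sharper ones used by
    MCRapper): with probability >= 1 - delta, the maximum deviation of
    sample means from expectations is at most
    2 * MCERA + O(sqrt(log(1/delta) / n))."""
    n = len(sample)
    return 2 * mcera(family, sample, trials, seed) + 3 * math.sqrt(
        math.log(2 / delta) / (2 * n))
```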
Doctor of Philosophy dissertation

With the growing national dissemination of the electronic health record (EHR), there are expectations that the public will benefit from biomedical research and discovery enabled by electronic health data. Clinical data are needed for many diseases and conditions to meet the demands of rapidly advancing genomic and proteomic research. Many biomedical research advancements require rapid access to clinical data as well as broad population coverage. A fundamental issue in the secondary use of clinical data for scientific research is the identification of study cohorts of individuals with a disease or medical condition of interest. The problem addressed in this work is the need for generalized, efficient methods to identify cohorts in the EHR for use in biomedical research. To approach this problem, an associative classification framework was designed with the goal of accurate and rapid identification of cases for biomedical research: (1) a set of exemplars for a given medical condition is presented to the framework, (2) a predictive rule set comprised of EHR attributes is generated by the framework, and (3) the rule set is applied to the EHR to identify additional patients that may have the specified condition. Based on this functionality, the approach was termed the 'cohort amplification' framework. The development and evaluation of the cohort amplification framework are the subject of this dissertation. An overview of the framework design is presented. Improvements to some standard associative classification methods are described and validated. A qualitative evaluation of predictive rules to identify diabetes cases and a study of the accuracy of identification of asthma cases in the EHR using framework-generated prediction rules are reported. The framework produced accurate and reliable rules to identify diabetes and asthma cases in the EHR and contributed to methods for identification of biomedical research cohorts.
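Step (3) of the framework, applying a learned rule set to EHR records to flag candidate cohort members, might look like the following sketch. The record attributes, rule encoding, and function names here are invented for illustration and are not the dissertation's actual representation.

```python
def matches(rule, record):
    """A rule is a collection of (attribute, value) pairs that must all hold."""
    return all(record.get(attr) == val for attr, val in rule)

def amplify_cohort(records, rules, min_rules=1):
    """Flag every patient record matched by at least `min_rules` of the
    learned predictive rules. `records` maps patient id -> attribute dict."""
    return [rid for rid, rec in records.items()
            if sum(matches(r, rec) for r in rules) >= min_rules]
```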
Controlling False Positives in Association Rule Mining
Association rule mining is an important problem in the data mining area. It
enumerates and tests a large number of rules on a dataset and outputs rules
that satisfy user-specified constraints. Due to the large number of rules being
tested, rules that do not represent real systematic effect in the data can
satisfy the given constraints purely by random chance. Hence association rule
mining often suffers from a high risk of false positive errors. There is a lack
of comprehensive study on controlling false positives in association rule
mining. In this paper, we adopt three multiple testing correction
approaches---the direct adjustment approach, the permutation-based approach and
the holdout approach---to control false positives in association rule mining,
and conduct extensive experiments to study their performance. Our results show
that (1) Numerous spurious rules are generated if no correction is made. (2)
The three approaches can control false positives effectively. Among the three
approaches, the permutation-based approach has the highest power of detecting
real association rules, but it is very computationally expensive. We employ
several techniques to reduce its cost effectively.

Comment: VLDB201
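Two of the three correction approaches admit short sketches: the direct adjustment (here, the simplest variant, a Bonferroni correction) and the permutation-based approach. The holdout approach, which splits the data into an exploratory half and an evaluation half, is omitted. Names and signatures are illustrative.

```python
import random

def bonferroni(p_values, alpha=0.05):
    """Direct adjustment: reject hypothesis i iff p_i <= alpha / m,
    which controls the family-wise error rate at alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def permutation_pvalue(statistic, data, labels, permutations=1000, seed=0):
    """Permutation-based approach: the p-value of an observed statistic is
    the fraction of label permutations yielding a statistic at least as
    extreme (with a +1 correction that keeps the p-value positive)."""
    rng = random.Random(seed)
    observed = statistic(data, labels)
    hits = 0
    shuffled = list(labels)
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if statistic(data, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)
```

The permutation route is the expensive one the abstract mentions: every candidate rule requires recomputing its statistic across all permutations, which is exactly the cost the paper's techniques aim to reduce.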
On the Discovery of Significant Motifs in Genomic Sequences
In this thesis we study the statistical properties of some families of motifs of the same length. We develop a method for approximating the average number of frequent motifs in the family in random texts with independent characters. We give a bound on the approximation error and show that this bound is loose in practice. We also develop a test that verifies whether the number of frequent motifs can be approximated by a Poisson distribution.
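The Poisson approximation in question can be illustrated for the simplest case, where motifs are plain strings over an i.i.d. character model: the occurrence count of a length-k motif in a length-n text is approximately Poisson with rate (n − k + 1) times its per-position probability. The thesis works with more general motif families, so this is only a simplified sketch with illustrative names.

```python
import math

def occurrence_prob(motif, char_probs):
    """Probability that the motif occurs at a fixed text position when
    characters are drawn independently from `char_probs`."""
    p = 1.0
    for c in motif:
        p *= char_probs[c]
    return p

def poisson_tail(lam, s):
    """P[X >= s] for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam**k / math.factorial(k)
                     for k in range(s))

def expected_frequent_motifs(motifs, char_probs, n, s):
    """Poisson approximation of the average number of motifs in the family
    occurring at least s times in a random text of length n."""
    total = 0.0
    for m in motifs:
        lam = (n - len(m) + 1) * occurrence_prob(m, char_probs)
        total += poisson_tail(lam, s)
    return total
```

For example, over a uniform 4-letter DNA alphabet the motif "ACGT" has per-position probability 1/256, so in a text of length 1027 its count is roughly Poisson with rate 4.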
A comparison among interpretative proposals for Random Forests
The growing success of Machine Learning (ML) is bringing significant improvements to predictive models, facilitating their integration in various application fields. Despite this growing success, there are limitations and disadvantages: the most significant is the lack of interpretability, which does not allow users to understand how particular decisions are made. Our study focuses on one of the best-performing and most widely used models in the Machine Learning framework, the Random Forest model. It is known as an efficient ensemble learning model, as it ensures high predictive precision, flexibility, and immediacy; its construction process is recognized as intuitive and understandable, but it is also considered a Black Box model due to the large number of deep decision trees produced within it.
The aim of this research is twofold. We present a survey of interpretative proposals for Random Forests, and we then perform a machine learning experiment comparing two methodologies, inTrees and NodeHarvest, which represent the main approaches in the rule-extraction framework. The proposed experiment compares the methods' performance on six real datasets covering different data characteristics: the number of observations, balanced/unbalanced response, and the presence of categorical and numerical predictors. This study contributes a review of the methods and tools proposed for ensemble-tree interpretation and identifies, within the class of rule-extraction approaches, the best proposal.
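The common first step of rule-extraction approaches such as inTrees is enumerating a tree's root-to-leaf paths as condition/label rules. A minimal sketch over a hand-rolled tree encoding (the tuple format below is an assumption made for the example, not either tool's actual data structure):

```python
# A node is either a leaf ("label", y) or a split
# ("split", feature, threshold, left, right), where the left child
# covers feature <= threshold and the right child covers feature > threshold.
def extract_rules(node, conditions=()):
    """Enumerate root-to-leaf paths of one tree as (conditions, label) rules."""
    kind = node[0]
    if kind == "label":
        return [(list(conditions), node[1])]
    _, feat, thr, left, right = node
    rules = extract_rules(left, conditions + ((feat, "<=", thr),))
    rules += extract_rules(right, conditions + ((feat, ">", thr),))
    return rules
```

Approaches in this class then diverge in how they score, prune, and select among the rules collected from all trees of the forest.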