On-line sampling methods for discovering association rules
Association rule discovery is one of the prototypical problems in
data mining. In this problem, the input database is assumed to be very
large and most of the algorithms are designed to minimize the number
of scans of the database. Enumerating association rules is usually an
expensive task due to the size of the input database. A proposed
approach for reducing the running time of this process is random
sampling. Of course, any implementation of
an algorithm that uses sampling must solve the problem of determining
which sample size is appropriate. Previous research on sampling for
association rule mining has approached this problem and concluded that,
in general, the theoretically obtained sample size bounds are far from
what is observed in practice. In this paper, we try to reduce this
gap between theory and practice. We propose two on-line sampling
algorithms for association rule mining. Our algorithms maintain the
same theoretical guarantees of previous approaches while using a much
smaller number of transactions in most of the cases. In the experiments
we report, this improvement is often by an order of magnitude.
Postprint (published version
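The kind of theoretical sample size bound the abstract refers to can be illustrated with a textbook calculation. The sketch below is not the paper's on-line algorithm; it assumes transactions are sampled independently with replacement and uses Hoeffding's inequality to bound the error on the support of a single, fixed itemset. The function name `hoeffding_sample_size` and the parameters `epsilon` and `delta` are illustrative choices, not taken from the paper.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Number of sampled transactions so that the estimated support of ONE
    fixed itemset is within `epsilon` of its true support with probability
    at least 1 - delta.

    Assumption: transactions are drawn i.i.d. (with replacement), so
    Hoeffding's inequality gives P(|error| >= epsilon) <= 2*exp(-2*n*epsilon^2).
    Solving 2*exp(-2*n*epsilon^2) <= delta for n yields the bound below.
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Halving epsilon roughly quadruples the required sample size.
    for eps in (0.02, 0.01, 0.005):
        print(eps, hoeffding_sample_size(eps, delta=0.05))
```

A static bound like this is fixed before any data is seen and depends only on `epsilon` and `delta`, which is one reason such bounds can be much larger than what turns out to be needed in practice.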
Some Aspects on Data Modelling
Statistical methods are motivated by the desire to learn from data. Transaction datasets and time-ordered data sequences are commonly found in many research areas, such as finance, bioinformatics and text mining. In this dissertation, two problems concerning these two types of data are studied: association rule mining from transaction data, and structural change estimation in time-ordered sequences.
Informative association rule mining is fundamental for knowledge discovery from transaction data, for which brute-force search algorithms, e.g., the well-known Apriori algorithm, were developed. However, running these algorithms becomes computationally intractable when the rule space to be searched is large. A stochastic search framework is developed to tackle this challenge by imposing a probability distribution on the association rule space and using the idea of annealing Gibbs sampling. With this algorithm, a rule space of exponential order can still be searched randomly, generating a Markov chain of viable length. This chain contains the most informative rules with probability one. The stochastic search algorithm is flexible enough to incorporate any measure of interest. Moreover, it reduces computational complexity and memory requirements.
A time-ordered data sequence may contain sudden changes at certain time points, before and after which the data follow different distributions or statistical models. Change point problems in generalized linear models and in distributions of independent random variables are studied in turn. First, to estimate multiple change points in generalized linear models, we convert the problem into a model selection problem. Then modern model selection techniques are applied to estimate the regression coefficients. A consistent estimator of the number of change points is developed, and an algorithm is provided to estimate the change points. Second, to estimate a single change point in distributions of independent random variables, a change point estimator is proposed based on empirical characteristic functions. Its consistency is also established.
Investigation of discovering rules from data.
by Ng, King Kwok.
Thesis submitted in: December 1999.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2000.
Includes bibliographical references (leaves 99-104).
Abstracts in English and Chinese.
Acknowledgments
Abstract
Chapter 1 --- Introduction
Chapter 1.1 --- Data Mining and Rule Discovery
Chapter 1.1.1 --- Association Rule
Chapter 1.1.2 --- Sequential Pattern
Chapter 1.1.3 --- Dependence Rule
Chapter 1.2 --- Association Rule Mining
Chapter 1.3 --- Contributions
Chapter 1.4 --- Outline of the Thesis
Chapter 2 --- Related Work on Association Rule Mining
Chapter 2.1 --- Batch Algorithms
Chapter 2.1.1 --- The Apriori Algorithm
Chapter 2.1.2 --- The DIC Algorithm
Chapter 2.1.3 --- The Partition Algorithm
Chapter 2.1.4 --- The Sampling Algorithm
Chapter 2.2 --- Incremental Association Rule Mining
Chapter 2.2.1 --- The FUP Algorithm
Chapter 2.2.2 --- The FUP2 Algorithm
Chapter 2.2.3 --- The FUP* Algorithm
Chapter 2.2.4 --- The Negative Border Method
Chapter 2.2.5 --- Limitations of Existing Incremental Association Rule Mining Algorithms
Chapter 3 --- A New Incremental Association Rule Mining Approach
Chapter 3.1 --- Outline for the Proposed Approach
Chapter 3.2 --- Our New Approach
Chapter 3.2.1 --- The IDIC_M Algorithm
Chapter 3.2.2 --- A Variant Algorithm: The IDIC_S Algorithm
Chapter 3.3 --- Performance Evaluation of Our Approach
Chapter 3.3.1 --- Experimental Results for Algorithm IDIC_M
Chapter 3.3.2 --- Experimental Results for Algorithm IDIC_S
Chapter 3.4 --- Discussion
Chapter 4 --- Related Work on Multiple-Level AR and Belief-Driven Mining
Chapter 4.1 --- Background on Multiple-Level Association Rules
Chapter 4.2 --- Related Work on Multiple-Level Association Rules
Chapter 4.2.1 --- The Basic Algorithm
Chapter 4.2.2 --- The Cumulate Algorithm
Chapter 4.2.3 --- The EstMerge Algorithm
Chapter 4.2.4 --- Using Hierarchy-Information Encoded Transaction Table
Chapter 4.3 --- Background on Rule Mining in the Presence of User Belief
Chapter 4.4 --- Related Work on Rule Mining in the Presence of User Belief
Chapter 4.4.1 --- Post-Analysis of Learned Rules
Chapter 4.4.2 --- Using General Impressions to Analyze Discovered Classification Rules
Chapter 4.4.3 --- A Belief-Driven Method for Discovering Unexpected Patterns
Chapter 4.4.4 --- Constraint-Based Rule Mining
Chapter 4.5 --- Limitations of Existing Approaches
Chapter 5 --- Multiple-Level Association Rules Mining in the Presence of User Belief
Chapter 5.1 --- User Belief Under Taxonomy
Chapter 5.2 --- Formal Definition of Rule Interestingness
Chapter 5.3 --- The MARUB_E Mining Algorithm
Chapter 6 --- Experiments on MARUB_E
Chapter 6.1 --- Preliminary Experiments
Chapter 6.2 --- Experiments on Synthetic Data
Chapter 6.3 --- Experiments on Real Data
Chapter 7 --- Dealing with Vague Belief of User
Chapter 7.1 --- User Belief Under Taxonomy
Chapter 7.2 --- Relationship with Constraint-Based Rule Mining
Chapter 7.3 --- Formal Definition of Rule Interestingness
Chapter 7.4 --- The MARUB_V Mining Algorithm
Chapter 8 --- Experiments on MARUB_V
Chapter 8.1 --- Preliminary Experiments
Chapter 8.1.1 --- Experiments on Synthetic Data
Chapter 8.1.2 --- Experiments on Real Data
Chapter 9 --- Conclusions and Future Work
Chapter 9.1 --- Conclusions
Chapter 9.2 --- Future Work
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association
Rules (AR's) are fundamental primitives in data mining and database
applications. Exact algorithms for these problems exist and are widely used,
but their running time is hindered by the need of scanning the entire dataset,
possibly multiple times. High quality approximations of FI's and AR's are
sufficient for most practical uses, and a number of recent works explored the
application of sampling for fast discovery of approximate solutions to the
problems. However, these works do not provide satisfactory performance
guarantees on the quality of the approximation, due to the difficulty of
bounding the probability of under- or over-sampling any one of an unknown
number of frequent itemsets. In this work we circumvent this issue by applying
the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop
a novel technique for providing tight bounds on the sample size that guarantees
approximation within user-specified parameters. Our technique applies both to
absolute and to relative approximations of (top-K) FI's and AR's. The
resulting sample size is linearly dependent on the VC-dimension of a range
space associated with the dataset to be mined. The main theoretical
contribution of this work is a proof that the VC-dimension of this range space
is upper bounded by an easy-to-compute characteristic quantity of the dataset
which we call d-index, and is the maximum integer d such that the
dataset contains at least d transactions of length at least d such that no
one of them is a superset of or equal to another. We show that this bound is
strict for a large class of datasets.
Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the
proceedings of ECML PKDD 201
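The d-index definition in the abstract can be made concrete with a small sketch. The code below is an illustration written for this listing, not the authors' implementation: `d_index_upper_bound` drops the "no transaction contains another" condition and so only yields an upper bound computable from transaction lengths, while `d_index_brute_force` checks the full definition by exhaustive search and is only practical for toy datasets. All function names are hypothetical.

```python
from itertools import combinations


def is_antichain(sets):
    """True if no set in the collection is a subset of (or equal to) another."""
    return all(not (a <= b or b <= a) for a, b in combinations(sets, 2))


def d_index_upper_bound(transactions):
    """Largest d such that at least d DISTINCT transactions have length >= d.

    This ignores the antichain condition, so it can only overestimate
    the d-index (it is computed like an h-index over transaction lengths).
    """
    distinct = {frozenset(t) for t in transactions}
    lengths = sorted((len(t) for t in distinct), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length < rank:
            break
        d = rank
    return d


def d_index_brute_force(transactions):
    """Exact d-index by exhaustive search; exponential time, toy inputs only."""
    distinct = {frozenset(t) for t in transactions}
    for d in range(d_index_upper_bound(transactions), 0, -1):
        candidates = [t for t in distinct if len(t) >= d]
        if any(is_antichain(group) for group in combinations(candidates, d)):
            return d
    return 0


if __name__ == "__main__":
    chain = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
    print(d_index_upper_bound(chain), d_index_brute_force(chain))
```

On the nested dataset in the usage example the length-only bound and the exact value differ, because the two transactions of length at least 2 contain one another and so cannot both be counted.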
Too Trivial To Test? An Inverse View on Defect Prediction to Identify Methods with Low Fault Risk
Background. Test resources are usually limited and therefore it is often not
possible to completely test an application before a release. To cope with the
problem of scarce resources, development teams can apply defect prediction to
identify fault-prone code regions. However, defect prediction tends to have low
precision in cross-project prediction scenarios.
Aims. We take an inverse view on defect prediction and aim to identify
methods that can be deferred when testing because they contain hardly any
faults due to their code being "trivial". We expect that characteristics of
such methods might be project-independent, so that our approach could improve
cross-project predictions.
Method. We compute code metrics and apply association rule mining to create
rules for identifying methods with low fault risk. We conduct an empirical
study to assess our approach with six Java open-source projects containing
precise fault data at the method level.
Results. Our results show that inverse defect prediction can identify approx.
32-44% of the methods of a project as having a low fault risk; on average, they
are about six times less likely to contain a fault than other methods. In
cross-project predictions with larger, more diversified training sets,
identified methods are even eleven times less likely to contain a fault.
Conclusions. Inverse defect prediction supports the efficient allocation of
test resources by identifying methods that can be treated with less priority in
testing activities and is readily applicable in cross-project prediction
scenarios.
Comment: Submitted to PeerJ C
Evolving temporal fuzzy association rules from quantitative data with a multi-objective evolutionary algorithm
A novel method for mining association rules that are both quantitative and temporal using a multi-objective evolutionary algorithm is presented. This method successfully identifies numerous temporal association rules that occur more frequently in areas of a dataset with specific quantitative values represented with fuzzy sets. The novelty of this research lies in exploring the composition of quantitative and temporal fuzzy association rules and in hybridising a multi-objective evolutionary algorithm with fuzzy sets. Results show the ability of a multi-objective evolutionary algorithm (NSGA-II) to evolve multiple target itemsets that have been augmented into synthetic datasets.