91,114 research outputs found
Techniques for improving clustering and association rules mining from very large transactional databases
Clustering and association rules mining are two core data mining tasks that have been actively studied by the data mining community for nearly two decades. Though many clustering and association rules mining algorithms have been developed, no algorithm is better than the others in all aspects, such as accuracy, efficiency, scalability, adaptability and memory usage. While more efficient and effective algorithms need to be developed for handling large-scale and complex stored datasets, emerging applications where data takes the form of streams pose new challenges for the data mining community. The existing techniques and algorithms for static stored databases cannot be applied to data streams directly. They need to be extended or modified, or new methods need to be developed to process the data streams. In this thesis, algorithms have been developed for improving the efficiency and accuracy of clustering and association rules mining on very large, high dimensional, high cardinality, sparse transactional databases and data streams. A new similarity measure suitable for clustering transactional data is defined and an incremental clustering algorithm, INCLUS, is proposed using this similarity measure. The algorithm scans the database only once and produces clusters based on the user’s expectations of similarities between transactions in a cluster, which are controlled by two user input parameters: a similarity threshold and a support threshold. Intensive testing has been performed to evaluate the effectiveness, efficiency, scalability and order insensitivity of the algorithm. To extend INCLUS for transactional data streams, an equal-width time window model and an elastic time window model are proposed that allow mining of clustering changes in evolving data streams. The minimal width of the window is determined by the minimum clustering granularity for a particular application.
Two algorithms, CluStream_EQ and CluStream_EL, based on the equal-width window model and the elastic window model respectively, are developed by incorporating these models into INCLUS. Each algorithm consists of an online micro-clustering component and an offline macro-clustering component. The online component writes summary statistics of a data stream to the disk, and the offline component uses those summaries and other user input to discover changes in a data stream. The effectiveness and scalability of the algorithms are evaluated by experiments. This thesis also looks into sampling techniques that can improve the efficiency of mining association rules in a very large transactional database. The sample size is derived based on the binomial distribution and the central limit theorem. The sample size used is smaller than that based on Chernoff bounds, but still provides the same approximation guarantees. The accuracy of the proposed sampling approach is theoretically analyzed and its effectiveness is experimentally evaluated on both dense and sparse datasets. The application of stratified sampling to association rules mining is also explored in this thesis. The database is first partitioned into strata based on the length of transactions, and simple random sampling is then performed on each stratum. The total sample size is determined by a formula derived in this thesis and the sample size for each stratum is proportional to the size of the stratum. The accuracy of transaction-size-based stratified sampling is experimentally compared with that of random sampling. The thesis concludes with a summary of significant contributions and some pointers for further work.
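To make the sampling ideas in the abstract above concrete, here is a minimal Python sketch. It is not the thesis's derivation: the thesis's own sample-size formula is not given in the abstract, so `clt_sample_size` uses the textbook normal approximation to the binomial and `chernoff_sample_size` uses the standard additive Chernoff-Hoeffding bound, both as assumptions. `stratified_sample` illustrates the described procedure (strata by transaction length, proportional allocation, simple random sampling per stratum); all function and parameter names are hypothetical.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist


def clt_sample_size(eps, delta, p=0.5):
    """Normal-approximation sample size (assumed textbook formula): the sampled
    support of an itemset with true support p is within eps of p with
    probability about 1 - delta. p = 0.5 is the worst case."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    return math.ceil(z * z * p * (1 - p) / (eps * eps))


def chernoff_sample_size(eps, delta):
    """Additive Chernoff-Hoeffding sample size for the same eps and delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps * eps))


def stratified_sample(transactions, total_size, seed=0):
    """Partition transactions by length, then draw a simple random sample
    (without replacement) from each stratum, proportional to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in transactions:
        strata[len(t)].append(t)
    n = len(transactions)
    sample = []
    for length in sorted(strata):
        stratum = strata[length]
        # Proportional allocation; rounding may shift the total by a few units.
        k = min(len(stratum), round(total_size * len(stratum) / n))
        sample.extend(rng.sample(stratum, k))
    return sample
```

Under these assumed formulas, `clt_sample_size(0.01, 0.05)` is roughly half of `chernoff_sample_size(0.01, 0.05)`, which is consistent with the abstract's claim that the binomial/central-limit sample size is smaller than the Chernoff-based one.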
Mining Top-K Frequent Itemsets Through Progressive Sampling
We study the use of sampling for efficiently mining the top-K frequent
itemsets of cardinality at most w. To this purpose, we define an approximation
to the top-K frequent itemsets to be a family of itemsets which includes
(resp., excludes) all very frequent (resp., very infrequent) itemsets, together
with an estimate of these itemsets' frequencies with a bounded error. Our first
result is an upper bound on the sample size which guarantees that the top-K
frequent itemsets mined from a random sample of that size approximate the
actual top-K frequent itemsets, with probability larger than a specified value.
We show that the upper bound is asymptotically tight when w is constant. Our
main algorithmic contribution is a progressive sampling approach, combined with
suitable stopping conditions, which on appropriate inputs is able to extract
approximate top-K frequent itemsets from samples whose sizes are smaller than
the general upper bound. In order to test the stopping conditions, this
approach maintains the frequency of all itemsets encountered, which is
practical only for small w. However, we show how this problem can be mitigated
by using a variation of Bloom filters. A number of experiments conducted on
both synthetic and real benchmark datasets show that using samples
substantially smaller than the original dataset (i.e., of size defined by the
upper bound or reached through the progressive sampling approach) makes it
possible to approximate the actual top-K frequent itemsets with accuracy much
higher than what is analytically proved.
Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and
publication in the ECML PKDD 2010 special issue of the Data Mining and
Knowledge Discovery journal
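The progressive sampling idea described in the abstract above can be sketched in Python as follows. This is only an illustration under stated assumptions: the stopping rule used here (the top-K set and its frequencies stay stable between two consecutive sample sizes) is a simple stand-in, not the paper's stopping conditions, and it carries none of the paper's probabilistic guarantees. The names `top_k_itemsets`, `progressive_top_k`, `start` and `tol` are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(sample, k, w):
    """Return the k most frequent itemsets of cardinality at most w in the
    sample, mapped to their relative frequencies."""
    counts = Counter()
    for t in sample:
        items = sorted(set(t))
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    # Break ties deterministically by itemset.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return {itemset: c / len(sample) for itemset, c in ranked}


def progressive_top_k(dataset, k, w, start=100, tol=0.02, seed=0):
    """Grow a random sample (drawn with replacement) by doubling until the
    top-k estimate is stable, or fall back to mining the whole dataset."""
    rng = random.Random(seed)
    sample = []
    size = min(start, len(dataset))
    previous = None
    while True:
        sample.extend(rng.choice(dataset) for _ in range(size - len(sample)))
        current = top_k_itemsets(sample, k, w)
        # Heuristic stopping rule (an assumption, not the paper's condition).
        stable = (
            previous is not None
            and current.keys() == previous.keys()
            and all(abs(current[i] - previous[i]) <= tol for i in current)
        )
        if stable:
            return current, len(sample)
        if size >= len(dataset):
            return top_k_itemsets(dataset, k, w), len(dataset)
        previous, size = current, min(2 * size, len(dataset))
```

Note that `top_k_itemsets` counts every itemset of size up to `w` explicitly, which is exactly the cost the abstract calls practical only for small w.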
Too Trivial To Test? An Inverse View on Defect Prediction to Identify Methods with Low Fault Risk
Background. Test resources are usually limited and therefore it is often not
possible to completely test an application before a release. To cope with the
problem of scarce resources, development teams can apply defect prediction to
identify fault-prone code regions. However, defect prediction tends to have
low precision in cross-project prediction scenarios.
Aims. We take an inverse view on defect prediction and aim to identify
methods that can be deferred when testing because they contain hardly any
faults due to their code being "trivial". We expect that characteristics of
such methods might be project-independent, so that our approach could improve
cross-project predictions.
Method. We compute code metrics and apply association rule mining to create
rules for identifying methods with low fault risk. We conduct an empirical
study to assess our approach with six Java open-source projects containing
precise fault data at the method level.
Results. Our results show that inverse defect prediction can identify approx.
32-44% of the methods of a project to have a low fault risk; on average, they
are about six times less likely to contain a fault than other methods. In
cross-project predictions with larger, more diversified training sets,
identified methods are even eleven times less likely to contain a fault.
Conclusions. Inverse defect prediction supports the efficient allocation of
test resources by identifying methods that can be treated with less priority in
testing activities and is well applicable in cross-project prediction
scenarios.
Comment: Submitted to PeerJ C
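As a rough illustration of the method in the abstract above, the following Python sketch mines rules of the form "metric flags imply not faulty" by brute-force enumeration of small antecedents. It is a sketch under assumptions, not the authors' pipeline: the flag names in the usage note, the thresholds, and the function name `mine_low_risk_rules` are all hypothetical, and a real study would use a dedicated association rule miner.

```python
from itertools import combinations


def mine_low_risk_rules(methods, min_support=0.1, min_confidence=0.9, max_len=2):
    """methods: list of (set_of_metric_flags, is_faulty) pairs.
    Returns rules 'antecedent -> not faulty' as
    (antecedent, support, confidence) tuples."""
    n = len(methods)
    flags = sorted({f for feats, _ in methods for f in feats})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(flags, size):
            wanted = set(antecedent)
            # Fault labels of all methods that match the antecedent.
            covered = [faulty for feats, faulty in methods if wanted <= feats]
            if not covered:
                continue
            clean = covered.count(False)
            support = clean / n                # P(antecedent and not faulty)
            confidence = clean / len(covered)  # P(not faulty | antecedent)
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, support, confidence))
    return rules
```

For example, with methods described by hypothetical flags such as `"short"` or `"no_branches"`, a returned rule `(("no_branches",), 0.6, 1.0)` would mean that 60% of all methods have no branches and are fault-free, and that every method without branches in the data is fault-free.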
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association
Rules (AR's) are fundamental primitives in data mining and database
applications. Exact algorithms for these problems exist and are widely used,
but their running time is hindered by the need of scanning the entire dataset,
possibly multiple times. High quality approximations of FI's and AR's are
sufficient for most practical uses, and a number of recent works explored the
application of sampling for fast discovery of approximate solutions to the
problems. However, these works do not provide satisfactory performance
guarantees on the quality of the approximation, due to the difficulty of
bounding the probability of under- or over-sampling any one of an unknown
number of frequent itemsets. In this work we circumvent this issue by applying
the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop
a novel technique for providing tight bounds on the sample size that guarantees
approximation within user-specified parameters. Our technique applies both to
absolute and to relative approximations of (top-K) FI's and AR's. The
resulting sample size is linearly dependent on the VC-dimension of a range
space associated with the dataset to be mined. The main theoretical
contribution of this work is a proof that the VC-dimension of this range space
is upper bounded by an easy-to-compute characteristic quantity of the dataset
which we call \emph{d-index}, and is the maximum integer d such that the
dataset contains at least d transactions of length at least d such that no
one of them is a superset of or equal to another. We show that this bound is
strict for a large class of datasets.
Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the
proceedings of ECML PKDD 201
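The d-index described in the abstract above can be illustrated with a short Python sketch. Two hypothetical helpers are shown: `d_index_upper_bound` ignores the "no transaction is a superset of or equal to another" requirement except for removing duplicates, so it can only overestimate the d-index, while `d_index_exact` checks the full definition by brute force and is meant only for tiny datasets. Neither is claimed to be the paper's algorithm.

```python
from itertools import combinations


def d_index_upper_bound(transactions):
    """Largest d such that at least d distinct transactions have length >= d
    (an h-index-style computation that skips the antichain check)."""
    distinct = {frozenset(t) for t in transactions}
    lengths = sorted((len(t) for t in distinct), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length < rank:
            break
        d = rank
    return d


def is_antichain(sets):
    """True if no set in the collection contains (or equals) another."""
    return all(not (a <= b or b <= a) for a, b in combinations(sets, 2))


def d_index_exact(transactions):
    """Brute-force d-index: the largest d with d transactions of length >= d
    that form an antichain. Exponential time; for small examples only."""
    distinct = list({frozenset(t) for t in transactions})
    for d in range(d_index_upper_bound(transactions), 0, -1):
        candidates = [t for t in distinct if len(t) >= d]
        if any(is_antichain(c) for c in combinations(candidates, d)):
            return d
    return 0
```

On the chain `[{"a"}, {"a", "b"}, {"a", "b", "c"}]` the upper bound is 2 but the exact value is 1, because the two transactions of length at least 2 are nested.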