101,811 research outputs found

On-line sampling methods for discovering association rules

Author: Domingo Soriano Carlos
Gavaldà Mestre Ricard
Watanabe Osamu
Publication date: 01/01/1999

Association rule discovery is one of the prototypical problems in data mining. In this problem, the input database is assumed to be very large, and most algorithms are designed to minimize the number of scans of the database. Enumerating association rules is usually an expensive task due to the size of the input database. One proposed approach for reducing the running time of this process is random sampling. Of course, any implementation of an algorithm that uses sampling must solve the problem of determining which sample size is appropriate. Previous research on sampling for association rule mining has approached this problem and concluded that, in general, the theoretically obtained sample size bounds are far from what is observed in practice. In this paper, we try to reduce this gap between theory and practice. We propose two on-line sampling algorithms for association rule mining. Our algorithms maintain the same theoretical guarantees as previous approaches while using a much smaller number of transactions in most cases. In the experiments we report, this improvement is often by an order of magnitude.

Postprint (published version)
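To make the sample-size question in this abstract concrete, here is a minimal sketch of a static worst-case bound. It assumes Hoeffding's inequality applied to a single itemset and sampling with replacement. This is not the on-line algorithms the authors propose; the function names and the parameters `epsilon` and `delta` are illustrative assumptions.

```python
import math
import random


def hoeffding_sample_size(epsilon, delta):
    """Sample size from Hoeffding's inequality for ONE itemset.

    With this many transactions drawn independently, the sampled support
    differs from the true support by more than epsilon with probability
    at most delta. (Illustrative static bound, not the paper's method.)
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def sampled_support(transactions, itemset, n, rng):
    """Estimate the support of `itemset` from n transactions drawn with replacement."""
    hits = 0
    for _ in range(n):
        transaction = rng.choice(transactions)
        if itemset <= transaction:  # subset test: the transaction contains the itemset
            hits += 1
    return hits / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy database of transactions (sets of item ids).
    database = [frozenset(rng.sample(range(10), 4)) for _ in range(1000)]
    n = hoeffding_sample_size(epsilon=0.05, delta=0.05)
    print(n, sampled_support(database, frozenset({1, 2}), n, rng))
```

The bound depends only on `epsilon` and `delta`, not on the data. Bounds of this data-independent kind tend to be pessimistic, which is the gap the abstract says an adaptive, on-line procedure tries to close.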

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Some Aspects on Data Modelling

Author: Sun Xiaoying
Publication date: 01/03/2018

Statistical methods are motivated by the desire to learn from data. Transaction datasets and time-ordered data sequences are commonly found in many research areas, such as finance, bioinformatics and text mining. In this dissertation, two problems regarding these two types of data are studied: association rule mining from transaction data and structural change estimation in time-ordered sequences. Informative association rule mining is fundamental for knowledge discovery from transaction data, for which brute-force search algorithms, e.g., the well-known Apriori algorithm, were developed. However, running these algorithms becomes computationally intractable when searching a large rule space. A stochastic search framework is developed to tackle this challenge by imposing a probability distribution on the association rule space and using the idea of annealing Gibbs sampling. A large rule space of exponential order can still be randomly searched by this algorithm to generate a Markov chain of viable length. This chain contains the most informative rules with probability one. The stochastic search algorithm is flexible enough to incorporate any measure of interest. Moreover, it reduces computational complexity and memory requirements. A time-ordered data sequence may contain sudden changes at some time points, before and after which the data sequences follow different distributions or statistical models. Change point problems in generalized linear models and in distributions of independent random variables are studied in turn. Firstly, to estimate multiple change points in generalized linear models, we convert the task into a model selection problem. Modern model selection techniques are then applied to estimate the regression coefficients. A consistent estimator of the number of change points is developed, and an algorithm is provided to estimate the change points. Secondly, to estimate a single change point in distributions of independent random variables, a change point estimator is proposed based on empirical characteristic functions. Its consistency is also established.

YorkSpace

Investigation of discovering rules from data.

Author: Ng King Kwok
Publication date: 01/01/2000

by Ng, King Kwok.
Thesis submitted in: December 1999.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2000.
Includes bibliographical references (leaves 99-104).
Abstracts in English and Chinese.

Acknowledgments
Abstract
Chapter 1. Introduction
1.1 Data Mining and Rule Discovery
1.1.1 Association Rule
1.1.2 Sequential Pattern
1.1.3 Dependence Rule
1.2 Association Rule Mining
1.3 Contributions
1.4 Outline of the Thesis
Chapter 2. Related Work on Association Rule Mining
2.1 Batch Algorithms
2.1.1 The Apriori Algorithm
2.1.2 The DIC Algorithm
2.1.3 The Partition Algorithm
2.1.4 The Sampling Algorithm
2.2 Incremental Association Rule Mining
2.2.1 The FUP Algorithm
2.2.2 The FUP2 Algorithm
2.2.3 The FUP* Algorithm
2.2.4 The Negative Border Method
2.2.5 Limitations of Existing Incremental Association Rule Mining Algorithms
Chapter 3. A New Incremental Association Rule Mining Approach
3.1 Outline for the Proposed Approach
3.2 Our New Approach
3.2.1 The IDIC_M Algorithm
3.2.2 A Variant Algorithm: The IDIC_S Algorithm
3.3 Performance Evaluation of Our Approach
3.3.1 Experimental Results for Algorithm IDIC_M
3.3.2 Experimental Results for Algorithm IDIC_S
3.4 Discussion
Chapter 4. Related Work on Multiple-Level AR and Belief-Driven Mining
4.1 Background on Multiple-Level Association Rules
4.2 Related Work on Multiple-Level Association Rules
4.2.1 The Basic Algorithm
4.2.2 The Cumulate Algorithm
4.2.3 The EstMerge Algorithm
4.2.4 Using Hierarchy-Information Encoded Transaction Table
4.3 Background on Rule Mining in the Presence of User Belief
4.4 Related Work on Rule Mining in the Presence of User Belief
4.4.1 Post-Analysis of Learned Rules
4.4.2 Using General Impressions to Analyze Discovered Classification Rules
4.4.3 A Belief-Driven Method for Discovering Unexpected Patterns
4.4.4 Constraint-Based Rule Mining
4.5 Limitations of Existing Approaches
Chapter 5. Multiple-Level Association Rules Mining in the Presence of User Belief
5.1 User Belief Under Taxonomy
5.2 Formal Definition of Rule Interestingness
5.3 The MARUB_E Mining Algorithm
Chapter 6. Experiments on MARUB_E
6.1 Preliminary Experiments
6.2 Experiments on Synthetic Data
6.3 Experiments on Real Data
Chapter 7. Dealing with Vague Belief of User
7.1 User Belief Under Taxonomy
7.2 Relationship with Constraint-Based Rule Mining
7.3 Formal Definition of Rule Interestingness
7.4 The MARUB_V Mining Algorithm
Chapter 8. Experiments on MARUB_V
8.1 Preliminary Experiments
8.1.1 Experiments on Synthetic Data
8.1.2 Experiments on Real Data
Chapter 9. Conclusions and Future Work
9.1 Conclusions
9.2 Future Work

CUHK Digital Repository

Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

Author: Riondato Matteo
Upfal Eli
Publication date: 22/02/2013

The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call d-index, and is the maximum integer d such that the dataset contains at least d transactions of length at least d such that no one of them is a superset of or equal to another. We show that this bound is strict for a large class of datasets.

Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the proceedings of ECML PKDD 201
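The d-index definition in this abstract can be illustrated with a short sketch. The two functions below are illustrative assumptions, not the authors' code. The first computes a simple upper bound by dropping the no-superset condition and counting only distinct transactions. The second computes the d-index exactly by exhaustive search, which is only feasible for very small datasets.

```python
from itertools import combinations


def d_index_upper_bound(transactions):
    """Largest d such that at least d distinct transactions have length >= d.

    Ignoring the no-superset condition can only enlarge the feasible set,
    so this value is never smaller than the d-index.
    """
    distinct = set(map(frozenset, transactions))
    lengths = sorted((len(t) for t in distinct), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            d = rank
        else:
            break
    return d


def d_index_brute_force(transactions):
    """Exact d-index by exhaustive search (exponential time, toy inputs only)."""
    distinct = set(map(frozenset, transactions))
    for d in range(d_index_upper_bound(transactions), 0, -1):
        candidates = [t for t in distinct if len(t) >= d]
        for group in combinations(candidates, d):
            # Accept the group only if no transaction contains another one.
            if all(not (a <= b or b <= a) for a, b in combinations(group, 2)):
                return d
    return 0


if __name__ == "__main__":
    # Hypothetical toy dataset: each transaction is a set of item ids.
    data = [{1, 2, 3}, {1, 2}, {2, 3}, {1}]
    print(d_index_upper_bound(data), d_index_brute_force(data))
```

On a dataset where every long transaction contains the shorter ones, the two values differ: the upper bound still counts the nested transactions, while the exact search rejects them.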

arXiv.org e-Print Archive

CiteSeerX

Too Trivial To Test? An Inverse View on Defect Prediction to Identify Methods with Low Fault Risk

Author: Niedermayr Rainer
Röhm Tobias
Wagner Stefan
Publication venue: 'PeerJ'
Publication date: 02/11/2018

Background. Test resources are usually limited, and it is therefore often not possible to test an application completely before a release. To cope with the problem of scarce resources, development teams can apply defect prediction to identify fault-prone code regions. However, defect prediction tends to have low precision in cross-project prediction scenarios. Aims. We take an inverse view on defect prediction and aim to identify methods that can be deferred when testing because they contain hardly any faults due to their code being "trivial". We expect that the characteristics of such methods might be project-independent, so that our approach could improve cross-project predictions. Method. We compute code metrics and apply association rule mining to create rules for identifying methods with low fault risk. We conduct an empirical study to assess our approach with six Java open-source projects containing precise fault data at the method level. Results. Our results show that inverse defect prediction can identify approx. 32-44% of the methods of a project as having a low fault risk; on average, they are about six times less likely to contain a fault than other methods. In cross-project predictions with larger, more diversified training sets, identified methods are even eleven times less likely to contain a fault. Conclusions. Inverse defect prediction supports the efficient allocation of test resources by identifying methods that can be treated with lower priority in testing activities, and it is well applicable in cross-project prediction scenarios.

Comment: Submitted to PeerJ C
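As a rough illustration of how an association rule over code metrics could be scored, the sketch below computes the standard support and confidence of a rule on method-level records. The item names (`no_loops`, `short`, `not_faulty`, and so on) and the function `rule_stats` are hypothetical and are not taken from the study.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of records containing antecedent and consequent
    confidence = fraction of antecedent-matching records that also
                 contain the consequent
    """
    total = len(records)
    matching_antecedent = [r for r in records if antecedent <= r]
    matching_both = [r for r in matching_antecedent if consequent <= r]
    support = len(matching_both) / total if total else 0.0
    confidence = (
        len(matching_both) / len(matching_antecedent)
        if matching_antecedent
        else 0.0
    )
    return support, confidence


if __name__ == "__main__":
    # Hypothetical method-level records: discretized metrics plus a fault label.
    methods = [
        frozenset({"no_loops", "short", "not_faulty"}),
        frozenset({"no_loops", "short", "not_faulty"}),
        frozenset({"no_loops", "short", "faulty"}),
        frozenset({"has_loops", "faulty"}),
    ]
    print(rule_stats(methods, frozenset({"no_loops", "short"}), frozenset({"not_faulty"})))
```

A rule with high confidence for a "not faulty" consequent would mark the matching methods as candidates for lower testing priority, which is the inverse view the abstract describes.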

arXiv.org e-Print Archive

Directory of Open Access Journals

Evolving temporal fuzzy association rules from quantitative data with a multi-objective evolutionary algorithm

Author: C. Carmona
C.A.C. Coello
E. Corchado
K. Deb
M. Kaya
S.G. Matthews
T.-P. Hong
Y. Li
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

A novel method for mining association rules that are both quantitative and temporal using a multi-objective evolutionary algorithm is presented. This method successfully identifies numerous temporal association rules that occur more frequently in areas of a dataset with specific quantitative values represented with fuzzy sets. The novelty of this research lies in exploring the composition of quantitative and temporal fuzzy association rules and in hybridising a multi-objective evolutionary algorithm with fuzzy sets. Results show the ability of a multi-objective evolutionary algorithm (NSGA-II) to evolve multiple target itemsets that have been augmented into synthetic datasets.

CiteSeerX

Crossref

Sheffield Hallam University Research Archive

De Montfort University Open Research Archive

Open Repository and Bibliography - Liège

Explore Bristol Research