
    Computing Multi-Relational Sufficient Statistics for Large Databases

    Databases contain information about which relationships do and do not hold among entities. To make this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra, a new extension of relational algebra, facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning. Comment: 11 pages, 8 figures, 8 tables, CIKM'14, November 3-7, 2014, Shanghai, China.
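
    To make the counting trick concrete, here is a minimal sketch of the inclusion-exclusion idea behind such a virtual join, for a single relationship and one entity attribute. The schema, names, and data are illustrative assumptions, not the paper's contingency table algebra.

```python
# Toy schema: two entity tables and one relationship table, stored as
# plain Python structures (all names and data here are illustrative).
users = {"u1": "young", "u2": "young", "u3": "old"}  # user -> age group
items = {"i1", "i2"}
likes = {("u1", "i1"), ("u2", "i2")}                 # positive links only

# Sufficient statistic over the *positive* join:
# count(age = young, likes = true), computed from the link table alone.
n_pos_young = sum(1 for u, _ in likes if users[u] == "young")

# The Mobius/inclusion-exclusion step: the negative-relationship count
# count(age = young, likes = false) is derived arithmetically, without
# ever materializing the table of non-links.
n_young = sum(1 for age in users.values() if age == "young")
n_neg_young = n_young * len(items) - n_pos_young

print(n_pos_young, n_neg_young)  # -> 2 2
```

    In essence, applying this subtraction level by level over subsets of relationships is what allows counts for every positive/negative combination to be filled in from positive-only joins.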

    Interpretable multiclass classification by MDL-based rule lists

    Interpretable classifiers have recently witnessed an increase in attention from the data mining community because they are inherently easier to understand and explain than their more complex counterparts. Examples of interpretable classification models include decision trees, rule sets, and rule lists. Learning such models often involves optimizing hyperparameters, which typically requires substantial amounts of data and may result in relatively large models. In this paper, we consider the problem of learning compact yet accurate probabilistic rule lists for multiclass classification. Specifically, we propose a novel formalization based on probabilistic rule lists and the minimum description length (MDL) principle. This results in virtually parameter-free model selection that naturally trades off model complexity against goodness of fit, by which overfitting and the need for hyperparameter tuning are effectively avoided. Finally, we introduce the Classy algorithm, which greedily finds rule lists according to the proposed criterion. We empirically demonstrate that Classy selects small probabilistic rule lists that outperform state-of-the-art classifiers in the combination of predictive performance and interpretability. We show that Classy is insensitive to its only parameter, the candidate set, and that compression on the training set correlates with classification performance, validating our MDL-based selection criterion.
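
    As a rough illustration of the selection criterion, the sketch below computes a two-part MDL score, L(model) + L(data | model), for a candidate rule list, encoding the covered labels with their maximum-likelihood class distribution. The code-length definitions, names, and data are simplified assumptions, not Classy's exact encoding.

```python
import math
from collections import Counter

def data_cost(class_counts):
    """Code length (bits) of the labels covered by one rule, using the
    plug-in (maximum-likelihood) class distribution of that rule."""
    n = sum(class_counts.values())
    return -sum(c * math.log2(c / n) for c in class_counts.values() if c)

def model_cost(rule, n_features):
    """Crude model code length: bits to identify each condition."""
    return len(rule) * math.log2(n_features)

def mdl_score(rule_list, covered_counts, n_features):
    """Total description length = L(model) + L(data | model);
    the rule list with the smallest score is preferred."""
    return (sum(model_cost(r, n_features) for r in rule_list)
            + sum(data_cost(c) for c in covered_counts))

# Example: two rules (lists of conditions), the class counts of the
# examples each rule covers, and a dataset with 10 features.
rules = [[("x3", ">", 0.5)], [("x1", "==", 1), ("x7", "<", 2)]]
counts = [Counter(a=8, b=2), Counter(a=1, b=9)]
print(mdl_score(rules, counts, n_features=10))  # ~ 21.9 bits
```

    Adding a rule is worthwhile under this score only when the extra model bits are repaid by a larger saving in the data encoding, which is how complexity and fit are traded off without a tuned penalty parameter.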

    An enhanced intelligent database engine by neural network and data mining

    An Intelligent Database Engine (IDE) is developed to solve any classification problem by providing two integrated features: decision-making by a backpropagation (BP) neural network (NN) and decision support by Apriori, a data mining (DM) algorithm. Previous experimental results show the accuracies of the NN (90%) and DM (60%) components to be drastically different, so improving DM accuracy is crucial to ensure a well-balanced hybrid architecture. The poor DM performance is caused either by too few rules or by too many poor rules generated in the classifier. The first problem is curbed by generating multiple-level rules, incorporating multiple attribute support and level confidence into the initial Apriori. The second problem is tackled by implementing two strengthening procedures, confidence and Bayes verification, to filter out unpredictive rules. Experiments with additional datasets are carried out to compare the performance of the initial and improved Apriori; great improvement is obtained for the latter.
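
    The sketch below illustrates how the two strengthening procedures might filter mined rules: a rule survives only if its confidence clears a threshold and a simple Bayes check shows its antecedent raising the class probability above the prior. The threshold, data, and the precise form of the Bayes test are assumptions for illustration, not the paper's exact procedure.

```python
def confidence(rule, data):
    """Estimated P(class | antecedent) over labelled transactions."""
    antecedent, cls = rule
    covered = [label for items, label in data if antecedent <= items]
    return covered.count(cls) / len(covered) if covered else 0.0

def bayes_verified(rule, data):
    """Keep a rule only if its antecedent raises the class probability
    above the class prior: P(class | antecedent) > P(class)."""
    _, cls = rule
    prior = sum(1 for _, label in data if label == cls) / len(data)
    return confidence(rule, data) > prior

def strengthen(rules, data, min_conf=0.6):
    """Apply both filters; rules failing either one are discarded."""
    return [r for r in rules
            if confidence(r, data) >= min_conf and bayes_verified(r, data)]

# Toy labelled transactions: (set of items, class label).
data = [({"a", "b"}, "yes"), ({"a"}, "no"),
        ({"a", "b"}, "yes"), ({"b"}, "no")]
rules = [({"a", "b"}, "yes"), ({"b"}, "no")]
print(strengthen(rules, data))  # only the ({'a','b'} -> yes) rule survives
```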

    Medical data mining using evolutionary computation.

    by Ngan Po Shun. Thesis (M.Phil.), Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 109-115). Abstract also in Chinese.

    Contents:
    Chapter 1: Introduction
        1.1 Data Mining
        1.2 Motivation
        1.3 Contributions of the research
        1.4 Organization of the thesis
    Chapter 2: Related Work in Data Mining
        2.1 Decision Tree Approach
            2.1.1 ID3
            2.1.2 C4.5
        2.2 Classification Rule Learning
            2.2.1 AQ algorithm
            2.2.2 CN2
            2.2.3 C4.5RULES
        2.3 Association Rule Mining
            2.3.1 Apriori
            2.3.2 Quantitative Association Rule Mining
        2.4 Statistical Approach
            2.4.1 Chi Square Test and Bayesian Classifier
            2.4.2 FORTY-NINER
            2.4.3 EXPLORA
        2.5 Bayesian Network Learning
            2.5.1 Learning Bayesian Networks using the Minimum Description Length (MDL) Principle
            2.5.2 Discretizing Continuous Attributes while Learning Bayesian Networks
    Chapter 3: Overview of Evolutionary Computation
        3.1 Evolutionary Computation
            3.1.1 Genetic Algorithm
            3.1.2 Genetic Programming
            3.1.3 Evolutionary Programming
            3.1.4 Evolution Strategy
            3.1.5 Selection Methods
        3.2 Generic Genetic Programming
        3.3 Data Mining using Evolutionary Computation
    Chapter 4: Applying Generic Genetic Programming for Rule Learning
        4.1 Grammar
        4.2 Population Creation
        4.3 Genetic Operators
        4.4 Evaluation of Rules
    Chapter 5: Learning Multiple Rules from Data
        5.1 Previous approaches
            5.1.1 Preselection
            5.1.2 Crowding
            5.1.3 Deterministic Crowding
            5.1.4 Fitness sharing
        5.2 Token Competition
        5.3 The Complete Rule Learning Approach
        5.4 Experiments with Machine Learning Databases
            5.4.1 Experimental results on the Iris Plant Database
            5.4.2 Experimental results on the Monk Database
    Chapter 6: Bayesian Network Learning
        6.1 The MDLEP Learning Approach
        6.2 Learning of Discretization Policy by Genetic Algorithm
            6.2.1 Individual Representation
            6.2.2 Genetic Operators
        6.3 Experimental Results
            6.3.1 Experiment 1
            6.3.2 Experiment 2
            6.3.3 Experiment 3
            6.3.4 Comparison between the GA approach and the greedy approach
    Chapter 7: Medical Data Mining System
        7.1 A Case Study on the Fracture Database
            7.1.1 Results of Causality and Structure Analysis
            7.1.2 Results of Rule Learning
        7.2 A Case Study on the Scoliosis Database
            7.2.1 Results of Causality and Structure Analysis
            7.2.2 Results of Rule Learning
    Chapter 8: Conclusion and Future Work
    Bibliography
    Appendix A: The Rule Sets Discovered
        A.1 The Best Rule Set Learned from the Iris Database
        A.2 The Best Rule Set Learned from the Monk Database
            A.2.1 Monk1
            A.2.2 Monk2
            A.2.3 Monk3
        A.3 The Best Rule Set Learned from the Fracture Database
            A.3.1 Type I Rules: About Diagnosis
            A.3.2 Type II Rules: About Operation/Surgeon
            A.3.3 Type III Rules: About Stay
        A.4 The Best Rule Set Learned from the Scoliosis Database
            A.4.1 Rules for Classification
            A.4.2 Rules for Treatment
    Appendix B: The Grammar used for the Fracture and Scoliosis Databases
        B.1 The grammar for the fracture database
        B.2 The grammar for the Scoliosis database
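
    Among the techniques listed above, token competition (Section 5.2) is the mechanism for learning a set of diverse rules in a single evolutionary run. The sketch below follows the commonly described scheme: each training example carries one token, rules claim tokens in fitness order, and rules that claim none are dropped as redundant. The function signatures, toy data, and the fitness adjustment are illustrative assumptions, not the thesis's exact implementation.

```python
def token_competition(rules, examples, fitness, covers):
    """rules: candidate rules; examples: training cases;
    fitness(rule) -> raw fitness; covers(rule, example) -> bool."""
    free = set(range(len(examples)))          # one token per example
    survivors = []
    for rule in sorted(rules, key=fitness, reverse=True):
        seized = {i for i in free if covers(rule, examples[i])}
        matched = sum(covers(rule, e) for e in examples)
        if seized:                            # rule wins some tokens
            free -= seized
            # penalize overlap: scale fitness by the share of tokens won
            adjusted = fitness(rule) * len(seized) / matched
            survivors.append((rule, adjusted))
    return survivors                          # token-less rules dropped

# Toy usage: rules are (lo, hi) intervals covering integer examples.
examples = [1, 2, 3, 4, 5]
rules = [(1, 3), (1, 5), (4, 5)]
fitness = lambda r: r[1] - r[0]               # wider rule = fitter here
covers = lambda r, x: r[0] <= x <= r[1]
print(token_competition(rules, examples, fitness, covers))
# -> [((1, 5), 4.0)]: the narrower rules seize no tokens and are removed
```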