Computing Multi-Relational Sufficient Statistics for Large Databases
Databases contain information about which relationships do and do not hold
among entities. Making this information accessible for statistical analysis
requires computing sufficient statistics that combine information from
different database tables. Such statistics may involve any number of
positive and negative relationships. With a naive enumeration approach,
computing sufficient statistics for negative relationships is feasible only for
small databases. We solve this problem with a new dynamic programming algorithm
that performs a virtual join, where the requisite counts are computed without
materializing join tables. Contingency table algebra is a new extension of
relational algebra that facilitates the efficient implementation of this
Möbius virtual join operation. The Möbius Join scales to large datasets
(over 1M tuples) with complex schemas. Empirical evaluation with seven
benchmark datasets showed that information about the presence and absence of
links can be exploited in feature selection, association rule mining, and
Bayesian network learning. Comment: 11 pages, 8 figures, 8 tables, CIKM'14, November 3-7, 2014, Shanghai,
China
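The core idea of counting negative relationships without enumerating them can be sketched as follows. This is a minimal illustration of the subtraction principle for a single relationship, not the paper's Möbius Join algorithm; the function name `negative_link_counts`, the student/course schema, and the attribute values are all hypothetical.

```python
from collections import Counter

def negative_link_counts(students, courses, registered):
    """Count (student attribute, course attribute) pairs with and without a link.

    students, courses: dict mapping entity id -> attribute value (hypothetical schema).
    registered: set of (student_id, course_id) pairs that DO hold.
    """
    # Counts that ignore the relationship: a product of per-table counts,
    # so the full cross product is never materialized.
    s_counts = Counter(students.values())
    c_counts = Counter(courses.values())
    any_counts = {(s, c): s_counts[s] * c_counts[c]
                  for s in s_counts for c in c_counts}
    # Counts for the positive relationship: one scan of the existing link table.
    pos = Counter((students[s], courses[c]) for s, c in registered)
    # Counts for the negative relationship follow by subtraction.
    neg = {key: total - pos.get(key, 0) for key, total in any_counts.items()}
    return dict(pos), neg

students = {1: "high", 2: "low", 3: "high"}
courses = {"a": "hard", "b": "easy"}
registered = {(1, "a"), (2, "a"), (3, "b")}
pos, neg = negative_link_counts(students, courses, registered)
```

The cost here depends on the size of the link table and the number of attribute combinations, not on the number of unlinked entity pairs, which is what makes the naive enumeration infeasible on large databases.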
Interpretable multiclass classification by MDL-based rule lists
Interpretable classifiers have recently witnessed an increase in attention
from the data mining community because they are inherently easier to understand
and explain than their more complex counterparts. Examples of interpretable
classification models include decision trees, rule sets, and rule lists.
Learning such models often involves optimizing hyperparameters, which typically
requires substantial amounts of data and may result in relatively large models.
In this paper, we consider the problem of learning compact yet accurate
probabilistic rule lists for multiclass classification. Specifically, we
propose a novel formalization based on probabilistic rule lists and the minimum
description length (MDL) principle. This results in virtually parameter-free
model selection that naturally trades off model complexity against
goodness of fit, so that overfitting and the need for hyperparameter tuning
are effectively avoided. Finally, we introduce the Classy algorithm, which
greedily finds rule lists according to the proposed criterion. We empirically
demonstrate that Classy selects small probabilistic rule lists that outperform
state-of-the-art classifiers when it comes to the combination of predictive
performance and interpretability. We show that Classy is insensitive to its
only parameter, i.e., the candidate set, and that compression on the training
set correlates with classification performance, validating our MDL-based
selection criterion.
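A two-part MDL score for a rule list can be sketched as below. This is a simplified illustration under stated assumptions, not the encoding used by Classy: the model cost is a hypothetical flat `bits_per_condition`, class probabilities use Laplace smoothing, and rules are plain dictionaries of feature-value tests.

```python
import math
from collections import Counter

def rule_list_code_length(rules, X, y, classes, bits_per_condition=8.0):
    """Return L(model) + L(data | model) in bits for an ordered rule list.

    rules: list of dicts {feature: required_value}; an implicit default rule
    catches every instance that no listed rule covers.
    """
    # Assign each instance to the first rule that matches it (or the default).
    buckets = [Counter() for _ in range(len(rules) + 1)]
    for row, label in zip(X, y):
        idx = next((i for i, cond in enumerate(rules)
                    if all(row.get(f) == v for f, v in cond.items())),
                   len(rules))
        buckets[idx][label] += 1
    # Data cost: negative log-likelihood under smoothed per-rule class estimates.
    data_bits = 0.0
    for counts in buckets:
        n = sum(counts.values())
        for c in classes:
            if counts[c]:
                p = (counts[c] + 1) / (n + len(classes))
                data_bits -= counts[c] * math.log2(p)
    # Model cost: an assumed flat price per condition (hypothetical).
    model_bits = bits_per_condition * sum(len(cond) for cond in rules)
    return model_bits + data_bits

X = [{"a": 1}, {"a": 1}, {"a": 0}, {"a": 0}]
y = ["p", "p", "n", "n"]
empty_score = rule_list_code_length([], X, y, ["p", "n"])
one_rule_score = rule_list_code_length([{"a": 1}], X, y, ["p", "n"],
                                       bits_per_condition=1.0)
```

A rule is worth adding only when the bits it saves on the data exceed the bits needed to describe it, which is how the criterion penalizes complexity without a separately tuned regularization parameter.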
An enhanced intelligent database engine by neural network and data mining
An Intelligent Database Engine (IDE) is developed to solve any classification problem by providing two integrated features: decision-making by a backpropagation (BP) neural network (NN) and decision support by Apriori, a data mining (DM) algorithm. Previous experimental results show the accuracy of NN (90%) and DM (60%) to be drastically different. Thus, efforts to improve DM accuracy are crucial to ensure a well-balanced hybrid architecture. The poor DM performance is caused by either too few rules or too many poor rules being generated in the classifier. The first problem is curbed by generating multiple-level rules, by incorporating multiple attribute support and level confidence into the initial Apriori. The second problem is tackled by implementing two strengthening procedures, confidence and Bayes verification, to filter out the unpredictive rules. Experiments with more datasets are carried out to compare the performance of the initial and improved Apriori. Great improvement is obtained for the latte
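Filtering class association rules by support and confidence can be sketched as below. This shows only the standard definitions; it does not reproduce the paper's multiple attribute support, level confidence, or Bayes verification, and the function names and thresholds are hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | {consequent}
    # Number of transactions containing the antecedent, and antecedent + class.
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

def filter_rules(transactions, rules, min_support=0.2, min_confidence=0.6):
    """Keep only rules meeting both (assumed) thresholds."""
    kept = []
    for antecedent, consequent in rules:
        s, c = rule_stats(transactions, antecedent, consequent)
        if s >= min_support and c >= min_confidence:
            kept.append((antecedent, consequent, s, c))
    return kept

transactions = [
    frozenset({"x", "y", "A"}),
    frozenset({"x", "A"}),
    frozenset({"x", "B"}),
    frozenset({"y", "B"}),
]
candidate_rules = [({"x"}, "A"), ({"y"}, "B")]
kept = filter_rules(transactions, candidate_rules)
```

Raising the confidence threshold removes weakly predictive rules but risks leaving some instances uncovered, which is the "too few rules" versus "too many poor rules" tension the abstract describes.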
Medical data mining using evolutionary computation.
by Ngan Po Shun.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 109-115).
Abstract also in Chinese.
Chapter 1 --- Introduction
Chapter 1.1 --- Data Mining
Chapter 1.2 --- Motivation
Chapter 1.3 --- Contributions of the research
Chapter 1.4 --- Organization of the thesis
Chapter 2 --- Related Work in Data Mining
Chapter 2.1 --- Decision Tree Approach
Chapter 2.1.1 --- ID3
Chapter 2.1.2 --- C4.5
Chapter 2.2 --- Classification Rule Learning
Chapter 2.2.1 --- AQ algorithm
Chapter 2.2.2 --- CN2
Chapter 2.2.3 --- C4.5RULES
Chapter 2.3 --- Association Rule Mining
Chapter 2.3.1 --- Apriori
Chapter 2.3.2 --- Quantitative Association Rule Mining
Chapter 2.4 --- Statistical Approach
Chapter 2.4.1 --- Chi Square Test and Bayesian Classifier
Chapter 2.4.2 --- FORTY-NINER
Chapter 2.4.3 --- EXPLORA
Chapter 2.5 --- Bayesian Network Learning
Chapter 2.5.1 --- Learning Bayesian Networks using the Minimum Description Length (MDL) Principle
Chapter 2.5.2 --- Discretizating Continuous Attributes while Learning Bayesian Networks
Chapter 3 --- Overview of Evolutionary Computation
Chapter 3.1 --- Evolutionary Computation
Chapter 3.1.1 --- Genetic Algorithm
Chapter 3.1.2 --- Genetic Programming
Chapter 3.1.3 --- Evolutionary Programming
Chapter 3.1.4 --- Evolution Strategy
Chapter 3.1.5 --- Selection Methods
Chapter 3.2 --- Generic Genetic Programming
Chapter 3.3 --- Data mining using Evolutionary Computation
Chapter 4 --- Applying Generic Genetic Programming for Rule Learning
Chapter 4.1 --- Grammar
Chapter 4.2 --- Population Creation
Chapter 4.3 --- Genetic Operators
Chapter 4.4 --- Evaluation of Rules
Chapter 5 --- Learning Multiple Rules from Data
Chapter 5.1 --- Previous approaches
Chapter 5.1.1 --- Preselection
Chapter 5.1.2 --- Crowding
Chapter 5.1.3 --- Deterministic Crowding
Chapter 5.1.4 --- Fitness sharing
Chapter 5.2 --- Token Competition
Chapter 5.3 --- The Complete Rule Learning Approach
Chapter 5.4 --- Experiments with Machine Learning Databases
Chapter 5.4.1 --- Experimental results on the Iris Plant Database
Chapter 5.4.2 --- Experimental results on the Monk Database
Chapter 6 --- Bayesian Network Learning
Chapter 6.1 --- The MDLEP Learning Approach
Chapter 6.2 --- Learning of Discretization Policy by Genetic Algorithm
Chapter 6.2.1 --- Individual Representation
Chapter 6.2.2 --- Genetic Operators
Chapter 6.3 --- Experimental Results
Chapter 6.3.1 --- Experiment 1
Chapter 6.3.2 --- Experiment 2
Chapter 6.3.3 --- Experiment 3
Chapter 6.3.4 --- Comparison between the GA approach and the greedy approach
Chapter 7 --- Medical Data Mining System
Chapter 7.1 --- A Case Study on the Fracture Database
Chapter 7.1.1 --- Results of Causality and Structure Analysis
Chapter 7.1.2 --- Results of Rule Learning
Chapter 7.2 --- A Case Study on the Scoliosis Database
Chapter 7.2.1 --- Results of Causality and Structure Analysis
Chapter 7.2.2 --- Results of Rule Learning
Chapter 8 --- Conclusion and Future Work
Bibliography
Chapter A --- The Rule Sets Discovered
Chapter A.1 --- The Best Rule Set Learned from the Iris Database
Chapter A.2 --- The Best Rule Set Learned from the Monk Database
Chapter A.2.1 --- Monk1
Chapter A.2.2 --- Monk2
Chapter A.2.3 --- Monk3
Chapter A.3 --- The Best Rule Set Learned from the Fracture Database
Chapter A.3.1 --- Type I Rules: About Diagnosis
Chapter A.3.2 --- Type II Rules: About Operation/Surgeon
Chapter A.3.3 --- Type III Rules: About Stay
Chapter A.4 --- The Best Rule Set Learned from the Scoliosis Database
Chapter A.4.1 --- Rules for Classification
Chapter A.4.2 --- Rules for Treatment
Chapter B --- The Grammar used for the fracture and Scoliosis databases
Chapter B.1 --- The grammar for the fracture database
Chapter B.2 --- The grammar for the Scoliosis database
- …