
    Directional Decision Lists

    In this paper we introduce a novel family of decision lists consisting of highly interpretable models which can be learned efficiently in a greedy manner. The defining property is that all rules are oriented in the same direction. Particular examples of this family are decision lists with monotonically decreasing (or increasing) probabilities. On simulated data we empirically confirm that the proposed model family is easier to train than general decision lists. We exemplify the practical usability of our approach by identifying problem symptoms in a manufacturing process. Comment: IEEE Big Data for Advanced Manufacturing.
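
    To make the defining property concrete, here is a minimal sketch (not taken from the paper) of how a directional decision list with monotonically non-increasing rule probabilities might be represented and applied; the Rule class, the feature names, and the probability values are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): a decision list whose
# rules are ordered so that their positive-class probabilities never increase.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    condition: Callable[[dict], bool]  # predicate over a feature dictionary
    probability: float                 # P(positive class | rule fires)


class DirectionalDecisionList:
    def __init__(self, rules: List[Rule], default_probability: float):
        # Enforce the defining property: probabilities are monotonically non-increasing.
        probs = [r.probability for r in rules] + [default_probability]
        if any(earlier < later for earlier, later in zip(probs, probs[1:])):
            raise ValueError("rules must be ordered by non-increasing probability")
        self.rules = rules
        self.default_probability = default_probability

    def predict_proba(self, x: dict) -> float:
        # Probability of the first rule that fires, else the default probability.
        for rule in self.rules:
            if rule.condition(x):
                return rule.probability
        return self.default_probability


# Hypothetical manufacturing-style record with made-up feature names.
model = DirectionalDecisionList(
    [Rule(lambda x: x["temperature"] > 90, 0.8),
     Rule(lambda x: x["vibration"] > 0.5, 0.6)],
    default_probability=0.1,
)
print(model.predict_proba({"temperature": 95, "vibration": 0.2}))  # -> 0.8
```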

    Computationally efficient induction of classification rules with the PMCRI and J-PMCRI frameworks

    In order to gain knowledge from large databases, scalable data mining technologies are needed. Data are captured on a large scale and thus databases are growing at a fast pace, which leads to the utilisation of parallel computing technologies in order to cope with large amounts of data. In the area of classification rule induction, parallelisation efforts have focused on the divide and conquer approach, also known as Top Down Induction of Decision Trees (TDIDT). An alternative approach to classification rule induction is separate and conquer, which has only recently become a focus of parallelisation. This work introduces and empirically evaluates a framework for the parallel induction of classification rules generated by members of the Prism family of algorithms, all of which follow the separate and conquer approach.
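
    For context, the sketch below illustrates the sequential separate-and-conquer loop that Prism-family algorithms follow: greedily specialise a rule for the target class, remove the examples it covers, and repeat. The parallel PMCRI/J-PMCRI machinery is not modelled, and the specialisation heuristic shown is a simplified stand-in, so treat this as an assumption-laden illustration rather than the frameworks' implementation.

```python
# Hedged sketch of the sequential separate-and-conquer loop behind Prism-family
# algorithms; the attribute-value selection heuristic is a simplified stand-in.

def learn_rule(examples, target_class):
    """Specialise a conjunctive rule until it covers only the target class."""
    rule = {}                       # attribute -> required value
    covered = examples
    attributes = {a for ex, _ in examples for a in ex}
    while any(cls != target_class for _, cls in covered) and attributes - set(rule):
        best = None                 # (precision, attribute, value, covered subset)
        for attr in attributes - set(rule):
            for value in {ex[attr] for ex, _ in covered}:
                subset = [(ex, cls) for ex, cls in covered if ex[attr] == value]
                precision = sum(cls == target_class for _, cls in subset) / len(subset)
                if best is None or precision > best[0]:
                    best = (precision, attr, value, subset)
        rule[best[1]] = best[2]
        covered = best[3]
    return rule, covered


def separate_and_conquer(examples, target_class):
    """Learn a rule, remove the examples it covers, and repeat."""
    rules, remaining = [], list(examples)
    while any(cls == target_class for _, cls in remaining):
        rule, covered = learn_rule(remaining, target_class)
        rules.append(rule)
        remaining = [e for e in remaining if e not in covered]
    return rules


data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "rain", "windy": "yes"}, "stay"),
        ({"outlook": "sunny", "windy": "yes"}, "play")]
print(separate_and_conquer(data, "play"))  # e.g. [{'outlook': 'sunny'}]
```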

    A Survey of Parallel Data Mining

    With the fast, continuous increase in the number and size of databases, parallel data mining is a natural and cost-effective approach to tackling the problem of scalability in data mining. Recently there has been considerable research on parallel data mining. However, most projects focus on the parallelization of a single kind of data mining algorithm/paradigm. This paper surveys parallel data mining with a broader perspective. More precisely, we discuss the parallelization of data mining algorithms of four knowledge discovery paradigms, namely rule induction, instance-based learning, genetic algorithms and neural networks. Using the lessons learned from this discussion, we also derive a set of heuristic principles for designing efficient parallel data mining algorithms.
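
    As a small, hedged illustration of the data-parallel pattern common to several of the surveyed paradigms (not an algorithm from the survey itself), the sketch below partitions the training data across worker processes, computes local class-conditional attribute-value counts of the kind a rule inducer needs, and merges the partial results.

```python
# Hedged illustration of a data-parallel pattern (not a specific surveyed
# algorithm): each worker counts class-conditional attribute-value pairs on
# its partition, and the partial counts are merged afterwards.

from collections import Counter
from multiprocessing import Pool


def local_counts(partition):
    """Count (attribute, value, class) triples in one data partition."""
    counts = Counter()
    for features, cls in partition:
        for attr, value in features.items():
            counts[(attr, value, cls)] += 1
    return counts


def parallel_counts(data, n_workers=4):
    """Split the data across worker processes and merge their local counts."""
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(local_counts, chunks)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    data = [({"outlook": "sunny"}, "play"), ({"outlook": "rain"}, "stay")] * 1000
    print(parallel_counts(data).most_common(2))
```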

    A Fast Algorithm Finding the Shortest Reset Words

    In this paper we present a new fast algorithm for finding minimal reset words of finite synchronizing automata. The problem is known to be computationally hard, and our algorithm is exponential. Yet, it is faster than the algorithms used so far and it works well in practice. The main idea is to use a bidirectional BFS and radix (Patricia) tries to store and compare the resulting subsets. We give both theoretical and practical arguments showing that the branching factor is reduced efficiently. As a practical test, we perform an experimental study of the length of the shortest reset word for random automata with $n$ states and 2 input letters. We follow Skvorsov and Tipikin, who performed such a study using a SAT solver and considering automata with up to $n = 100$ states. With our algorithm we are able to consider a much larger sample of automata with up to $n = 300$ states. In particular, we obtain a new, more precise estimate of the expected length of the shortest reset word, $\approx 2.5\sqrt{n-5}$. Comment: COCOON 2013. The final publication is available at http://link.springer.com/chapter/10.1007%2F978-3-642-38768-5_1
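
    The following sketch shows the core idea of searching over subsets of states with BFS. It is deliberately simplified: the search runs in one direction only and stores visited subsets as plain frozensets, whereas the paper's algorithm is bidirectional and uses radix (Patricia) tries; the Cerny-style example automaton is an illustrative assumption.

```python
# Hedged sketch: one-directional BFS over subsets of states for a shortest
# reset (synchronizing) word. The paper's algorithm is bidirectional and keeps
# the visited subsets in radix (Patricia) tries; plain frozensets are used here.

from collections import deque


def shortest_reset_word(delta, n_states, alphabet):
    """delta[(state, letter)] -> next state; returns a shortest reset word, or None."""
    start = frozenset(range(n_states))
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        subset, word = queue.popleft()
        if len(subset) == 1:
            return word
        for letter in alphabet:
            image = frozenset(delta[(s, letter)] for s in subset)
            if image not in seen:
                seen.add(image)
                queue.append((image, word + letter))
    return None  # the automaton is not synchronizing


# Cerny-style automaton with 4 states: 'a' is a cyclic shift, 'b' merges 3 into 0.
delta = {}
for s in range(4):
    delta[(s, "a")] = (s + 1) % 4
    delta[(s, "b")] = 0 if s == 3 else s
print(shortest_reset_word(delta, 4, "ab"))  # a word of length (4-1)^2 = 9
```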

    A review of associative classification mining

    Associative classification mining is a promising approach in data mining that utilizes the association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative classification techniques with regard to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted in this paper.
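
    The sketch below illustrates the general associative-classification workflow the survey describes: mine class association rules, rank them (here by confidence and then support, a common CBA-style ordering), and classify with the first matching rule. It restricts antecedents to single items for brevity and is not any specific surveyed algorithm.

```python
# Hedged sketch of the CBA-style associative-classification workflow, with
# antecedents restricted to single items for brevity; it is not any of the
# specific algorithms surveyed above.

from collections import defaultdict


def mine_rules(data, min_support=0.1, min_confidence=0.6):
    """data: list of (item set, class); returns rules (item, class, confidence, support)."""
    n = len(data)
    item_count = defaultdict(int)
    item_class_count = defaultdict(int)
    for items, cls in data:
        for item in items:
            item_count[item] += 1
            item_class_count[(item, cls)] += 1
    rules = []
    for (item, cls), count in item_class_count.items():
        support, confidence = count / n, count / item_count[item]
        if support >= min_support and confidence >= min_confidence:
            rules.append((item, cls, confidence, support))
    # Rank rules: higher confidence first, ties broken by higher support.
    return sorted(rules, key=lambda r: (-r[2], -r[3]))


def classify(rules, items, default="unknown"):
    """Predict with the first (highest-ranked) rule whose antecedent is satisfied."""
    for item, cls, _, _ in rules:
        if item in items:
            return cls
    return default


data = [({"humid", "hot"}, "no"), ({"humid", "mild"}, "no"), ({"dry", "mild"}, "yes")]
rules = mine_rules(data)
print(classify(rules, {"dry", "mild"}))  # -> "yes"
```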

    A methodology for the generation of efficient error detection mechanisms

    A dependable software system must contain error detection mechanisms and error recovery mechanisms. Software components for the detection of errors are typically designed based on a system specification or the experience of software engineers, with their efficiency typically being measured using fault injection and metrics such as coverage and latency. In this paper, we introduce a methodology for the design of highly efficient error detection mechanisms. The proposed methodology combines fault injection analysis and data mining techniques in order to generate predicates for efficient error detection mechanisms. The results presented demonstrate the viability of the methodology as an approach for the development of efficient error detection mechanisms, as the generated predicates yield a true positive rate of almost 100% and a false positive rate very close to 0% for the detection of failure-inducing states. The main advantage of the proposed methodology over current state-of-the-art approaches is that efficient detectors are obtained by design, rather than by using specification-based detector design or the experience of software engineers.
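
    As a rough, hedged illustration of this workflow (with a generic decision tree standing in for the paper's actual data-mining step, and with synthetic data in place of real fault-injection logs), the sketch below learns a candidate detection predicate from labelled program-state snapshots and reports its true and false positive rates.

```python
# Hedged sketch with a generic decision tree standing in for the paper's
# data-mining step and synthetic data standing in for fault-injection results.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic "fault injection" outcome: each row is a program-state snapshot
# (e.g. variable values), labelled 1 if it led to a failure and 0 otherwise.
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 1.2).astype(int)            # hypothetical failure-inducing condition

# Mine a candidate error-detection predicate from the labelled states.
detector = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Evaluate the predicate by its true and false positive rates.
pred = detector.predict(X)
tpr = ((pred == 1) & (y == 1)).sum() / max((y == 1).sum(), 1)
fpr = ((pred == 1) & (y == 0)).sum() / max((y == 0).sum(), 1)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```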

    Improving performance through concept formation and conceptual clustering

    Research from June 1989 through October 1992 focused on concept formation, clustering, and supervised learning for the purpose of improving the efficiency of problem solving, planning, and diagnosis. These projects resulted in two dissertations on clustering, explanation-based learning, and means-ends planning, as well as publications in conference and workshop proceedings, book chapters, and journals; a complete bibliography of NASA Ames-supported publications is included. The following topics are studied: clustering of explanations and problem-solving experiences; clustering and means-ends planning; and diagnosis of space shuttle and space station operating modes.

    Interpretable multiclass classification by MDL-based rule lists

    Interpretable classifiers have recently witnessed an increase in attention from the data mining community because they are inherently easier to understand and explain than their more complex counterparts. Examples of interpretable classification models include decision trees, rule sets, and rule lists. Learning such models often involves optimizing hyperparameters, which typically requires substantial amounts of data and may result in relatively large models. In this paper, we consider the problem of learning compact yet accurate probabilistic rule lists for multiclass classification. Specifically, we propose a novel formalization based on probabilistic rule lists and the minimum description length (MDL) principle. This results in virtually parameter-free model selection that naturally allows trading off model complexity against goodness of fit, by which overfitting and the need for hyperparameter tuning are effectively avoided. Finally, we introduce the Classy algorithm, which greedily finds rule lists according to the proposed criterion. We empirically demonstrate that Classy selects small probabilistic rule lists that outperform state-of-the-art classifiers when it comes to the combination of predictive performance and interpretability. We show that Classy is insensitive to its only parameter, i.e., the candidate set, and that compression on the training set correlates with classification performance, validating our MDL-based selection criterion.
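
    To give a feel for the greedy, MDL-guided selection idea, here is a heavily simplified sketch. It is not the Classy algorithm: Classy uses a carefully derived description-length encoding, whereas this sketch scores a rule list with an ad-hoc two-part cost (a fixed per-rule model cost plus the negative log-likelihood of the labels under Laplace-smoothed per-rule class distributions) and greedily adds the candidate rule that lowers the score most.

```python
# Heavily simplified, hedged sketch of greedy MDL-guided rule-list selection;
# the score used here is an ad-hoc stand-in for Classy's actual encoding.

import math
from collections import Counter


def assign(rule_conds, data):
    """Assign each example's label to the first matching rule (index) or to the default (-1)."""
    groups = {i: [] for i in range(len(rule_conds))}
    groups[-1] = []
    for x, y in data:
        idx = next((i for i, cond in enumerate(rule_conds) if cond(x)), -1)
        groups[idx].append(y)
    return groups


def description_length(rule_conds, data, classes, model_cost_per_rule=2.0):
    """Two-part score: per-rule model cost plus -log2 of a Laplace-smoothed likelihood."""
    data_cost = 0.0
    for labels in assign(rule_conds, data).values():
        counts = Counter(labels)
        total = len(labels) + len(classes)
        for y in labels:
            data_cost += -math.log2((counts[y] + 1) / total)
    return len(rule_conds) * model_cost_per_rule + data_cost


def greedy_rule_list(candidates, data):
    """Greedily append the candidate rule that reduces the score, until none does."""
    classes = {y for _, y in data}
    rule_conds = []
    best = description_length(rule_conds, data, classes)
    while True:
        remaining = [c for c in candidates if c not in rule_conds]
        if not remaining:
            return rule_conds
        scored = [(description_length(rule_conds + [c], data, classes), c) for c in remaining]
        score, chosen = min(scored, key=lambda pair: pair[0])
        if score >= best:
            return rule_conds
        best = score
        rule_conds.append(chosen)


data = [({"x": 1}, "a"), ({"x": 2}, "b"), ({"x": 3}, "b")] * 10
candidates = [lambda ex: ex["x"] == 1, lambda ex: ex["x"] >= 2]
print(len(greedy_rule_list(candidates, data)))  # -> 1
```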