On Cognitive Preferences and the Plausibility of Rule-based Models
It is conventional wisdom in machine learning and data mining that logical
models such as rule sets are more interpretable than other models, and that
among such rule-based models, simpler models are more interpretable than more
complex ones. In this position paper, we question this latter assumption by
focusing on one particular aspect of interpretability, namely the plausibility
of models. Roughly speaking, we equate the plausibility of a model with the
likeliness that a user accepts it as an explanation for a prediction. In
particular, we argue that, all other things being equal, longer explanations
may be more convincing than shorter ones, and that the predominant bias for
shorter models, which is typically necessary for learning powerful
discriminative models, may not be suitable when it comes to user acceptance of
the learned models. To that end, we first recapitulate evidence for and against
this postulate, and then report the results of an evaluation in a
crowd-sourcing study based on about 3,000 judgments. The results do not reveal
a strong preference for simple rules, whereas we can observe a weak preference
for longer rules in some domains. We then relate these results to well-known
cognitive biases such as the conjunction fallacy, the representativeness
heuristic, or the recognition heuristic, and investigate their relation to rule
length and plausibility.
Comment: V4: Another rewrite of the section on interpretability to clarify the focus on plausibility and its relation to interpretability, comprehensibility, and justifiability
Robust subgroup discovery
We introduce the problem of robust subgroup discovery, i.e., finding a set of
interpretable descriptions of subsets that 1) stand out with respect to one or
more target attributes, 2) are statistically robust, and 3) are non-redundant. Many
attempts have been made to mine either locally robust subgroups or to tackle
the pattern explosion, but we are the first to address both challenges at the
same time from a global modelling perspective. First, we formulate the broad
model class of subgroup lists, i.e., ordered sets of subgroups, for univariate
and multivariate targets that can consist of nominal or numeric variables, and
that includes traditional top-1 subgroup discovery in its definition. This
novel model class allows us to formalise the problem of optimal robust subgroup
discovery using the Minimum Description Length (MDL) principle, where we resort
to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and
numeric targets, respectively. Second, as finding optimal subgroup lists is
NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists
and guarantees that the most significant subgroup found according to the MDL
criterion is added in each iteration, which is shown to be equivalent to a
Bayesian one-sample proportions, multinomial, or t-test between the subgroup
and dataset marginal target distributions plus a multiple hypothesis testing
penalty. We empirically show on 54 datasets that SSD++ outperforms previous
subgroup set discovery methods in terms of quality and subgroup list size.
Comment: For associated code, see https://github.com/HMProenca/RuleList ; submitted to the Data Mining and Knowledge Discovery Journal
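The MDL gain that SSD++ optimises when adding a subgroup is beyond the scope of a short sketch, but the underlying idea of scoring a candidate subgroup against the dataset's marginal target distribution can be illustrated with weighted relative accuracy (WRAcc), a standard subgroup quality measure used here only as a stand-in; the function and variable names are illustrative, not the paper's code.

```python
# Illustrative sketch: score a candidate subgroup with WRAcc, a common
# subgroup-discovery quality measure (stand-in for the MDL criterion of SSD++).
def wracc(cover, positive, n_total, n_pos_total):
    """cover: examples matched by the subgroup description;
    positive: how many of those carry the target label;
    n_total / n_pos_total: the same counts over the whole dataset."""
    if cover == 0:
        return 0.0
    coverage = cover / n_total
    # Deviation of the subgroup's target rate from the dataset marginal,
    # weighted by how much of the data the subgroup covers.
    return coverage * (positive / cover - n_pos_total / n_total)
```

A subgroup covering 50 of 200 examples with 40 positives, against an overall positive rate of 80/200, scores 0.25 * (0.8 - 0.4) = 0.1; a subgroup whose target rate matches the marginal scores 0.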
Efficient learning of large sets of locally optimal classification rules
Conventional rule learning algorithms aim at finding a set of simple rules,
where each rule covers as many examples as possible. In this paper, we argue
that the rules found in this way may not be the optimal explanations for each
of the examples they cover. Instead, we propose an efficient algorithm that
aims at finding the best rule covering each training example in a greedy
optimization consisting of one specialization and one generalization loop.
These locally optimal rules are collected and then filtered for a final rule
set, which is much larger than the sets learned by conventional rule learning
algorithms. A new example is classified by selecting the best among the rules
that cover this example. In our experiments on small to very large datasets,
the approach's average classification accuracy is higher than that of
state-of-the-art rule learning algorithms. Moreover, the algorithm is highly
efficient and can inherently be processed in parallel without affecting the
learned rule set and so the classification accuracy. We thus believe that it
closes an important gap for large-scale classification rule induction.
Comment: article, 40 pages, Machine Learning journal (2023)
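The classification step described above, selecting the best rule among those covering a new example, can be sketched as follows; the rule representation (a condition dictionary, a predicted class, and a quality score) is an assumption for illustration, not the paper's implementation.

```python
# Minimal sketch of "classify by the best covering rule" (representation assumed):
# each rule is (conditions, predicted_class, quality).
def covers(conditions, example):
    # A rule covers an example if every attribute-value condition matches.
    return all(example.get(attr) == val for attr, val in conditions.items())

def classify(rules, example, default="neg"):
    covering = [r for r in rules if covers(r[0], example)]
    if not covering:
        return default  # fall back when no rule fires
    # Pick the prediction of the highest-quality covering rule.
    return max(covering, key=lambda r: r[2])[1]
```

With rules [({"color": "red"}, "pos", 0.9), ({"size": "big"}, "neg", 0.7)], an example that is both red and big is classified "pos", since the higher-quality rule wins.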
The use of data-mining for the automatic formation of tactics
This paper discusses the use of data-mining for the automatic formation of tactics. It was presented at the Workshop on Computer-Supported Mathematical Theory Development held at IJCAR in 2004. The aim of this project is to evaluate the applicability of data-mining techniques to the automatic formation of tactics from large corpora of proofs. We data-mine information from large proof corpora to find commonly occurring patterns. These patterns are then evolved into tactics using genetic programming techniques.
OWL-Miner: Concept Induction in OWL Knowledge Bases
The Resource Description Framework (RDF) and Web Ontology Language (OWL) have been widely used in recent years, and automated methods for the analysis of data and knowledge directly within these formalisms are of current interest. Concept induction is a technique for discovering descriptions of data, such as inducing OWL class expressions to describe RDF data. These class expressions capture patterns in the data which can be used to characterise interesting clusters or to act as classification rules over unseen data. The semantics of OWL is underpinned by Description Logics (DLs), a family of expressive and decidable fragments of first-order logic. Recently, methods of concept induction which are well studied in the field of Inductive Logic Programming have been applied to the related formalism of DLs. These methods have been developed for a number of purposes including unsupervised clustering and supervised classification. Refinement-based search is a concept induction technique which structures the search space of DL concept/OWL class expressions and progressively generalises or specialises candidate concepts to cover example data as guided by quality criteria such as accuracy. However, the current state-of-the-art in this area is limited in that such methods: were not primarily designed to scale over large RDF/OWL knowledge bases; do not support class languages as expressive as OWL2-DL; or are limited to one purpose, such as learning OWL classes for integration into ontologies. Our work addresses these limitations by increasing the efficiency of these learning methods whilst permitting a concept language up to the expressivity of OWL2-DL classes. We describe methods which support both classification (predictive induction) and subgroup discovery (descriptive induction), which, in this context, are fundamentally related.
We have implemented our methods as the system called OWL-Miner and show by evaluation that our methods outperform state-of-the-art systems for DL learning in both the quality of solutions found and the speed in which they are computed. Furthermore, we achieve the best ever ten-fold cross-validation accuracy results on the long-standing benchmark problem of carcinogenesis. Finally, we present a case study on ongoing work in the application of OWL-Miner to a real-world problem directed at improving the efficiency of biological macromolecular crystallisation.
Discretization in subgroup discovery
Subgroup discovery is a data mining technique for discovering interesting subgroups from a selected population. It seeks to discover interesting relationships between different objects in a set with respect to a specific property. The discovered patterns are called subgroups, and they are represented in the form of rules. Discretization is a technique for replacing numerical attributes with nominal ones, making it possible to use them with algorithms that do not support numerical attributes.
In this thesis, two datasets are discretized for the application of subgroup discovery. Four different discretization methods were used, and three different bin counts were applied. The datasets used are the heart disease and Australian credit approval datasets from the UCI Machine Learning Repository. The subgroup discovery technique produced eleven subgroup sets as a result: eight from the heart disease dataset and three from the Australian credit approval dataset. We observed that the number of bins greatly affects the results. Also, with binary discretization, there are subgroup sets with a high share of subgroups containing discretized attributes. In addition, the importance of expert guidance is emphasized.
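Equal-width binning is one of the simplest discretization methods of the kind described above; the sketch below is illustrative (the thesis's exact methods and bin labels are not specified here), showing how a numeric attribute is mapped to nominal bin labels before subgroup discovery.

```python
# Illustrative equal-width discretization: map numeric values to nominal
# bin labels ("bin0", "bin1", ...) so rule-based algorithms can use them.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    labels = []
    for v in values:
        # The maximum value would fall just past the last bin; clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{idx}")
    return labels
```

For example, equal_width_bins([0, 5, 10], 2) yields ["bin0", "bin1", "bin1"]: the range [0, 10] is split at 5, and the maximum value is clamped into the last bin. The choice of bin count matters, which is exactly the sensitivity the thesis reports.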
Anytime Discovery of a Diverse Set of Patterns with Monte Carlo Tree Search
The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that enable eliciting such interesting patterns from labeled data. A question remains fairly open: how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches make use of beam search, sampling, and genetic algorithms for discovering a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns that are of high quality and different enough from one another are discovered. Our main contribution is then to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.
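The best-first, exploration-versus-exploitation behaviour that makes MCTS attractive here rests on its selection step. The sketch below shows standard UCB1 selection over child statistics; it is a generic illustration of that step, not the paper's SD-specific policies, and all names are illustrative.

```python
# Generic UCB1 selection, the tree-policy step an MCTS over a pattern
# lattice would use to choose which refinement of a pattern to descend into.
import math

def ucb1(total_value, visits, parent_visits, c=1.41):
    if visits == 0:
        return float("inf")  # always try an unvisited refinement first
    # Exploitation: mean quality of patterns sampled under this child.
    # Exploration: bonus that shrinks as the child is visited more often.
    return total_value / visits + c * math.sqrt(
        math.log(parent_visits) / visits
    )

def select_child(children):
    """children: list of (total_value, visits) pairs for one node."""
    parent_visits = sum(v for _, v in children)
    scores = [ucb1(val, vis, parent_visits) for val, vis in children]
    return scores.index(max(scores))
```

An unvisited child always wins selection (infinite score), which is what lets the search keep widening into unexplored regions of the pattern space while still concentrating simulations on promising refinements.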