Comprehensible and Robust Knowledge Discovery from Small Datasets
Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. The data may represent a set of measurements from a real-world process, or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarises the data as accurately as possible and (2) comes in a well-understandable form. Decision trees and subgroup discovery methods deliver knowledge summaries in the form of hyperrectangles, which are considered easy to understand.
To demonstrate the importance of an understandable data summary, we study Decentral Smart Grid Control, a new system that implements demand response in power grids without substantial changes to the infrastructure. The conventional analysis of this system carried out so far was limited to identical participants and therefore did not reflect reality sufficiently well. We run many simulations with varying input values and apply decision trees to the resulting data. With the resulting understandable data summaries, we were able to gain new insights into the behaviour of Decentral Smart Grid Control.
Decision trees describe the system behaviour for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but rather in finding regions that lead to a certain output (so-called subgroups). Existing subgroup discovery algorithms usually require large amounts of data to produce stable and accurate output, yet the data acquisition process is often costly. Our main contribution is improving subgroup discovery from datasets with few observations.
Subgroup discovery in simulated data is referred to as scenario discovery. A frequently used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. For REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS performs much better than PRIM on its own: it reduces the number of required simulation runs by 75% on average.
With simulated data, one has perfect knowledge of the input distribution, which is a prerequisite of REDS. To make REDS applicable to real-world measurement data, we combined it with sampling from an estimated multivariate distribution of the data. We evaluated the resulting method experimentally in combination with different data-generation methods, for both PRIM and BestInterval, another representative subgroup discovery method. In most cases, our methodology increased the quality of the discovered subgroups.
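The REDS pipeline hands model-generated data to PRIM, which greedily shrinks a hyperrectangle around high-output regions. A minimal single-trajectory peeling sketch in Python (function and parameter names are our own; real PRIM additionally includes a pasting phase and trajectory selection):

```python
def prim_peel(X, y, alpha=0.1, min_support=0.1):
    """One PRIM peeling trajectory: repeatedly remove roughly an
    alpha-fraction of points from one end of one input dimension,
    choosing the peel that maximizes the mean target value in the
    remaining box, until the support floor is reached."""
    n_total = len(y)
    idx = list(range(n_total))
    dims = list(range(len(X[0])))
    box = [(min(x[d] for x in X), max(x[d] for x in X)) for d in dims]
    while len(idx) * (1 - alpha) >= min_support * n_total:
        best = None  # (mean, kept indices, dim, side, cut value)
        for d in dims:
            vals = sorted(X[i][d] for i in idx)
            k = max(1, int(alpha * len(idx)))
            for side, cut in (("low", vals[k]), ("high", vals[-k - 1])):
                keep = [i for i in idx
                        if (X[i][d] >= cut if side == "low" else X[i][d] <= cut)]
                if not keep or len(keep) == len(idx):
                    continue  # degenerate peel, try another end
                mean = sum(y[i] for i in keep) / len(keep)
                if best is None or mean > best[0]:
                    best = (mean, keep, d, side, cut)
        if best is None:
            break
        _, idx, d, side, cut = best
        lo, hi = box[d]
        box[d] = (cut, hi) if side == "low" else (lo, cut)
    return box, idx
```

REDS would call such a routine not on the few expensive simulation runs directly, but on a large sample labelled by the intermediate model.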
Multiobjective Evolutionary Induction of Subgroup Discovery Fuzzy Rules: A Case Study in Marketing
This paper presents a multiobjective genetic algorithm which obtains
fuzzy rules for subgroup discovery in disjunctive normal form. This kind of
fuzzy rule lets us represent knowledge about patterns of interest in an
explanatory and understandable form which can be used by the expert. The
evolutionary algorithm follows a multiobjective approach in order to optimize
in a suitable way the different quality measures used in this kind of problem.
Experimental evaluation of the algorithm, applying it to a market problem
studied in the University of MondragĂłn (Spain), shows the validity of the
proposal. The application of the proposal to this problem allows us to obtain
novel and valuable knowledge for the experts.
Funding: Spanish Ministry of Science and Technology / FEDER, grants TIC-2005-08386-C05-01, TIC-2005-08386-C05-03, TIN2004-20061-E and TIN2004-21343-
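A fuzzy rule in disjunctive normal form combines conjunctions of fuzzy conditions with a disjunction. A toy sketch of how such a rule's membership degree can be evaluated (Zadeh min/max operators; the variable names and fuzzy sets are illustrative, not from the paper):

```python
def triangular(a, b, c):
    """Triangular fuzzy set with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def dnf_membership(instance, rule):
    """Membership degree of an instance in a fuzzy DNF antecedent:
    maximum over the disjuncts, minimum over the fuzzy conditions
    inside each disjunct (Zadeh t-norm/t-conorm)."""
    return max(
        min(mu(instance[var]) for var, mu in conjunct.items())
        for conjunct in rule
    )
```

A rule such as "age is young OR (age is middle AND income is high)" is then a list of two dicts, and an instance belongs to the subgroup to the degree returned.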
Learning Interpretable Rules for Multi-label Classification
Multi-label classification (MLC) is a supervised learning problem in which,
contrary to standard multiclass classification, an instance can be associated
with several class labels simultaneously. In this chapter, we advocate a
rule-based approach to multi-label classification. Rule learning algorithms are
often employed when one is not only interested in accurate predictions, but
also requires an interpretable theory that can be understood, analyzed, and
qualitatively evaluated by domain experts. Ideally, by revealing patterns and
regularities contained in the data, a rule-based theory yields new insights in
the application domain. Recently, several authors have started to investigate
how rule-based models can be used for modeling multi-label data. Discussing
this task in detail, we highlight some of the problems that make rule learning
considerably more challenging for MLC than for conventional classification.
While mainly focusing on our own previous work, we also provide a short
overview of related work in this area.
Comment: Preprint version. To appear in: Explainable and Interpretable Models
in Computer Vision and Machine Learning. The Springer Series on Challenges in
Machine Learning. Springer (2018). See
http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further
information.
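The rule-based view of MLC can be illustrated with a toy predictor (our own construction, not the authors' algorithm): unlike in multiclass classification, several rule heads may fire for the same instance.

```python
def predict_labels(instance, rules):
    """Multi-label prediction from single-head rules: every rule whose
    body (a list of condition predicates) is satisfied contributes its
    head label, so one instance can receive several labels at once."""
    return {head for body, head in rules
            if all(cond(instance) for cond in body)}
```

Much of the difficulty the chapter discusses comes from going beyond this: learning rules with multi-label heads and modelling dependencies between labels.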
On Cognitive Preferences and the Plausibility of Rule-based Models
It is conventional wisdom in machine learning and data mining that logical
models such as rule sets are more interpretable than other models, and that
among such rule-based models, simpler models are more interpretable than more
complex ones. In this position paper, we question this latter assumption by
focusing on one particular aspect of interpretability, namely the plausibility
of models. Roughly speaking, we equate the plausibility of a model with the
likeliness that a user accepts it as an explanation for a prediction. In
particular, we argue that, all other things being equal, longer explanations
may be more convincing than shorter ones, and that the predominant bias for
shorter models, which is typically necessary for learning powerful
discriminative models, may not be suitable when it comes to user acceptance of
the learned models. To that end, we first recapitulate evidence for and against
this postulate, and then report the results of an evaluation in a
crowd-sourcing study based on about 3,000 judgments. The results do not reveal
a strong preference for simple rules, whereas we can observe a weak preference
for longer rules in some domains. We then relate these results to well-known
cognitive biases such as the conjunction fallacy, the representativeness heuristic,
and the recognition heuristic, and investigate their relation to rule length and
plausibility.
Comment: V4: Another rewrite of the section on interpretability to clarify the
focus on plausibility and the relation to interpretability, comprehensibility,
and justifiability.
Analysing the Moodle e-learning platform through subgroup discovery algorithms based on evolutionary fuzzy systems
Nowadays, there is an increasing use of learning management systems at
universities. This type of system is also known under other
terms such as course management systems or learning content management
systems. Specifically, these systems are e-learning platforms offering
different facilities for information sharing and communication between the
participants in the e-learning process.
This contribution presents an experimental study with several subgroup
discovery algorithms based on evolutionary fuzzy systems using data from a
web-based education system. The main objective of this contribution is to
extract unusual subgroups to describe possible relationships between the use
of the e-learning platform and marks obtained by the students. The results
obtained by the best performing algorithm, NMEEF-SD, are also presented.
The most representative results obtained by this algorithm are summarised in
order to obtain knowledge that allows teachers to take action to improve student performance.
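Evolutionary subgroup discovery algorithms such as NMEEF-SD optimise trade-offs between subgroup quality measures. One standard measure in this family is weighted relative accuracy (WRAcc); a minimal sketch, not taken from the paper:

```python
def wracc(subgroup_mask, target):
    """Weighted Relative Accuracy of a subgroup for a binary target:
    coverage of the subgroup times the difference between the target
    share inside the subgroup and in the whole population."""
    n = len(target)
    covered = [t for m, t in zip(subgroup_mask, target) if m]
    if not covered:
        return 0.0
    coverage = len(covered) / n
    return coverage * (sum(covered) / len(covered) - sum(target) / n)
```

A subgroup is "unusual" in the sense of the abstract when this value is clearly positive: its members pass (or fail) notably more often than the student population at large.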
Multi Objective Optimization of classification rules using Cultural Algorithms
Classification rule mining is the form of data mining most sought after by users, since rules represent a highly comprehensible form of knowledge. The rules are evaluated based on objective and subjective metrics. The user must be able to specify the properties of the rules, and the rules discovered must have some of these properties to render them useful. These properties may be conflicting; hence, the discovery of rules with specific properties is a multi-objective optimization problem. The Cultural Algorithm (CA), which derives from social structures, incorporates evolutionary systems and agents, and uses five knowledge sources (KSs) for the evolution process, is well suited to solving multi-objective optimization problems. In the current study, a cultural algorithm for classification rule mining is proposed for the multi-objective optimization of rules.
Searching for rules to detect defective modules: A subgroup discovery approach
Data mining methods in software engineering are becoming increasingly important as they
can support several aspects of the software development life-cycle such as quality. In this
work, we present a data mining approach to induce rules extracted from static software
metrics characterising fault-prone modules. Due to the special characteristics of defect
prediction data (imbalance, inconsistency, redundancy), not all classification algorithms
are capable of dealing with this task conveniently. To deal with these problems, Subgroup
Discovery (SD) algorithms can be used to find groups of statistically different data given a
property of interest. We propose EDER-SD (Evolutionary Decision Rules for Subgroup Discovery),
an SD algorithm based on evolutionary computation that induces rules describing
only fault-prone modules. The rules are a well-known model representation that can be
easily understood and applied by project managers and quality engineers. Thus, rules
can help them to develop software systems that can be justifiably trusted. Contrary to
other approaches in SD, our algorithm has the advantage of working with continuous variables
as the conditions of the rules are defined using intervals. We describe the rules
obtained by applying our algorithm to seven publicly available datasets from the PROMISE
repository showing that they are capable of characterising subgroups of fault-prone modules.
We also compare our results with three other well-known SD algorithms; the
EDER-SD algorithm performs well in most cases.
Funding: Ministerio de Educación y Ciencia, grants TIN2007-68084-C02-00 and TIN2010-21715-C02-0
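The interval conditions that distinguish EDER-SD from other SD approaches can be sketched as follows (a toy illustration with made-up metric names; not the authors' implementation):

```python
def rule_covers(module_metrics, conditions):
    """An interval-based rule over continuous software metrics: the rule
    fires for a module if every constrained metric falls inside its
    [lo, hi] interval; metrics absent from the rule are ignored."""
    return all(lo <= module_metrics[metric] <= hi
               for metric, (lo, hi) in conditions.items())
```

A rule like "loc in [100, 500] AND complexity in [10, 50] => fault-prone" needs no prior discretisation of the continuous attributes, which is the advantage the abstract highlights.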
Feature Selection and Molecular Classification of Cancer Using Genetic Programming
Despite important advances in microarray-based molecular classification of tumors, its application in clinical settings remains a formidable challenge. This is in part due to the limitation of current analysis programs in discovering robust biomarkers and developing classifiers with a practical set of genes. Genetic programming (GP) is a type of machine learning technique that uses an evolutionary algorithm to simulate natural selection as well as population dynamics, hence leading to simple and comprehensible classifiers. Here we applied GP to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. Analysis of thousands of GP classifiers generated for a prostate cancer data set revealed repetitive use of a set of highly discriminative feature genes, many of which are known to be disease associated. GP classifiers often comprise five or fewer genes and successfully predict cancer types and subtypes. More importantly, GP classifiers generated in one study are able to predict samples from an independent study, which may have used different microarray platforms. In addition, GP yielded classification accuracy better than or similar to conventional classification methods. Furthermore, the mathematical expression of GP classifiers provides insights into relationships between classifier genes. Taken together, our results demonstrate that GP may be valuable for generating effective classifiers containing a practical set of genes for diagnostic/prognostic cancer classification.
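The "mathematical integration of genes" can be pictured as evaluating a small expression tree over a sample's expression values. A hedged sketch with hypothetical gene names (the tree encoding is ours, not the paper's):

```python
import operator

def gp_classify(sample, expr, threshold=0.0):
    """Evaluate a GP classifier: an expression tree whose leaves are gene
    names (looked up in the sample's expression profile) and whose
    internal nodes are arithmetic operators; the value relative to a
    threshold gives the predicted class."""
    def ev(node):
        if isinstance(node, str):   # leaf: gene expression value
            return sample[node]
        op, left, right = node      # internal node: (operator, lhs, rhs)
        return op(ev(left), ev(right))
    return 1 if ev(expr) > threshold else 0
```

Because the classifier is a plain arithmetic expression over a handful of genes, it is both cheap to apply to new samples and open to inspection, which is the interpretability advantage the abstract emphasises.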