53 research outputs found
Searching for rules to detect defective modules: A subgroup discovery approach
Data mining methods in software engineering are becoming increasingly important as they
can support several aspects of the software development life-cycle such as quality. In this
work, we present a data mining approach to induce rules extracted from static software
metrics characterising fault-prone modules. Due to the special characteristics of the defect
prediction data (imbalanced, inconsistency, redundancy) not all classification algorithms
are capable of dealing with this task conveniently. To deal with these problems, Subgroup
Discovery (SD) algorithms can be used to find groups of statistically different data given a
property of interest. We propose EDER-SD (Evolutionary Decision Rules for Subgroup Discovery),
a SD algorithm based on evolutionary computation that induces rules describing
only fault-prone modules. The rules are a well-known model representation that can be
easily understood and applied by project managers and quality engineers. Thus, rules
can help them to develop software systems that can be justifiably trusted. Contrary to
other approaches in SD, our algorithm has the advantage of working with continuous variables
as the conditions of the rules are defined using intervals. We describe the rules
obtained by applying our algorithm to seven publicly available datasets from the PROMISE
repository showing that they are capable of characterising subgroups of fault-prone modules.
We also compare our results with three other well known SD algorithms and the
EDER-SD algorithm performs well in most cases.Ministerio de Educación y Ciencia TIN2007-68084-C02-00Ministerio de Educación y Ciencia TIN2010-21715-C02-0
Robust subgroup discovery
We introduce the problem of robust subgroup discovery, i.e., finding a set of
interpretable descriptions of subsets that 1) stand out with respect to one or
more target attributes, 2) are statistically robust, and 3) non-redundant. Many
attempts have been made to mine either locally robust subgroups or to tackle
the pattern explosion, but we are the first to address both challenges at the
same time from a global modelling perspective. First, we formulate the broad
model class of subgroup lists, i.e., ordered sets of subgroups, for univariate
and multivariate targets that can consist of nominal or numeric variables, and
that includes traditional top-1 subgroup discovery in its definition. This
novel model class allows us to formalise the problem of optimal robust subgroup
discovery using the Minimum Description Length (MDL) principle, where we resort
to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and
numeric targets, respectively. Second, as finding optimal subgroup lists is
NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists
and guarantees that the most significant subgroup found according to the MDL
criterion is added in each iteration, which is shown to be equivalent to a
Bayesian one-sample proportions, multinomial, or t-test between the subgroup
and dataset marginal target distributions plus a multiple hypothesis testing
penalty. We empirically show on 54 datasets that SSD++ outperforms previous
subgroup set discovery methods in terms of quality and subgroup list size.Comment: For associated code, see https://github.com/HMProenca/RuleList ;
submitted to Data Mining and Knowledge Discovery Journa
Knowledge-based incremental induction of clinical algorithms
The current approaches for the induction of medical procedural knowledge suffer from several drawbacks: the structures produced may not be explicit medical structures, they are only based on statistical measures that do not necessarily respect medical criteria which can be essential to guarantee medical correct structures, or they are not prepared to deal with the incremental arrival of new data.
In this thesis we propose a methodology to automatically induce medically correct clinical algorithms (CAs) from hospital databases. These CAs are represented according to the SDA knowledge model. The methodology considers relevant background knowledge and it is able to work in an incremental way.
The methodology has been tested in the domains of hypertension, diabetes mellitus and the comborbidity of both diseases. As a result, we propose a repository of background knowledge for these pathologies and provide the SDA diagrams obtained. Later analyses show that the results are medically correct and comprehensible when validated with health care professionals
Mining Predictive Patterns and Extension to Multivariate Temporal Data
An important goal of knowledge discovery is the search for patterns in the data that can help explaining its underlying structure. To be practically useful, the discovered patterns should be novel (unexpected) and easy to understand by humans. In this thesis, we study the problem of mining patterns (defining subpopulations of data instances) that are important for predicting and explaining a specific outcome variable. An example is the task of identifying groups of patients that respond better to a certain treatment than the rest of the patients.
We propose and present efficient methods for mining predictive patterns for both atemporal and temporal (time series) data. Our first method relies on frequent pattern mining to explore the search space. It applies a novel evaluation technique for extracting a small set of frequent patterns that are highly predictive and have low redundancy. We show the benefits of this method on several synthetic and public datasets.
Our temporal pattern mining method works on complex multivariate temporal data, such as electronic health records, for the event detection task. It first converts time series into time-interval sequences of temporal abstractions and then mines temporal patterns backwards in time, starting from patterns related to the most recent observations. We show the benefits of our temporal pattern mining method on two real-world clinical tasks
Analyzing Granger causality in climate data with time series classification methods
Attribution studies in climate science aim for scientifically ascertaining the influence of climatic variations on natural or anthropogenic factors. Many of those studies adopt the concept of Granger causality to infer statistical cause-effect relationships, while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested
- …