Robust subgroup discovery
We introduce the problem of robust subgroup discovery, i.e., finding a set of
interpretable descriptions of subsets that 1) stand out with respect to one or
more target attributes, 2) are statistically robust, and 3) are non-redundant. Many
attempts have been made to mine either locally robust subgroups or to tackle
the pattern explosion, but we are the first to address both challenges at the
same time from a global modelling perspective. First, we formulate the broad
model class of subgroup lists, i.e., ordered sets of subgroups, for univariate
and multivariate targets that can consist of nominal or numeric variables, and
that includes traditional top-1 subgroup discovery in its definition. This
novel model class allows us to formalise the problem of optimal robust subgroup
discovery using the Minimum Description Length (MDL) principle, where we resort
to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and
numeric targets, respectively. Second, as finding optimal subgroup lists is
NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists
and guarantees that the most significant subgroup found according to the MDL
criterion is added in each iteration, which is shown to be equivalent to a
Bayesian one-sample proportions, multinomial, or t-test between the subgroup
and dataset marginal target distributions plus a multiple hypothesis testing
penalty. We empirically show on 54 datasets that SSD++ outperforms previous
subgroup set discovery methods in terms of quality and subgroup list size.
Comment: For associated code, see https://github.com/HMProenca/RuleList;
submitted to Data Mining and Knowledge Discovery Journal.
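The notion of a subgroup list as an ordered set of subgroups can be illustrated with a minimal sketch. It assumes first-match semantics (each record is described by the first subgroup whose description it satisfies), which the abstract does not spell out. The function name, the toy records, and the two descriptions are hypothetical, and the sketch covers neither the MDL criterion nor the SSD++ search.

```python
def apply_subgroup_list(rows, subgroup_list, target):
    """Summarise a numeric target per subgroup of an ordered subgroup list.

    subgroup_list is a list of (name, predicate) pairs. Assumption: a row
    belongs to the first subgroup whose predicate it satisfies.
    """
    covered = [[] for _ in subgroup_list]
    uncovered = []
    for row in rows:
        for values, (_, matches) in zip(covered, subgroup_list):
            if matches(row):
                values.append(row[target])
                break  # first match wins; later subgroups never see this row
        else:
            uncovered.append(row[target])
    summary = [
        (name, len(values), sum(values) / len(values) if values else None)
        for (name, _), values in zip(subgroup_list, covered)
    ]
    return summary, len(uncovered)


# Toy records, invented for illustration only.
rows = [
    {"age": 65, "smoker": True, "cost": 90},
    {"age": 30, "smoker": True, "cost": 70},
    {"age": 70, "smoker": False, "cost": 60},
    {"age": 62, "smoker": False, "cost": 64},
    {"age": 25, "smoker": False, "cost": 20},
    {"age": 40, "smoker": False, "cost": 24},
]
subgroup_list = [
    ("smoker = yes", lambda r: r["smoker"]),
    ("age >= 60", lambda r: r["age"] >= 60),
]
summary, n_uncovered = apply_subgroup_list(rows, subgroup_list, "cost")
dataset_mean = sum(r["cost"] for r in rows) / len(rows)
```

Because the list is ordered, the 65-year-old smoker counts only towards the first subgroup. Each subgroup mean can then be compared with dataset_mean, in the spirit of the comparison between subgroup and dataset marginal target distributions mentioned in the abstract.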
Analyzing Granger causality in climate data with time series classification methods
Attribution studies in climate science aim to ascertain scientifically the influence of climatic variations on natural or anthropogenic factors. Many of these studies adopt the concept of Granger causality to infer statistical cause-effect relationships, using traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested.
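For readers unfamiliar with the traditional autoregressive approach that the abstract takes as its baseline, the following is a minimal sketch of a Granger-style test. It compares an autoregressive model of the effect series with and without lagged values of the candidate cause, using an F statistic. It uses only the standard library, the synthetic series and all function names are assumptions made for illustration, and it does not implement the time series classification methods studied in the article.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(features, y):
    """Residual sum of squares of an ordinary least squares fit of y on features."""
    k = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    xty = [sum(f[i] * t for f, t in zip(features, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, f))) ** 2 for f, t in zip(features, y))


def granger_f(cause, effect, lags=1):
    """F statistic: does adding lags of `cause` improve an AR model of `effect`?"""
    n = len(effect)
    target = effect[lags:]
    restricted = [[1.0] + [effect[t - k] for k in range(1, lags + 1)] for t in range(lags, n)]
    full = [row + [cause[t - k] for k in range(1, lags + 1)]
            for row, t in zip(restricted, range(lags, n))]
    rss_r, rss_f = rss(restricted, target), rss(full, target)
    dof = len(target) - (2 * lags + 1)  # observations minus parameters of the full model
    return ((rss_r - rss_f) / lags) / (rss_f / dof)


# Synthetic data (an assumption for illustration): x drives y with a one-step delay.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(400)]
y = [0.0]
for t in range(1, 400):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 0.5))

f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
```

On data generated this way, f_x_to_y should be far larger than f_y_to_x, which is the asymmetry a Granger test looks for. A full analysis would convert the statistic to a p-value and choose the lag order with care.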
Exceptional Model Mining
Finding subsets of a dataset that somehow deviate from the norm,
i.e. where something interesting is going on, is a classical Data Mining
task. In traditional local pattern mining methods, such deviations are
measured in terms of a relatively high occurrence (frequent itemset
mining), or an unusual distribution for one designated target attribute
(subgroup discovery). These, however, do not encompass all forms of
"interesting". To capture a more general notion of interestingness in
subsets of a dataset, we develop Exceptional Model Mining (EMM). This
is a supervised local pattern mining framework, where several target
attributes are selected, and a model over these attributes is chosen to
be the target concept. Then, subsets are sought on which this model is
substantially different from the model on the whole dataset. For
instance, we can find parts of the data where two target attributes have
an unusual correlation, a classifier has a deviating predictive
performance, or a Bayesian network fitted on several target attributes
has an exceptional structure. We will discuss some real-world
applications of EMM instances, including using the Bayesian network
model to identify meteorological conditions under which food chains are
displaced, and using a regression model to find the subset of households
in the Chinese province of Hunan that do not follow the general
economic law of demand.
This research is supported by the Netherlands Organisation for Scientific Research (NWO) under project number 612.065.822 (Exceptional Model Mining).
Algorithms and the Foundations of Software technology
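The correlation instance mentioned in the abstract can be sketched as follows: a subgroup is scored by how far the correlation between two target attributes inside the subgroup deviates from the correlation on the whole dataset. This is a minimal sketch under assumptions. The quality measure used here (absolute difference of Pearson coefficients), the function names, and the toy data are all hypothetical and are not taken from the EMM work itself.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def correlation_quality(rows, description, target_a, target_b):
    """Return (quality, inside, whole) for the subgroup selected by `description`.

    Assumed quality measure: |correlation inside subgroup - correlation on all rows|.
    """
    subgroup = [r for r in rows if description(r)]
    whole = pearson([r[target_a] for r in rows], [r[target_b] for r in rows])
    inside = pearson([r[target_a] for r in subgroup], [r[target_b] for r in subgroup])
    return abs(inside - whole), inside, whole


# Toy data, invented for illustration: demand falls with price in region "A"
# but rises with price in the smaller region "B".
rows = [
    {"region": "A", "price": 1.0, "demand": 9.0},
    {"region": "A", "price": 2.0, "demand": 8.0},
    {"region": "A", "price": 3.0, "demand": 7.1},
    {"region": "A", "price": 4.0, "demand": 6.0},
    {"region": "A", "price": 5.0, "demand": 4.9},
    {"region": "A", "price": 6.0, "demand": 4.0},
    {"region": "B", "price": 2.0, "demand": 4.5},
    {"region": "B", "price": 3.0, "demand": 5.6},
    {"region": "B", "price": 4.0, "demand": 6.4},
]
quality, inside, whole = correlation_quality(
    rows, lambda r: r["region"] == "B", "price", "demand"
)
```

Here the whole dataset shows a negative price-demand correlation while the subgroup region = "B" shows a strongly positive one, so the subgroup receives a high score. A practical quality measure would usually also account for subgroup size, which this sketch ignores.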