Model and Algorithm Selection in Statistical Learning and Optimization
Modern data-driven statistical techniques, e.g., non-linear classification and
regression machine learning methods, play an increasingly important role in applied data analysis
and quantitative research. For real-world problems we do not know
a priori which methods will work best. Furthermore, most of the available models depend on
so-called hyper- or control parameters, which can drastically influence their performance.
This leads to a vast space of potential models, which cannot be explored exhaustively.
Modern optimization techniques, often either evolutionary or model-based, are employed to speed up
this search.
A very similar problem occurs in continuous and discrete optimization and, in general,
in many other areas where problem instances are solved by algorithmic approaches: Many competing
techniques exist, some of them heavily parametrized. Again, little is known
about how to make the correct choice for a given application.
These general problems are called algorithm selection and algorithm configuration. Instead of relying on
tedious, manual trial-and-error, one should rather employ available computational power
in a methodical fashion to obtain an appropriate algorithmic choice, while supporting this
process with machine-learning techniques to discover and exploit as much of the
search space structure as possible.
In this cumulative dissertation I summarize nine papers that deal with the problem of model and
algorithm selection in the areas of machine learning and optimization. Issues in benchmarking,
resampling, efficient model tuning, feature selection and automatic algorithm selection are addressed and
solved using modern techniques. I apply these methods to tasks from engineering, music data analysis
and black-box optimization.
The dissertation concludes with a summary of my published R packages for such tasks, specifically
discussing two packages for parallelization on high-performance computing clusters and for parallel
statistical experiments.
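To make the tuning setting above concrete, here is a minimal R sketch, assuming the mlr and e1071
packages: it tunes the cost parameter of an SVM by random search. The parameter bounds, iteration
count, and the iris task are illustrative choices, not values from the dissertation.

    library(mlr)

    # Search cost on a log2 scale; the bounds are arbitrary for illustration.
    ps <- makeParamSet(
      makeNumericParam("cost", lower = -5, upper = 5,
                       trafo = function(x) 2^x)
    )
    ctrl <- makeTuneControlRandom(maxit = 20L)   # 20 random configurations
    res <- tuneParams("classif.svm", task = iris.task,
                      resampling = cv3,          # 3-fold cross-validation
                      par.set = ps, control = ctrl)
    res$x  # best hyperparameter setting found
    res$y  # its cross-validated misclassification error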
OpenML: networked science in machine learning
Many sciences have made significant breakthroughs by adopting online tools
that help organize, structure and mine information that is too detailed to be
printed in journals. In this paper, we introduce OpenML, a place for machine
learning researchers to share and organize data in fine detail, so that they
can work more effectively, be more visible, and collaborate with others to
tackle harder problems. We discuss how OpenML relates to other examples of
networked science and what benefits it brings for machine learning research,
individual scientists, as well as students and practitioners.
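As a sketch of how the platform is used programmatically, the following R code, assuming the OpenML
and mlr packages and an API key configured via setOMLConfig(), downloads a task and evaluates a
learner on it; the task id is a hypothetical example.

    library(OpenML)
    library(mlr)

    task <- getOMLTask(task.id = 59)      # hypothetical example task id
    lrn  <- makeLearner("classif.rpart")  # any mlr classification learner
    run  <- runTaskMlr(task, lrn)         # evaluates on the task's predefined splits
    run$bmr                               # resampled performance results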
Multilabel Classification with R Package mlr
We implemented several multilabel classification algorithms in the machine
learning package mlr. The implemented methods are binary relevance, classifier
chains, nested stacking, dependent binary relevance and stacking, which can be
used with any base learner that is accessible in mlr. Moreover, there is access
to the multilabel classification versions of randomForestSRC and rFerns. All
these methods can easily be compared using the multilabel performance
measures and resampling methods implemented in the standardized mlr framework.
In a benchmark experiment with several multilabel datasets, the performance of
the different methods is evaluated.
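As an illustration, here is a minimal R sketch of the binary relevance approach in mlr, using the
yeast.task multilabel task shipped with the package; rpart as base learner is an arbitrary choice.

    library(mlr)

    lrn  <- makeLearner("classif.rpart", predict.type = "prob")
    brl  <- makeMultilabelBinaryRelevanceWrapper(lrn)  # one binary model per label
    mod  <- train(brl, yeast.task)
    pred <- predict(mod, yeast.task)
    # Compare methods via multilabel measures, e.g. Hamming loss and F1:
    performance(pred, measures = list(multilabel.hamloss, multilabel.f1))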
Frequency estimation by DFT interpolation: a comparison of methods
This article comments on a frequency estimator proposed in [6] and shows empirically that it exhibits a much larger mean squared error than the well-known frequency estimator of [8]. It is demonstrated that the performance can be greatly improved by using a heuristic adjustment [2]. Furthermore, references to two modern techniques are given, both of which nearly attain the Cramér-Rao bound for this estimation problem.
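For context, a common DFT-interpolation idea, not necessarily one of the estimators compared here,
fits a parabola through the magnitude spectrum around the peak bin to obtain a sub-bin frequency
estimate; a minimal R sketch with made-up signal parameters:

    n  <- 256
    f0 <- 37.3                                 # true frequency in DFT bins
    x  <- exp(2i * pi * f0 * (0:(n - 1)) / n)  # noiseless complex sinusoid
    X  <- abs(fft(x))
    k  <- which.max(X)                         # peak bin (1-based index)
    a <- X[k - 1]; b <- X[k]; c <- X[k + 1]
    delta <- 0.5 * (a - c) / (a - 2 * b + c)   # sub-bin offset in (-0.5, 0.5)
    f_hat <- (k - 1) + delta                   # estimate in bins, close to 37.3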
Decomposing Global Feature Effects Based on Feature Interactions
Global feature effect methods, such as partial dependence plots, provide an
intelligible visualization of the expected marginal feature effect. However,
such global feature effect methods can be misleading, as they do not represent
local feature effects of single observations well when feature interactions are
present. We formally introduce generalized additive decomposition of global
effects (GADGET), a new framework based on recursive partitioning that
finds interpretable regions in the feature space in which the
interaction-related heterogeneity of local feature effects is minimized. We
provide a mathematical foundation of the framework and show that it is
applicable to the most popular methods to visualize marginal feature effects,
namely partial dependence, accumulated local effects, and Shapley additive
explanations (SHAP) dependence. Furthermore, we introduce a new
permutation-based interaction test to detect significant feature interactions
that is applicable to any feature effect method that fits into our proposed
framework. We empirically evaluate the theoretical characteristics of the
proposed methods based on various feature effect methods in different
experimental settings. Moreover, we apply the proposed methodology to two
real-world examples to showcase its usefulness.
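To make the regional objective concrete, here is a toy R sketch, not the authors' implementation
and with made-up data: it searches for the single split of an interacting feature that minimizes
the heterogeneity of centered ICE curves, the partial-dependence analogue of the criterion
described above.

    library(randomForest)

    set.seed(1)
    n  <- 500
    x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
    y  <- x1 + 3 * x1 * (x2 > 0.5) + rnorm(n, sd = 0.1)  # x1:x2 interaction
    dat <- data.frame(x1, x2, x3, y)
    fit <- randomForest(y ~ ., data = dat)

    grid <- seq(0, 1, length.out = 20)  # grid for the feature of interest x1
    # ICE matrix: one row per observation, one column per grid value
    ice   <- sapply(grid, function(g) { d <- dat; d$x1 <- g; predict(fit, d) })
    ice_c <- ice - rowMeans(ice)        # center each curve

    # Heterogeneity of a region: variance of centered curves, summed over grid
    het <- function(idx) sum(apply(ice_c[idx, , drop = FALSE], 2, var))

    # Evaluate candidate splits on x2; the best should sit near the true 0.5
    splits <- quantile(dat$x2, probs = seq(0.1, 0.9, by = 0.1))
    risk   <- sapply(splits, function(s) {
      left <- dat$x2 <= s
      het(which(left)) + het(which(!left))
    })
    splits[which.min(risk)]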