    Hyperparameter Importance Across Datasets

    With the advent of automated machine learning, automated hyperparameter optimization methods are by now routinely used in data mining. However, this progress is not yet matched by equal progress on automatic analyses that yield information beyond performance-optimizing hyperparameter settings. In this work, we aim to answer the following two questions: Given an algorithm, what are generally its most important hyperparameters, and what are typically good values for these? We present methodology and a framework to answer these questions based on meta-learning across many datasets. We apply this methodology using the experimental meta-data available on OpenML to determine the most important hyperparameters of support vector machines, random forests and AdaBoost, and to infer priors for all their hyperparameters. The results, obtained fully automatically, provide a quantitative basis to focus efforts in both manual algorithm design and in automated hyperparameter optimization. The conducted experiments confirm that the hyperparameters selected by the proposed method are indeed the most important ones and that the obtained priors also lead to statistically significant improvements in hyperparameter optimization. Comment: © 2018. Copyright is held by the owner/author(s). Publication rights licensed to ACM. This is the author's version of the work. It is posted here for your personal use, not for redistribution. The definitive Version of Record was published in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Minin
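
    The idea of ranking hyperparameters by importance across datasets can be illustrated with a small variance decomposition. The sketch below is not the paper's method, whose estimator the abstract does not specify. It assumes every configuration on a full grid has been evaluated on each dataset and measures the share of performance variance explained by each hyperparameter alone. The grid, the `toy_accuracy` function and all names are hypothetical.

```python
from itertools import product
from statistics import mean, pvariance


def main_effect_importance(scores, grid):
    """Share of performance variance explained by each hyperparameter alone.

    scores: dict mapping a configuration tuple (ordered like grid) to a score.
    grid:   dict mapping hyperparameter name -> list of values (full grid).
    """
    total_var = pvariance(list(scores.values()))
    importance = {}
    for i, name in enumerate(grid):
        # Marginal performance: average over all other hyperparameters.
        marginals = [
            mean(s for cfg, s in scores.items() if cfg[i] == value)
            for value in grid[name]
        ]
        importance[name] = pvariance(marginals) / total_var if total_var > 0 else 0.0
    return importance


def importance_across_datasets(per_dataset_scores, grid):
    """Average the per-dataset importance of each hyperparameter."""
    per_dataset = [main_effect_importance(s, grid) for s in per_dataset_scores]
    return {name: mean(d[name] for d in per_dataset) for name in grid}


# Hypothetical toy data: accuracy depends strongly on gamma, weakly on C.
grid = {"gamma": [0.01, 0.1, 1.0], "C": [1, 10]}


def toy_accuracy(gamma, C, shift):
    return 0.9 - (gamma - shift) ** 2 + 0.001 * C


datasets = [
    {cfg: toy_accuracy(*cfg, shift) for cfg in product(*grid.values())}
    for shift in (0.1, 0.5)
]
ranking = importance_across_datasets(datasets, grid)
print(sorted(ranking.items(), key=lambda kv: -kv[1]))
```

    On this toy data `gamma` dominates the ranking. Real meta-data is rarely a full grid, so in practice the marginals would have to be estimated from a fitted performance model.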

    Meta-Learning for Symbolic Hyperparameter Defaults

    Hyperparameter optimization in machine learning (ML) deals with the problem of empirically learning an optimal algorithm configuration from data, usually formulated as a black-box optimization problem. In this work, we propose a zero-shot method to meta-learn symbolic default hyperparameter configurations that are expressed in terms of the properties of the dataset. This enables a much faster, but still data-dependent, configuration of the ML algorithm, compared to standard hyperparameter optimization approaches. In the past, symbolic and static default values have usually been obtained as hand-crafted heuristics. We propose an approach of learning such symbolic configurations as formulas of dataset properties from a large set of prior evaluations on multiple datasets by optimizing over a grammar of expressions using an evolutionary algorithm. We evaluate our method on surrogate empirical performance models as well as on real data across 6 ML algorithms on more than 100 datasets and demonstrate that our method indeed finds viable symbolic defaults. Comment: Pieter Gijsbers and Florian Pfisterer contributed equally to the paper. V1: Two page GECCO poster paper accepted at GECCO 2021. V2: The original full length paper (8 pages) with appendi
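
    A symbolic default is a formula of dataset properties rather than a fixed number. The sketch below illustrates how such formulas can be scored against a performance model over several datasets. It replaces the paper's grammar and evolutionary search with an exhaustive comparison of four hand-written candidates. The hyperparameter, the candidate formulas, the `toy_surrogate` model and the dataset sizes are all hypothetical.

```python
import math
from statistics import mean

# Hypothetical candidate defaults for one hyperparameter, written as formulas
# of the dataset properties n (number of rows) and p (number of features).
CANDIDATES = {
    "sqrt(p)": lambda n, p: math.sqrt(p),
    "log2(p)": lambda n, p: math.log2(p),
    "p / 3": lambda n, p: p / 3,
    "16 (static)": lambda n, p: 16.0,
}


def toy_surrogate(value, n, p):
    """Made-up performance model whose optimum scales with sqrt(p)."""
    optimum = 1.2 * math.sqrt(p)
    return -abs(math.log(value) - math.log(optimum))


def mean_score(formula, datasets):
    """Average surrogate performance of one symbolic default over datasets."""
    return mean(toy_surrogate(formula(n, p), n, p) for n, p in datasets)


# Hypothetical meta-data: (n, p) for three datasets.
datasets = [(500, 16), (2000, 100), (10000, 1000)]
best = max(CANDIDATES, key=lambda name: mean_score(CANDIDATES[name], datasets))
print(best)
```

    The static value does well only on datasets where it happens to be close to the optimum, while a formula that tracks the dataset properties stays competitive on all of them.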

    Automated Machine Learning - Bayesian Optimization, Meta-Learning & Applications

    Automating machine learning by providing techniques that autonomously find the best algorithm, hyperparameter configuration and preprocessing is helpful for both researchers and practitioners. Therefore, it is not surprising that automated machine learning has become a very interesting field of research. Bayesian optimization has proven to be a very successful tool for automated machine learning. In the first part of the thesis we present different approaches to improve Bayesian optimization by means of transfer learning. We present three different ways of considering meta-knowledge in Bayesian optimization, i.e., search space pruning, initialization and transfer surrogate models. Finally, we present a general framework for Bayesian optimization combined with meta-learning and conduct a comparison among existing work on two different meta-data sets. One conclusion is that the meta-target driven approaches in particular provide better results. Choosing algorithm configurations based on the improvement on the meta-knowledge combined with the expected improvement yields the best results. The second part of this thesis is more application-oriented. Bayesian optimization is applied to large data sets and used as a tool to participate in machine learning challenges. We compare its autonomous performance and its performance in combination with a human expert. At two ECML-PKDD Discovery Challenges, we are able to show that automated machine learning outperforms human machine learning experts. Finally, we present an approach that automates the process of creating an ensemble of several layers, different algorithms and hyperparameter configurations. These kinds of ensembles are jokingly called Frankenstein ensembles and have proved their benefit on diverse data sets in many machine learning challenges.
We compare our approach Automatic Frankensteining with the current state of the art for automated machine learning on 80 different data sets and can show that it outperforms them on the majority using the same training time. Furthermore, we compare Automatic Frankensteining on a large-scale data set to more than 3,500 machine learning expert teams and are able to outperform more than 3,000 of them within 12 CPU hours.
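
    The expected improvement criterion mentioned in the abstract has a closed form when the surrogate model predicts a Gaussian. The sketch below shows that standard formula for a minimization problem. It does not include the thesis's meta-learning extensions, and the candidate configurations and their predicted means and standard deviations are hypothetical.

```python
import math


def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement (minimization) for a prediction N(mu, sigma^2)."""
    improvement = best_so_far - mu
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf


# Hypothetical surrogate predictions: name -> (predicted error, uncertainty).
candidates = {
    "config_a": (0.20, 0.01),
    "config_b": (0.25, 0.10),
    "config_c": (0.30, 0.02),
}
best_so_far = 0.22  # lowest error observed so far (assumed)
next_config = max(
    candidates,
    key=lambda name: expected_improvement(*candidates[name], best_so_far),
)
print(next_config)
```

    Here `config_b` is selected even though `config_a` has the lower predicted error, because its larger uncertainty leaves more room for a big improvement. That trade-off between exploitation and exploration is what the criterion is designed to make.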

    Transfer Learning for Multi-surrogate-model Optimization

    Surrogate-model-based optimization is widely used to solve black-box optimization problems if the evaluation of a target system is expensive. However, when the optimization budget is limited to a single or several evaluations, surrogate-model-based optimization may not perform well due to the lack of knowledge about the search space. In this case, transfer learning helps to get a good optimization result due to the usage of experience from the previous optimization runs. And if the budget is not strictly limited, transfer learning is capable of improving the final results of black-box optimization. The recent work in surrogate-model-based optimization showed that using multiple surrogates (i.e., applying multi-surrogate-model optimization) can be extremely efficient in complex search spaces. The main assumption of this thesis suggests that transfer learning can further improve the quality of multi-surrogate-model optimization. However, to the best of our knowledge, there exist no approaches to transfer learning in the multi-surrogate-model context yet. In this thesis, we propose an approach to transfer learning for multi-surrogate-model optimization. It encompasses an improved method of defining the expediency of knowledge transfer, adapted multi-surrogate-model recommendation, multi-task learning parameter tuning, and few-shot learning techniques. We evaluated the proposed approach with a set of algorithm selection and parameter setting problems, comprising mathematical functions optimization and the traveling salesman problem, as well as random forest hyperparameter tuning over OpenML datasets. 
The evaluation shows that the proposed approach helps to improve the quality delivered by multi-surrogate-model optimization and ensures getting good optimization results even under a strictly limited budget.

    Automatic Selection of MapReduce Machine Learning Algorithms: A Model Building Approach

    As the amount of information available for data mining grows larger, the amount of time needed to train models on those huge volumes of data also grows longer. Techniques such as sub-sampling and parallel algorithms have been employed to deal with this growth. Some studies have shown that sub-sampling can have adverse effects on the quality of models produced, and the degree to which it affects different types of learning algorithms varies. Parallel algorithms perform well when enough computing resources (e.g. cores, memory) are available; however, for a cluster of limited size, the growth in data will still cause an unacceptable growth in model training time. In addition to the data size mitigation problem, picking which algorithms are well suited to a particular dataset can be a challenge. While some studies have looked at selection criteria for picking a learning algorithm based on the properties of the dataset, the additional complexity of parallel learners or possible run time limitations has not been considered. This study explores run time and model quality results of various techniques for dealing with large datasets, including using different numbers of compute cores, sub-sampling the datasets, and exploiting the iterative anytime nature of the training algorithms. The algorithms were studied using MapReduce implementations of four supervised learning algorithms (logistic regression, tree induction, bagged trees, and boosted stumps) for binary classification using probabilistic models. Evaluation of these techniques was done using a modified form of learning curves which has a temporal component. Finally, the data collected was used to train a set of models to predict which type of parallel learner best suits a particular dataset, given run time limitations and the number of compute cores to be used. The predictions of those models were then compared to the actual results of running the algorithms on the datasets they were attempting to predict.
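
    Learning curves with a temporal component make it possible to ask which learner is best within a given run time limit. The sketch below looks up that answer from recorded curves. It is only an illustration: the study trains models to predict the best learner, whereas this code reads it off measurements, and the curve values and function names are hypothetical.

```python
def quality_at(curve, budget_seconds):
    """Best model quality reached within the budget, or None if none finished."""
    reached = [quality for seconds, quality in curve if seconds <= budget_seconds]
    return max(reached) if reached else None


def best_learner(curves, budget_seconds):
    """Learner with the highest quality among those that finished in time."""
    scored = {name: quality_at(curve, budget_seconds) for name, curve in curves.items()}
    scored = {name: q for name, q in scored.items() if q is not None}
    return max(scored, key=scored.get) if scored else None


# Hypothetical time-indexed learning curves: (training seconds, model quality).
curves = {
    "logistic_regression": [(5, 0.70), (20, 0.74), (60, 0.75)],
    "bagged_trees": [(30, 0.72), (120, 0.80), (300, 0.83)],
    "boosted_stumps": [(15, 0.68), (90, 0.78), (240, 0.81)],
}

for budget in (2, 20, 120):
    print(budget, best_learner(curves, budget))
```

    With these made-up curves the fast learner wins under a tight limit and the slower ensemble overtakes it once enough time is available, which is the kind of crossover a time-aware selection model has to capture.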