The performance of modern machine learning methods depends heavily on their
hyperparameter configurations. One simple way of selecting a configuration is
to use default settings, often proposed along with the publication and
implementation of a new algorithm. Those default values are usually chosen in
an ad-hoc manner to work well enough on a wide variety of datasets. To address
this problem, different automatic hyperparameter configuration algorithms have
been proposed, which select an optimal configuration per dataset. This
principled approach usually improves performance, but adds algorithmic
complexity and computational cost to the training procedure. As an
alternative, we propose learning a set of complementary default values
from a large database of prior empirical results. Selecting an appropriate
configuration on a new dataset then requires only a simple, efficient and
embarrassingly parallel search over this set. We demonstrate the effectiveness
and efficiency of the proposed approach in comparison to random search and
Bayesian optimization.

Bischl, Bernd

Müller, Andreas

Pfisterer, Florian

Probst, Philipp

van Rijn, Jan N.


Learning multiple defaults for machine learning algorithms


Learning multiple defaults for machine learning algorithms
Pfisterer, F.; Rijn, J.N. van; Probst, P.; Müller, A.C.; Bischl, B.; Chicano, F.

Citation: Pfisterer, F., Rijn, J. N. van, Probst, P., Müller, A. C., & Bischl, B. (2021). Learning multiple defaults for machine learning algorithms. GECCO '21: Proceedings of the Genetic and Evolutionary Computation Conference Companion, 241-242. doi:10.1145/3449726.3459523
Version: Publisher's Version
License: Licensed under Article 25fa Copyright Act/Law (Amendment Taverne)
Downloaded from: https://hdl.handle.net/1887/3277256
Note: To cite this publication please use the final published version (if applicable).

Learning Multiple Defaults for Machine Learning Algorithms

Florian Pfisterer, Ludwig-Maximilians-University, Munich, Germany
Jan N. van Rijn, LIACS, Leiden University, Leiden, Netherlands
Philipp Probst, Benediktbeuern, Germany
Andreas C. Müller, Microsoft, Sunnyvale, U.S.A.
Bernd Bischl, Ludwig-Maximilians-University, Munich, Germany

ABSTRACT
Modern machine learning methods highly depend on their hyperparameter configurations for optimal performance. A widely used approach to selecting a configuration is using default settings, often proposed along with the publication of a new algorithm. Those default values are usually chosen in an ad-hoc manner to work on a wide variety of datasets. Different automatic hyperparameter configuration algorithms that select an optimal configuration per dataset have been proposed. This principled approach usually improves performance but adds algorithmic complexity and computational cost to the training procedure. Despite its importance, tuning is therefore often skipped in applications because of the additional run time, complexity, and experimental design questions, and the learner is instead applied with its defaults. As an alternative, we propose and study using a set of complementary default values, learned from a large database of prior empirical results.
Selecting an appropriate configuration on a new dataset then requires only a simple, efficient, and embarrassingly parallel search over this set. To demonstrate the effectiveness and efficiency of the approach, we compare learned sets of configurations to random search and Bayesian optimization. We show that sets of defaults can improve performance while being easy to deploy in comparison to more complex methods.

CCS CONCEPTS
• Computing methodologies → Supervised learning by classification.

KEYWORDS
AutoML, Hyperparameter Optimization, Metalearning

ACM Reference Format:
Florian Pfisterer, Jan N. van Rijn, Philipp Probst, Andreas C. Müller, and Bernd Bischl. 2021. Learning Multiple Defaults for Machine Learning Algorithms. In 2021 Genetic and Evolutionary Computation Conference Companion (GECCO ’21 Companion), July 10–14, 2021, Lille, France. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3449726.3459523

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
GECCO ’21 Companion, July 10–14, 2021, Lille, France
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8351-6/21/07.

1 INTRODUCTION
Hyperparameter settings for machine learning algorithms are often optimized via hyperparameter optimization, e.g. using random search, Bayesian optimization, or meta-learning. While not tuning parameters at all can be detrimental, defaults provide a simple and fast fallback that is easy to implement and use while providing strong anytime performance. We describe a general, learner-agnostic procedure to (meta-)learn not one, but a (sequential) list of default configurations that complement each other.
These sets are ordered so that the earlier elements in the sequence provide greater benefits on average (see footnote 1). While traditional optimization methods are to be preferred when time and expertise are available, we conjecture that sets of defaults work well across a large variety of datasets. We leverage a large set of historic performance results of prior experiments that are available on OpenML [4]. Several approaches attempt to combine the paradigms of meta-learning and hyperparameter optimization, for example by warm-starting hyperparameter optimization methods [2, 5]. While all these methods yield convincing results, they are by no means easy to deploy. Similar to our work, Wistuba et al. [6] learn a set of defaults from a fixed grid of evaluations. This requires hyperparameters to be evaluated on a grid across several datasets, and the size of that grid scales exponentially with hyperparameter dimensionality. This is practically infeasible when there are large numbers of hyperparameters.

2 METHOD
Consider a target variable $y$, a feature vector $\boldsymbol{x}$, and an unknown joint distribution $\mathcal{P}$ on $(\boldsymbol{x}, y)$, from which we have sampled an i.i.d. dataset $\mathcal{D}$. A machine learning algorithm $\mathcal{A}_{\boldsymbol{\lambda}}(\mathcal{D})$ learns a prediction model $f(\boldsymbol{x})$. $\mathcal{A}_{\boldsymbol{\lambda}}$ is controlled by a multi-dimensional hyperparameter configuration $\boldsymbol{\lambda} \in \Lambda$ of length $D$, where $\Lambda_j$ is usually a bounded real or integer interval, or a finite set of categorical values. We are interested in estimating the expected risk of the inducing algorithm w.r.t. $\boldsymbol{\lambda}$ on new data, also sampled from $\mathcal{P}$:

$$R_{\mathcal{P}}(\boldsymbol{\lambda}) = E_{\mathcal{P}}\big(L(y, \mathcal{A}_{\boldsymbol{\lambda}}(\mathcal{D})(\boldsymbol{x}))\big),$$

where the expectation is taken over all datasets $\mathcal{D}$ from $\mathcal{P}$ and the test observation $(\boldsymbol{x}, y)$. Thus, $R_{\mathcal{P}}(\boldsymbol{\lambda})$ quantifies the expected predictive performance associated with a hyperparameter configuration $\boldsymbol{\lambda}$ for a given data distribution, learning algorithm and performance measure. In practice, given $K$ different datasets, we define $K$ hyperparameter risk mappings

$$R_k(\boldsymbol{\lambda}) = E_{\mathcal{P}_k}\big(L(y, \mathcal{A}_{\boldsymbol{\lambda}}(\mathcal{D})(\boldsymbol{x}))\big)$$

and the average risk of $\boldsymbol{\lambda}$ over the $K$ datasets:

$$R(\boldsymbol{\lambda}) = \frac{1}{K} \sum_{k=1}^{K} R_k(\boldsymbol{\lambda}).$$
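As a minimal illustration of the average risk defined above, the following Python sketch assumes that the per-dataset risks $R_k(\boldsymbol{\lambda}_m)$ are already available as a small table. The table `risks`, its values, and the function name `average_risk` are hypothetical and are not taken from the paper.

```python
# Hypothetical risk table: risks[k][m] stands for R_k(lambda_m), the estimated
# risk (e.g. 1 - accuracy) of configuration m on dataset k. Values are made up.
risks = [
    [0.10, 0.30, 0.20],  # dataset 1
    [0.40, 0.10, 0.20],  # dataset 2
    [0.15, 0.35, 0.20],  # dataset 3
]


def average_risk(risks, m):
    """Average risk R(lambda_m) = (1/K) * sum over k of R_k(lambda_m)."""
    return sum(row[m] for row in risks) / len(risks)


# Index of the configuration with the lowest average risk (best single default).
best_single = min(range(len(risks[0])), key=lambda m: average_risk(risks, m))
```

In this toy table the configuration with the lowest average risk is not the best choice on any individual dataset, so a single default chosen this way can be a compromise.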
Our goal now is to find a fixed-size set $\Lambda_{\mathrm{def}}$ of size $T$ that works well over a variety of datasets, in the sense that for each dataset $\mathcal{D}$, $\Lambda_{\mathrm{def}}$ contains at least one configuration that works well on $\mathcal{D}$. The risk of a set of configurations $\Lambda_{\mathrm{def}}$ of size $T$, given an aggregation function $h$ (e.g. the mean) and datasets $1, \dots, K$, is then given by:

$$G(\Lambda_{\mathrm{def}}) = h\Big(\min_{t=1,\dots,T} R_1(\boldsymbol{\lambda}_t), \dots, \min_{t=1,\dots,T} R_K(\boldsymbol{\lambda}_t)\Big)$$

Finding an optimal subset $\Lambda_{\mathrm{def}}$ defines a (meta-)learning problem that can be solved exactly or using a greedy approximation.

Footnote 1: Full version of this article: https://arxiv.org/abs/1811.09409

[Figure 1, three panels; x-axis: Number of evaluations (1, 2, 4, 8, 16, 32); y-axis: Normalized Accuracy (0.00 to 1.00)]
Figure 1: Defaults (red), random search (blue) and Bayesian optimization (green) across several budgets for AdaBoost (left), Random Forest (middle) and SVM (right)

The exact version can be formulated as an instance of Mixed Integer Programming. In order to obtain a set of $n$ defaults, the goal is to minimize

$$\sum_{k=1}^{K} \sum_{m=1}^{M} \Psi_{k,m} \cdot R_k(\boldsymbol{\lambda}_m) \qquad (1)$$

subject to

$$\sum_{m=1}^{M} \phi_m = n$$
$$\forall k: \forall m: \ \Psi_{k,m} \geq \phi_m - \sum_{s \in Q(k,m)} \phi_s$$
$$\forall k: \forall m: \ \Psi_{k,m} \geq 0$$
$$\forall k: \ \sum_{m=1}^{M} \Psi_{k,m} = 1$$

After the optimization procedure, element $\Psi_{k,m}$ will be 1 if and only if configuration $\boldsymbol{\lambda}_m$ has the lowest risk on dataset $k$ out of all the configurations that are in the set of defaults. $\phi_m$ is an auxiliary variable. Since the exact solution is computationally prohibitively expensive, we adopt a greedy procedure for $t = 1, \dots, T$:

$$\boldsymbol{\lambda}_{\mathrm{def},t} := \operatorname*{arg\,min}_{\boldsymbol{\lambda} \in \Lambda} G(\{\boldsymbol{\lambda}\} \cup \Lambda_{\mathrm{def},t-1}) \qquad (2)$$
$$\Lambda_{\mathrm{def},t} := \{\boldsymbol{\lambda}_{\mathrm{def},1}, \dots, \boldsymbol{\lambda}_{\mathrm{def},t}\} \qquad (3)$$

where $\Lambda_{\mathrm{def},0} = \emptyset$ and the final solution is $\Lambda_{\mathrm{def}} = \Lambda_{\mathrm{def},T}$.
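The greedy procedure in equations (2) and (3) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the authors' implementation: it searches over a finite list of candidate configurations instead of the full space $\Lambda$, it reads risks from a precomputed table, and the names `set_risk`, `greedy_defaults` and the toy values in `risks` are hypothetical.

```python
def set_risk(risks, chosen, h=lambda values: sum(values) / len(values)):
    """G(chosen): aggregate (default: mean) over datasets of the lowest risk
    achieved by any configuration in the chosen set."""
    return h([min(row[m] for m in chosen) for row in risks])


def greedy_defaults(risks, T):
    """Greedily pick T configuration indices; each step adds the candidate
    that minimizes the set risk of the enlarged set (cf. equation (2))."""
    chosen = []
    candidates = range(len(risks[0]))
    for _ in range(T):
        remaining = [m for m in candidates if m not in chosen]
        best = min(remaining, key=lambda m: set_risk(risks, chosen + [m]))
        chosen.append(best)
    return chosen


# Hypothetical table: risks[k][m] is the risk of configuration m on dataset k.
risks = [
    [0.10, 0.30, 0.20],
    [0.40, 0.10, 0.20],
    [0.15, 0.35, 0.20],
]
defaults = greedy_defaults(risks, T=2)
```

On this toy table the greedy procedure first selects the best single configuration and then the one that complements it best. The resulting pair has a higher set risk than the pair of configurations 0 and 1, which illustrates that the greedy solution is an approximation and need not coincide with the exact optimum.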
It is possible to estimate $R_k(\boldsymbol{\lambda})$ empirically using cross-validation, but since this is computationally expensive, we employ surrogate models that predict the performance for a given hyperparameter configuration, resulting in a fast approximate way to evaluate performances. This approach can be extended to a set of defaults across algorithms.

3 EXPERIMENTAL EVALUATION
We estimate the generalization performance of our approach on future datasets by running a leave-one-dataset-out CV scheme over $K$ datasets, estimating performances for each held-out dataset using outer 10-fold CV and nested 5-fold CV for choosing the hyperparameter. We compare to random search with several budgets and Bayesian optimization with 32 iterations. We use approximately 137,000 experimental results available on OpenML [4] to evaluate the lists of defaults on three algorithms from scikit-learn and 100 datasets from the OpenML100 [1]. We evaluate using AdaBoost (5 hyperparameters), SVM (6 hyperparameters), and random forest (6 hyperparameters), optimizing predictive accuracy. Hyperparameters and their respective ranges are the same as used in [3]. Figure 1 presents the results of the set of defaults obtained by our approach and of the baselines across the 3 algorithms, normalized to [0, 1] per algorithm and task and aggregated using the mean. For defaults and random search, more iterations strictly improve performance. As expected, random search with only 1 or 2 iterations performs poorly, while Bayesian optimization is often among the best strategies. We further observe that using only a few defaults is already competitive with Bayesian optimization and higher-budget random search, often competitive with random search with 4 to 8 times more budget. We note that using sets of defaults is especially worthwhile when either computation time or expertise on hyperparameter optimization is lacking. Especially in the regime of few function evaluations, sets of defaults seem to work well and are statistically equivalent to state-of-the-art techniques.
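The normalization and aggregation behind curves such as those in Figure 1 can be sketched as follows. The text does not spell out the exact normalization, so this sketch assumes min-max scaling per task over the accuracies observed for that task; the task names, accuracy values, and function names are hypothetical.

```python
def best_so_far(accuracies, budgets):
    """Best accuracy among the first b evaluations, for each budget b."""
    return [max(accuracies[:b]) for b in budgets]


def min_max_normalize(values, lo, hi):
    """Scale values to [0, 1] given a task's worst (lo) and best (hi) accuracy.
    Assumption: min-max scaling; the paper's exact scheme may differ."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical accuracies of an ordered list of four defaults on two tasks.
tasks = {
    "task_a": [0.80, 0.85, 0.82, 0.90],
    "task_b": [0.60, 0.58, 0.70, 0.65],
}
budgets = [1, 2, 4]

curves = []
for accuracies in tasks.values():
    curve = best_so_far(accuracies, budgets)
    curves.append(min_max_normalize(curve, min(accuracies), max(accuracies)))

# Mean normalized accuracy per budget, aggregated over tasks.
mean_curve = [
    sum(curve[i] for curve in curves) / len(curves) for i in range(len(budgets))
]
```

Because each point is the best result within the first b evaluations, the resulting curve cannot decrease as the budget grows, which matches the observation that more evaluations do not hurt a list of defaults.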
A potential drawback is that the defaults are optimal with respect to a single metric such as accuracy or AUC, and thus might need to be used separately for different evaluation metrics. Our results can readily be implemented in machine learning software as simple, hard-coded lists of parameters. These will require less knowledge of hyperparameter optimization from the users than current methods, and lead to faster results in many cases.

Acknowledgements. This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.

REFERENCES
[1] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G Mantovani, Jan N van Rijn, and Joaquin Vanschoren. 2017. OpenML Benchmarking Suites and the OpenML100. arXiv preprint arXiv:1708.03731v1 (2017).
[2] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. 2015. Initializing Bayesian Hyperparameter Optimization via Meta-learning. In Proc. AAAI (Austin, Texas). AAAI Press, 1128–1135.
[3] Jan N. van Rijn and Frank Hutter. 2018. Hyperparameter Importance Across Datasets. In Proc. of KDD. ACM, 2367–2376.
[4] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. 2014. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15, 2 (2014), 49–60.
[5] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. 2015. Learning hyperparameter optimization initializations. In Proc. of DSAA. IEEE, 1–10.
[6] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. 2015. Sequential Model-Free Hyperparameter Tuning. In Proc. of ICDM. 1033–1038.
