422 research outputs found
Accounting for variance and hyperparameter optimization in machine learning benchmarks
La récente révolution de l'apprentissage automatique s'est fortement appuyée sur l'utilisation de bancs de test standardisés. Ces derniers sont au centre de la méthodologie scientifique en apprentissage automatique, fournissant des cibles et mesures indéniables des améliorations des algorithmes d'apprentissage. Ils ne garantissent cependant pas la validité des résultats ce qui implique que certaines conclusions scientifiques sur les avancées en intelligence artificielle peuvent s'avérer erronées.
Nous abordons cette question dans cette thèse en soulevant d'abord la problématique (Chapitre 5), que nous étudions ensuite plus en profondeur pour apporter des solutions (Chapitre 6) et finalement developpons un nouvel outil afin d'amélioration la méthodologie des chercheurs (Chapitre 7).
Dans le premier article, chapitre 5, nous démontrons la problématique de la reproductibilité pour des bancs de test stables et consensuels, impliquant que ces problèmes sont endémiques aussi à de grands ensembles d'applications en apprentissage automatique possiblement moins stable et moins consensuels. Dans cet article, nous mettons en évidence l'impact important de la stochasticité des bancs de test, et ce même pour les plus stables tels que la classification d'images. Nous soutenons d'après ces résultats que les solutions doivent tenir compte de cette stochasticité pour améliorer la reproductibilité des bancs de test.
Dans le deuxième article, chapitre 6, nous étudions les différentes sources de variation typiques aux bancs de test en apprentissage automatique, mesurons l'effet de ces variations sur les méthodes de comparaison d'algorithmes et fournissons des recommandations sur la base de nos résultats. Une contribution importante de ce travail est la mesure de la fiabilité d'estimateurs peu coûteux à calculer mais biaisés servant à estimer la performance moyenne des algorithmes. Tel qu'expliqué dans l'article, un estimateur idéal implique plusieurs exécution d'optimisation
d'hyperparamètres ce qui le rend trop coûteux à calculer. La plupart des chercheurs doivent donc recourir à l'alternative biaisée, mais nous ne savions pas jusqu'à présent la magnitude de la dégradation de cet estimateur. Sur la base de nos résultats, nous fournissons des recommandations pour la comparison d'algorithmes sur des bancs de test avec des budgets de calculs limités. Premièrement, les sources de variations devraient être randomisé autant que possible. Deuxièmement, la randomization devrait inclure le partitionnement aléatoire des données pour les ensembles d'entraînement, de validation et de test, qui s'avère être la plus importante des sources de variance. Troisièmement, des tests statistiques tel que la version du Mann-Withney U-test présenté dans notre article devrait être utilisé plutôt que des comparisons sur la simple base de moyennes afin de prendre en considération l'incertitude des mesures de performance.
Dans le chapitre 7, nous présentons un cadriciel d'optimisation d'hyperparamètres développé avec principal objectif de favoriser les bonnes pratiques d'optimisation des hyperparamètres. Le cadriciel est conçu de façon à privilégier une interface simple et intuitive adaptée aux habitudes de travail des chercheurs en apprentissage automatique. Il inclut un nouveau système de versionnage d'expériences afin d'aider les chercheurs à organiser leurs itérations expérimentales et tirer profit des résultats antérieurs pour augmenter l'efficacité de l'optimisation des hyperparamètres. L'optimisation des hyperparamètres joue un rôle important dans les bancs de test, les hyperparamètres étant un facteur confondant significatif. Fournir aux chercheurs un instrument afin de bien contrôler ces facteurs confondants est complémentaire aux recommandations pour tenir compte des sources de variation dans le chapitre 6.
Nos recommendations et l'outil pour l'optimisation d'hyperparametre offre une base solide pour une méthodologie robuste et fiable.The recent revolution in machine learning has been strongly based on the use of standardized benchmarks. Providing clear target metrics and undeniable measures of improvements of learning algorithms, they are at the center of the scientific methodology in machine learning. They do not ensure validity of results however, therefore some scientific conclusions based on flawed methodology may prove to be wrong.
In this thesis we address this question by first raising the issue (Chapter 5), then we study it to find solutions and recommendations (Chapter 6) and build tools to help improve the methodology of researchers (Chapter 7).
In first article, Chapter 5, we demonstrate the issue of reproducibility in stable and consensual benchmarks, implying that these issues are endemic to a large ensemble of machine learning applications that are possibly less stable or less consensual. We raise awareness of the important impact of stochasticity even in stable image classification tasks and contend that solutions for reproducible benchmarks should account for this stochasticity.
In second article, Chapter 6, we study the different sources of variation that are typical in machine learning benchmarks, measure their effect on comparison methods to benchmark algorithms and provide recommendations based on our results. One important contribution of this work is that we measure the reliability of a cheaper but biased estimator for the average performance of algorithms. As explained in the article, an ideal estimator involving multiple rounds of hyperparameter optimization is too computationally expensive. Most researchers must resort to use the biased alternative, but it has been unknown until now how serious a degradation of the quality of estimation this leads to. Our investigations provides guidelines for benchmarks on practical budgets. First, as many sources of variations as possible should be randomized. Second, the partitioning of data in training, validation and test sets should be randomized as well, since this is the most important source of
variation. Finally, statistical tests should be used instead of ad-hoc average comparisons so that the uncertainty of performance estimation can be accounted for when comparing machine learning algorithms.
In Chapter 7, we present a framework for hyperparameter optimization that has been developed with the main goal of encouraging best practices for hyperparameter optimization. The framework is designed to favor a simple and intuitive interface adapted to the workflow of machine learning researchers. It includes a new version control system for experiments to help researchers organize their rounds of experimentations and leverage prior results for more efficient hyperparameter optimization. Hyperparameter optimization plays an important role in benchmarking, with the effect of hyperparameters being a serious confounding factor. Providing an instrument for researchers to properly control this confounding factor is complementary to our
guidelines to account for sources of variation in Chapter 7.
Our recommendations together with our tool for hyperparameter optimization provides a solid basis for a reliable methodology in machine learning benchmarks
Multi-objective Tree-structured Parzen Estimator Meets Meta-learning
Hyperparameter optimization (HPO) is essential for the better performance of
deep learning, and practitioners often need to consider the trade-off between
multiple metrics, such as error rate, latency, memory requirements, robustness,
and algorithmic fairness. Due to this demand and the heavy computation of deep
learning, the acceleration of multi-objective (MO) optimization becomes ever
more important. Although meta-learning has been extensively studied to speedup
HPO, existing methods are not applicable to the MO tree-structured parzen
estimator (MO-TPE), a simple yet powerful MO-HPO algorithm. In this paper, we
extend TPE's acquisition function to the meta-learning setting, using a task
similarity defined by the overlap in promising domains of each task. In a
comprehensive set of experiments, we demonstrate that our method accelerates
MO-TPE on tabular HPO benchmarks and yields state-of-the-art performance. Our
method was also validated externally by winning the AutoML 2022 competition on
"Multiobjective Hyperparameter Optimization for Transformers".Comment: Meta-learning workshop on NeurIPS 202
Efficient Bayesian Learning Curve Extrapolation using Prior-Data Fitted Networks
Learning curve extrapolation aims to predict model performance in later
epochs of training, based on the performance in earlier epochs. In this work,
we argue that, while the inherent uncertainty in the extrapolation of learning
curves warrants a Bayesian approach, existing methods are (i) overly
restrictive, and/or (ii) computationally expensive. We describe the first
application of prior-data fitted neural networks (PFNs) in this context. A PFN
is a transformer, pre-trained on data generated from a prior, to perform
approximate Bayesian inference in a single forward pass. We propose LC-PFN, a
PFN trained to extrapolate 10 million artificial right-censored learning curves
generated from a parametric prior proposed in prior art using MCMC. We
demonstrate that LC-PFN can approximate the posterior predictive distribution
more accurately than MCMC, while being over 10 000 times faster. We also show
that the same LC-PFN achieves competitive performance extrapolating a total of
20 000 real learning curves from four learning curve benchmarks (LCBench,
NAS-Bench-201, Taskset, and PD1) that stem from training a wide range of model
architectures (MLPs, CNNs, RNNs, and Transformers) on 53 different datasets
with varying input modalities (tabular, image, text, and protein data).
Finally, we investigate its potential in the context of model selection and
find that a simple LC-PFN based predictive early stopping criterion obtains 2 -
6x speed-ups on 45 of these datasets, at virtually no overhead
Neural Architecture Search: Insights from 1000 Papers
In the past decade, advances in deep learning have resulted in breakthroughs
in a variety of areas, including computer vision, natural language
understanding, speech recognition, and reinforcement learning. Specialized,
high-performing neural architectures are crucial to the success of deep
learning in these areas. Neural architecture search (NAS), the process of
automating the design of neural architectures for a given task, is an
inevitable next step in automating machine learning and has already outpaced
the best human-designed architectures on many tasks. In the past few years,
research in NAS has been progressing rapidly, with over 1000 papers released
since 2020 (Deng and Lindauer, 2021). In this survey, we provide an organized
and comprehensive guide to neural architecture search. We give a taxonomy of
search spaces, algorithms, and speedup techniques, and we discuss resources
such as benchmarks, best practices, other surveys, and open-source libraries
Efficient Automated Deep Learning for Time Series Forecasting
Recent years have witnessed tremendously improved efficiency of Automated
Machine Learning (AutoML), especially Automated Deep Learning (AutoDL) systems,
but recent work focuses on tabular, image, or NLP tasks. So far, little
attention has been paid to general AutoDL frameworks for time series
forecasting, despite the enormous success in applying different novel
architectures to such tasks. In this paper, we propose an efficient approach
for the joint optimization of neural architecture and hyperparameters of the
entire data processing pipeline for time series forecasting. In contrast to
common NAS search spaces, we designed a novel neural architecture search space
covering various state-of-the-art architectures, allowing for an efficient
macro-search over different DL approaches. To efficiently search in such a
large configuration space, we use Bayesian optimization with multi-fidelity
optimization. We empirically study several different budget types enabling
efficient multi-fidelity optimization on different forecasting datasets.
Furthermore, we compared our resulting system, dubbed \system, against several
established baselines and show that it significantly outperforms all of them
across several datasets
AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks
The fields of both Natural Language Processing (NLP) and Automated Machine
Learning (AutoML) have achieved remarkable results over the past years. In NLP,
especially Large Language Models (LLMs) have experienced a rapid series of
breakthroughs very recently. We envision that the two fields can radically push
the boundaries of each other through tight integration. To showcase this
vision, we explore the potential of a symbiotic relationship between AutoML and
LLMs, shedding light on how they can benefit each other. In particular, we
investigate both the opportunities to enhance AutoML approaches with LLMs from
different perspectives and the challenges of leveraging AutoML to further
improve LLMs. To this end, we survey existing work, and we critically assess
risks. We strongly believe that the integration of the two fields has the
potential to disrupt both fields, NLP and AutoML. By highlighting conceivable
synergies, but also risks, we aim to foster further exploration at the
intersection of AutoML and LLMs
- …