Efficient Bayesian Learning Curve Extrapolation using Prior-Data Fitted Networks
Learning curve extrapolation aims to predict model performance in later
epochs of training, based on the performance in earlier epochs. In this work,
we argue that, while the inherent uncertainty in the extrapolation of learning
curves warrants a Bayesian approach, existing methods are (i) overly
restrictive, and/or (ii) computationally expensive. We describe the first
application of prior-data fitted neural networks (PFNs) in this context. A PFN
is a transformer, pre-trained on data generated from a prior, to perform
approximate Bayesian inference in a single forward pass. We propose LC-PFN, a
PFN trained to extrapolate 10 million artificial right-censored learning curves
generated from a parametric prior proposed in prior art using MCMC. We
demonstrate that LC-PFN can approximate the posterior predictive distribution
more accurately than MCMC, while being over 10 000 times faster. We also show
that the same LC-PFN achieves competitive performance extrapolating a total of
20 000 real learning curves from four learning curve benchmarks (LCBench,
NAS-Bench-201, Taskset, and PD1) that stem from training a wide range of model
architectures (MLPs, CNNs, RNNs, and Transformers) on 53 different datasets
with varying input modalities (tabular, image, text, and protein data).
Finally, we investigate its potential in the context of model selection and
find that a simple LC-PFN-based predictive early-stopping criterion obtains
2-6x speed-ups on 45 of these datasets, at virtually no overhead.
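As a minimal sketch of the data side of this approach (not the authors' code), the snippet below samples noisy, saturating learning curves from a simple parametric power-law prior and right-censors them; the functional form, parameter ranges, and cutoff are illustrative assumptions rather than the prior used in the paper.

```python
# Illustrative sketch: generating right-censored learning curves from a
# simple parametric prior, the kind of synthetic data a PFN is pre-trained on.
import numpy as np

rng = np.random.default_rng(0)

def sample_curve(n_epochs=50):
    """Draw one noisy, saturating curve y(t) = y_inf - a * t^(-b)."""
    y_inf = rng.uniform(0.5, 1.0)      # asymptotic performance (assumed range)
    a = rng.uniform(0.1, 0.5)          # initial gap to the asymptote
    b = rng.uniform(0.3, 1.5)          # convergence rate
    sigma = rng.uniform(0.005, 0.03)   # observation noise
    t = np.arange(1, n_epochs + 1)
    return y_inf - a * t ** (-b) + rng.normal(0.0, sigma, size=n_epochs)

def censor(curve, cutoff):
    """Right-censor: only epochs before `cutoff` are observed."""
    return curve[:cutoff], curve[cutoff:]

observed, target = censor(sample_curve(), cutoff=10)
# A PFN would be pre-trained on millions of such (observed, target) pairs and,
# at inference, return an approximate posterior predictive for `target` given
# `observed` in a single forward pass.
```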
Supervising the Multi-Fidelity Race of Hyperparameter Configurations
Multi-fidelity (gray-box) hyperparameter optimization techniques (HPO) have
recently emerged as a promising direction for tuning Deep Learning methods.
However, existing methods suffer from a sub-optimal allocation of the HPO
budget to the hyperparameter configurations. In this work, we introduce DyHPO,
a Bayesian Optimization method that learns to decide which hyperparameter
configuration to train further in a dynamic race among all feasible
configurations. We propose a new deep kernel for Gaussian Processes that embeds
the learning curve dynamics, and an acquisition function that incorporates
multi-budget information. We demonstrate the significant superiority of DyHPO
against state-of-the-art hyperparameter optimization methods through
large-scale experiments comprising 50 datasets (Tabular, Image, NLP) and
diverse architectures (MLP, CNN/NAS, RNN).
Comment: Accepted at NeurIPS 202
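A minimal sketch of the deep-kernel idea, under assumed architecture choices (the convolutional curve encoder, the embedding size, and the way the budget is fed in are illustrative, not the authors' implementation): a feature extractor maps a hyperparameter configuration, its partial learning curve, and a target budget to an embedding on which a standard GP kernel such as an RBF can then be placed.

```python
import torch
import torch.nn as nn

class DeepKernelFeatures(nn.Module):
    """Embeds (configuration, partial learning curve, budget) for a GP kernel."""
    def __init__(self, n_hparams, d_embed=32):
        super().__init__()
        # A 1-D conv summarizes the learning-curve dynamics observed so far.
        self.curve_net = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # An MLP fuses the static configuration, curve summary, and budget.
        self.mlp = nn.Sequential(
            nn.Linear(n_hparams + 8 + 1, 64), nn.ReLU(),
            nn.Linear(64, d_embed),
        )

    def forward(self, hparams, curve, budget):
        # hparams: (B, n_hparams), curve: (B, curve_len), budget: (B, 1)
        c = self.curve_net(curve.unsqueeze(1))  # (B, 8)
        return self.mlp(torch.cat([hparams, c, budget], dim=-1))

# An RBF kernel on these embeddings defines the GP surrogate; an acquisition
# function evaluated at the next budget step then decides which configuration
# to advance in the race.
feats = DeepKernelFeatures(n_hparams=5)
z = feats(torch.randn(4, 5), torch.randn(4, 20), torch.ones(4, 1))
print(z.shape)  # torch.Size([4, 32])
```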
Scaling Laws for Hyperparameter Optimization
Hyperparameter optimization is an important subfield of machine learning that
focuses on tuning the hyperparameters of a chosen algorithm to achieve peak
performance. Recently, there has been a stream of methods tackling
hyperparameter optimization; however, most of them do not exploit the
dominant power-law nature of learning curves for Bayesian optimization. In this
work, we propose Deep Power Laws (DPL), an ensemble of neural network models
conditioned to yield predictions that follow a power-law scaling pattern. Our
method dynamically decides which configurations to pause and train
incrementally by making use of gray-box evaluations. We compare our method
against 7 state-of-the-art competitors on 3 benchmarks related to tabular,
image, and NLP datasets covering 59 diverse tasks. Our method achieves the best
any-time results across all benchmarks, outperforming all competitors.
Comment: Accepted at NeurIPS 202
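A minimal sketch of a single ensemble member with a power-law-constrained output head; the parametrization y(b) = y_inf - a * b^(-c), the softplus constraints, and the network sizes are assumptions for illustration rather than the exact DPL formulation.

```python
import torch
import torch.nn as nn

class PowerLawNet(nn.Module):
    """Predicts performance at budget b as y_inf - a * b^(-c), per configuration."""
    def __init__(self, n_hparams, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_hparams, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # outputs (y_inf, a, c) for each configuration
        )

    def forward(self, hparams, budget):
        y_inf, a, c = self.body(hparams).unbind(dim=-1)
        # Softplus keeps the gap and decay rate positive, so the predicted
        # curve improves monotonically and saturates at y_inf.
        a = nn.functional.softplus(a)
        c = nn.functional.softplus(c)
        return y_inf - a * budget.squeeze(-1) ** (-c)

# An ensemble of independently initialized PowerLawNets provides the mean and
# uncertainty used to decide which configurations to pause or train further.
net = PowerLawNet(n_hparams=5)
pred = net(torch.randn(8, 5), torch.arange(1, 9, dtype=torch.float32).unsqueeze(-1))
print(pred.shape)  # torch.Size([8])
```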
Practical Block-wise Neural Network Architecture Generation
Convolutional neural networks have achieved remarkable success in computer
vision. However, most usable network architectures are hand-crafted and usually
require expertise and elaborate design. In this paper, we provide a block-wise
network generation pipeline called BlockQNN which automatically builds
high-performance networks using the Q-Learning paradigm with an epsilon-greedy
exploration strategy. The optimal network block is constructed by a learning
agent trained to sequentially choose component layers. We stack the
block to construct the whole auto-generated network. To accelerate the
generation process, we also propose a distributed asynchronous framework and an
early-stopping strategy. The block-wise generation brings unique advantages:
(1) it achieves competitive results compared to hand-crafted state-of-the-art
networks on image classification; notably, the best network generated by
BlockQNN achieves a 3.54% top-1 error rate on CIFAR-10, beating all existing
auto-generated networks; (2) it offers a tremendous reduction of the search
space for designing networks, requiring only 3 days with 32 GPUs; and (3) it
generalizes strongly: the network built on CIFAR also performs well on the
larger-scale ImageNet dataset.
Comment: Accepted to CVPR 201
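A minimal sketch of the block-generation idea: a tabular epsilon-greedy action-value learner over a small layer vocabulary, where a placeholder reward stands in for the validation accuracy of the trained block. The layer vocabulary, block length, and the simple Monte-Carlo-style value update are simplifying assumptions, not the exact BlockQNN procedure.

```python
import random
from collections import defaultdict

LAYERS = ["conv3x3", "conv5x5", "maxpool", "avgpool", "identity"]
BLOCK_LEN = 4
EPSILON, ALPHA = 0.1, 0.1

Q = defaultdict(float)  # Q[(step, layer)] -> estimated value of picking `layer` at `step`

def proxy_reward(block):
    """Placeholder for training the generated block and measuring accuracy."""
    return sum(layer.startswith("conv") for layer in block) / BLOCK_LEN

for episode in range(500):
    block = []
    for step in range(BLOCK_LEN):
        if random.random() < EPSILON:                        # explore
            action = random.choice(LAYERS)
        else:                                                # exploit
            action = max(LAYERS, key=lambda a: Q[(step, a)])
        block.append(action)
    reward = proxy_reward(block)
    # Monte-Carlo-style update toward the terminal reward (BlockQNN uses a
    # Q-learning update; this simplification keeps the sketch short).
    for step, action in enumerate(block):
        Q[(step, action)] += ALPHA * (reward - Q[(step, action)])

best_block = [max(LAYERS, key=lambda a: Q[(s, a)]) for s in range(BLOCK_LEN)]
print(best_block)
```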