Information Directed Sampling and Bandits with Heteroscedastic Noise
In the stochastic bandit problem, the goal is to maximize an unknown function
via a sequence of noisy evaluations. Typically, the observation noise is
assumed to be independent of the evaluation point and to satisfy a tail bound
uniformly on the domain, a restrictive assumption for many applications. In
this work, we consider bandits with heteroscedastic noise, where we explicitly
allow the noise distribution to depend on the evaluation point. We show that
this leads to new trade-offs for information and regret, which are not taken
into account by existing approaches such as upper confidence bound (UCB)
algorithms or Thompson Sampling. To address these shortcomings, we introduce a
frequentist regret analysis framework that is similar to the Bayesian
framework of Russo and Van Roy (2014), and we prove a new high-probability
regret bound for general, possibly randomized policies, which depends on a
quantity we refer to as regret-information ratio. From this bound, we define a
frequentist version of Information Directed Sampling (IDS) to minimize the
regret-information ratio over all possible action sampling distributions. This
further relies on concentration inequalities for online least squares
regression in separable Hilbert spaces, which we generalize to the case of
heteroscedastic noise. We then formulate several variants of IDS for linear and
reproducing kernel Hilbert space response functions, yielding novel algorithms
for Bayesian optimization. We also prove frequentist regret bounds, which in
the homoscedastic case recover known bounds for UCB, but can be much better
when the noise is heteroscedastic. Empirically, we demonstrate in a linear
setting with heteroscedastic noise that some of our methods can outperform UCB
and Thompson Sampling, while staying competitive when the noise is
homoscedastic.
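To illustrate the selection rule, here is a minimal sketch of IDS over a finite action set, assuming per-action regret estimates delta and information-gain estimates info are already available (e.g., from confidence bounds); the function name and the grid search are our own. It uses the fact, due to Russo and Van Roy (2014), that the minimizing distribution is supported on at most two actions.

    import numpy as np

    def ids_distribution(delta, info, grid=1001):
        # Minimize (expected regret)^2 / (expected information gain) over
        # action distributions. The minimizer mixes at most two actions,
        # so a search over pairs and mixing weights suffices.
        n = len(delta)
        best_ratio, best_p = np.inf, None
        qs = np.linspace(0.0, 1.0, grid)
        for i in range(n):
            for j in range(n):
                d = qs * delta[i] + (1.0 - qs) * delta[j]   # mixed regret
                g = qs * info[i] + (1.0 - qs) * info[j]     # mixed info gain
                ratio = np.full_like(d, np.inf)
                np.divide(d ** 2, g, out=ratio, where=g > 0)
                k = int(np.argmin(ratio))
                if ratio[k] < best_ratio:
                    best_ratio = ratio[k]
                    best_p = np.zeros(n)
                    best_p[i] += qs[k]
                    best_p[j] += 1.0 - qs[k]
        return best_p, best_ratio

The heteroscedastic setting would enter through the estimates themselves: both delta and info would be computed from noise-adaptive confidence sets.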
Bias-Robust Bayesian Optimization via Dueling Bandits
We consider Bayesian optimization in settings where observations can be
adversarially biased, for example by an uncontrolled hidden confounder. Our
first contribution is a reduction of the confounded setting to the dueling
bandit model. Then we propose a novel approach for dueling bandits based on
information-directed sampling (IDS). Thereby, we obtain the first efficient
kernelized algorithm for dueling bandits that comes with cumulative regret
guarantees. Our analysis further generalizes a previously proposed
semi-parametric linear bandit model to non-linear reward functions, and
uncovers interesting links to doubly-robust estimation.
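As a sketch of how such a reduction can work (our reading; evaluate is a hypothetical oracle, and the additive, per-round bias model is an assumption of this sketch): if both points in a round are observed under the same bias, differencing cancels the confounder and yields exactly the relative feedback a dueling-bandit method consumes.

    def duel(evaluate, x, x_prime):
        # Assumed observation model: evaluate(x) returns f(x) + b_t + noise,
        # with the bias b_t shared by both queries of the round.
        y = evaluate(x)
        y_prime = evaluate(x_prime)
        return y - y_prime  # noisy but bias-free estimate of f(x) - f(x_prime)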
Stochastic Bandits with Context Distributions
We introduce a stochastic contextual bandit model where at each time step the
environment chooses a distribution over a context set and samples the context
from this distribution. The learner observes only the context distribution
while the exact context realization remains hidden. This allows for a broad
range of applications where the context is stochastic or when the learner needs
to predict the context. We adapt the UCB algorithm to this setting and show
that it achieves an order-optimal high-probability bound on the cumulative
regret for linear and kernelized reward functions. Our results strictly
generalize previous work in the sense that both our model and the algorithm
reduce to the standard setting when the environment chooses only Dirac delta
distributions and therefore provides the exact context to the learner. We
further analyze a variant where the learner observes the realized context after
choosing the action. Finally, we demonstrate the proposed method on synthetic
and real-world datasets.
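As a sketch of the mechanism for a finite context set and linear rewards (names are ours): since only the distribution is revealed, the algorithm forms the expected feature vector under that distribution and applies the standard linear-UCB rule to it.

    import numpy as np

    def select_action(phi, mu, theta_hat, V_inv, beta):
        # phi: (n_actions, n_contexts, d) feature tensor phi(x, c)
        # mu:  (n_contexts,) context distribution announced this round
        psi = np.tensordot(phi, mu, axes=([1], [0]))   # E_{c~mu}[phi(x, c)]
        mean = psi @ theta_hat                         # plug-in reward estimate
        width = beta * np.sqrt(np.einsum('id,de,ie->i', psi, V_inv, psi))
        return int(np.argmax(mean + width))            # optimistic choice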
Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications
Partial monitoring is an expressive framework for sequential decision-making
with an abundance of applications, including graph-structured and dueling
bandits, dynamic pricing and transductive feedback models. We survey and extend
recent results on the linear formulation of partial monitoring that naturally
generalizes the standard linear bandit setting. The main result is that a
single algorithm, information-directed sampling (IDS), is (nearly) worst-case
rate optimal in all finite-action games. We present a simple and unified
analysis of stochastic partial monitoring, and further extend the model to the
contextual and kernelized settings.
Information Directed Sampling for Linear Partial Monitoring
Partial monitoring is a rich framework for sequential decision making under
uncertainty that generalizes many well known bandit models, including linear,
combinatorial and dueling bandits. We introduce information directed sampling
(IDS) for stochastic partial monitoring with a linear reward and observation
structure. IDS achieves adaptive worst-case regret rates that depend on precise
observability conditions of the game. Moreover, we prove lower bounds that
classify the minimax regret of all finite games into four possible regimes. IDS
achieves the optimal rate in all cases up to logarithmic factors, without
tuning any hyper-parameters. We further extend our results to the contextual
and the kernelized setting, which significantly increases the range of possible
applications.
Distributionally Robust Bayesian Optimization
Robustness to distributional shift is one of the key challenges of
contemporary machine learning. Attaining such robustness is the goal of
distributionally robust optimization, which seeks a solution to an optimization
problem that is worst-case robust under a specified distributional shift of an
uncontrolled covariate. In this paper, we study such a problem when the
distributional shift is measured via the maximum mean discrepancy (MMD). For
the setting of zeroth-order, noisy optimization, we present a novel
distributionally robust Bayesian optimization algorithm (DRBO). Our algorithm
provably obtains sub-linear robust regret in various settings that differ in
how the uncertain covariate is observed. We demonstrate the robust performance
of our method on both synthetic and real-world benchmarks.
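A sketch of the robust inner step, assuming a discrete covariate set with kernel matrix K, so that the MMD between weight vectors w and w0 is sqrt((w - w0)^T K (w - w0)); the function name and the use of cvxpy are our own choices.

    import cvxpy as cp

    def worst_case_expectation(u, w0, K, eps):
        # Minimize the expected value of u over distributions w within an
        # MMD ball of radius eps around the reference distribution w0.
        w = cp.Variable(len(u), nonneg=True)
        constraints = [
            cp.sum(w) == 1,                                   # w is a distribution
            cp.quad_form(w - w0, cp.psd_wrap(K)) <= eps ** 2, # MMD^2 constraint
        ]
        cp.Problem(cp.Minimize(u @ w), constraints).solve()
        return w.value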
Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces
Bayesian optimization is known to be difficult to scale to high dimensions,
because the acquisition step requires solving a non-convex optimization problem
in the same search space. In order to scale the method and keep its benefits,
we propose an algorithm (LineBO) that restricts the problem to a sequence of
iteratively chosen one-dimensional sub-problems that can be solved efficiently.
We show that our algorithm converges globally and obtains a fast local rate
when the function is strongly convex. Further, if the objective has an
invariant subspace, our method automatically adapts to the effective dimension
without changing the algorithm. When combined with the SafeOpt algorithm to
solve the sub-problems, we obtain the first safe Bayesian optimization
algorithm with theoretical guarantees applicable in high-dimensional settings.
We evaluate our method on multiple synthetic benchmarks, where we obtain
competitive performance. Further, we deploy our algorithm to optimize the beam
intensity of the Swiss Free Electron Laser with up to 40 parameters while
satisfying safe operation constraints.
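A minimal sketch of the one-dimensional restriction (grid search along a random direction; hypothetical names, and the actual method also admits coordinate or gradient-informed directions, with the safe variant adding constraint handling):

    import numpy as np

    def linebo_step(acquisition, x, lo=-1.0, hi=1.0, grid=256, rng=None):
        # Restrict the acquisition function to a line through the current
        # iterate x and maximize it there: a 1-D problem instead of a d-D one.
        rng = np.random.default_rng() if rng is None else rng
        direction = rng.standard_normal(len(x))
        direction /= np.linalg.norm(direction)       # random unit direction
        ts = np.linspace(lo, hi, grid)
        candidates = x + ts[:, None] * direction     # points on the line
        values = np.array([acquisition(c) for c in candidates])
        return candidates[np.argmax(values)]         # next evaluation point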
Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off
A default assumption in reinforcement learning (RL) and optimal control is
that observations arrive at discrete time points on a fixed clock cycle. Yet,
many applications involve continuous-time systems where the time
discretization, in principle, can be managed. The impact of time discretization
on RL methods has not been fully characterized in existing theory, but a more
detailed analysis of its effect could reveal opportunities for improving
data-efficiency. We address this gap by analyzing Monte-Carlo policy evaluation
for LQR systems and uncover a fundamental trade-off between approximation and
statistical error in value estimation. Importantly, these two errors respond
differently to the time discretization, leading to an optimal choice of temporal
resolution for a given data budget. These findings show that managing the
temporal resolution can provably improve policy evaluation efficiency in LQR
systems with finite data. Empirically, we demonstrate the trade-off in
numerical simulations of LQR instances and standard RL benchmarks for
non-linear continuous control.
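To make the trade-off concrete, a sketch under simplifying assumptions of ours (rollout is a hypothetical simulator returning instantaneous costs on a grid of step h): with a fixed budget of observed transitions, a finer resolution tightens the Riemann-sum approximation of the cost integral but leaves fewer episodes, inflating the Monte-Carlo variance.

    import numpy as np

    def mc_value_estimate(rollout, h, horizon, budget):
        # rollout(h, horizon) returns instantaneous costs c(t_k) on the grid
        # t_k = k * h; the cost integral is approximated by a Riemann sum.
        steps_per_episode = int(horizon / h)
        n_episodes = max(1, budget // steps_per_episode)   # fixed data budget
        returns = [h * np.sum(rollout(h, horizon)) for _ in range(n_episodes)]
        return float(np.mean(returns))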
Tuning Particle Accelerators with Safety Constraints using Bayesian Optimization
Tuning machine parameters of particle accelerators is a repetitive and
time-consuming task that is challenging to automate. While many off-the-shelf
optimization algorithms are available, in practice their use is limited because
most methods do not account for safety-critical constraints that apply to each
iteration, including loss signals or step-size limitations. One notable
exception is safe Bayesian optimization, which is a data-driven tuning approach
for global optimization with noisy feedback. We propose and evaluate a
step-size-limited variant of safe Bayesian optimization at two research
facilities of the Paul Scherrer Institut (PSI): a) the Swiss Free Electron
Laser (SwissFEL)
and b) the High-Intensity Proton Accelerator (HIPA). We report promising
experimental results on both machines, tuning up to 16 parameters subject to
more than 200 constraints.
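A sketch of the candidate filter such a variant might apply before the acquisition step (hypothetical names; constraint_lcb stands for a lower confidence bound on the safety constraints, e.g. from a GP model):

    import numpy as np

    def safe_candidates(X, x_current, constraint_lcb, threshold, max_step):
        # Keep candidates within the step-size limit of the current machine
        # setting whose constraint LCB clears the safety threshold.
        close = np.linalg.norm(X - x_current, axis=1) <= max_step
        safe = constraint_lcb(X) >= threshold        # pessimistic safety check
        return X[close & safe]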
O(\alpha^2 L) Radiative Corrections to Deep Inelastic ep Scattering
The leptonic QED radiative corrections are calculated in the next-to-leading
log approximation for unpolarized deeply inelastic ep scattering in the case
of mixed variables. The corrections are determined using mass factorization
in the OMS scheme for the double-differential scattering cross sections.