69 research outputs found
Extended UCB Policy for Multi-Armed Bandit with Light-Tailed Reward Distributions
We consider the multi-armed bandit problems in which a player aims to accrue
reward by sequentially playing a given set of arms with unknown reward
statistics. In the classic work, policies were proposed to achieve the optimal
logarithmic regret order for some special classes of light-tailed reward
distributions, e.g., Auer et al.'s UCB1 index policy for reward distributions
with finite support. In this paper, we extend Auer et al.'s UCB1 index policy
to achieve the optimal logarithmic regret order for all light-tailed (or
equivalently, locally sub-Gaussian) reward distributions defined by the (local)
existence of the moment-generating function.Comment: 9 pages, 1 figur
Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with
unknown reward models. At each time, a player selects one arm to play, aiming
to maximize the total expected reward over a horizon of length T. An approach
based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is
developed for constructing sequential arm selection policies. It is shown that
for all light-tailed reward distributions, DSEE achieves the optimal
logarithmic order of the regret, where regret is defined as the total expected
reward loss against the ideal case with known reward models. For heavy-tailed
reward distributions, DSEE achieves O(T^1/p) regret when the moments of the
reward distributions exist up to the pth order for 1<p<=2 and O(T^1/(1+p/2))
for p>2. With the knowledge of an upperbound on a finite moment of the
heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret
order. The proposed DSEE approach complements existing work on MAB by providing
corresponding results for general reward distributions. Furthermore, with a
clearly defined tunable parameter-the cardinality of the exploration sequence,
the DSEE approach is easily extendable to variations of MAB, including MAB with
various objectives, decentralized MAB with multiple players and incomplete
reward observations under collisions, MAB with unknown Markov dynamics, and
combinatorial MAB with dependent arms that often arise in network optimization
problems such as the shortest path, the minimum spanning, and the dominating
set problems under unknown random weights.Comment: 22 pages, 2 figure
Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk
We study the trade-off between expectation and tail risk for regret
distribution in the stochastic multi-armed bandit problem. We fully
characterize the interplay among three desired properties for policy design:
worst-case optimality, instance-dependent consistency, and light-tailed risk.
We show how the order of expected regret exactly affects the decaying rate of
the regret tail probability for both the worst-case and instance-dependent
scenario. A novel policy is proposed to characterize the optimal regret tail
probability for any regret threshold. Concretely, for any given and , our policy achieves a worst-case expected regret
of (we call it -optimal) and an instance-dependent
expected regret of (we call it -consistent), while
enjoys a probability of incurring an regret
( in the worst-case scenario and in the
instance-dependent scenario) that decays exponentially with a polynomial
term. Such decaying rate is proved to be best achievable. Moreover, we discover
an intrinsic gap of the optimal tail rate under the instance-dependent scenario
between whether the time horizon is known a priori or not. Interestingly,
when it comes to the worst-case scenario, this gap disappears. Finally, we
extend our proposed policy design to (1) a stochastic multi-armed bandit
setting with non-stationary baseline rewards, and (2) a stochastic linear
bandit setting. Our results reveal insights on the trade-off between regret
expectation and regret tail risk for both worst-case and instance-dependent
scenarios, indicating that more sub-optimality and inconsistency leave space
for more light-tailed risk of incurring a large regret, and that knowing the
planning horizon in advance can make a difference on alleviating tail risks
Satisficing in multi-armed bandit problems
Satisficing is a relaxation of maximizing and allows for less risky decision
making in the face of uncertainty. We propose two sets of satisficing
objectives for the multi-armed bandit problem, where the objective is to
achieve reward-based decision-making performance above a given threshold. We
show that these new problems are equivalent to various standard multi-armed
bandit problems with maximizing objectives and use the equivalence to find
bounds on performance. The different objectives can result in qualitatively
different behavior; for example, agents explore their options continually in
one case and only a finite number of times in another. For the case of Gaussian
rewards we show an additional equivalence between the two sets of satisficing
objectives that allows algorithms developed for one set to be applied to the
other. We then develop variants of the Upper Credible Limit (UCL) algorithm
that solve the problems with satisficing objectives and show that these
modified UCL algorithms achieve efficient satisficing performance.Comment: To appear in IEEE Transactions on Automatic Contro
Poisson process bandits:Sequential models and algorithms for maximising the detection of point data
In numerous settings in areas as diverse as security, ecology, astronomy, and logistics, it is desirable to optimally deploy a limited resource to observe events, which may be modelled as point data arising according to a Non-homogeneous Poisson process. Increasingly, thanks to developments in mobile and adaptive technologies, it is possible to update a deployment of such resource and gather feedback on the quality of multiple actions. Such a capability presents the opportunity to learn, and with it a classic problem in operations research and machine learning - the explorationexploitation dilemma. To perform optimally, how should investigative choices which explore the value of poorly understood actions and optimising choices which choose actions known to be of a high value be balanced? Effective techniques exist to resolve this dilemma in simpler settings, but the Poisson process data brings new challenges. In this thesis, effective solution methods for the problem of sequentially deploying resource are developed, via a combination of efficient inference schemes, bespoke optimisation approaches, and advanced sequential decision-making strategies. Furthermore, extensive theoretical work provides strong guarantees on the performance of the proposed solution methods and an understanding of the challenges of this problem and more complex extensions. In particular, Upper Confidence Bound and Thompson Sampling (TS) approaches are derived for combinatorial and continuum-armed bandit versions of the problem, with accompanying analysis displaying that the regret of the approaches is of optimal order. A broader understanding of the performance of TS based on non-parametric models for smooth reward functions is developed, and new posterior contraction results for the Gaussian Cox Process, a popular Bayesian non-parametric model of point data, are derived. These results point to effective strategies for more challenging variants of the event detection problem, and more generally advance the understanding of bandit decision-making with complex data structures
- …