Exploration vs Exploitation vs Safety: Risk-averse Multi-Armed Bandits
Motivated by applications in energy management, this paper presents the
Multi-Armed Risk-Aware Bandit (MARAB) algorithm. With the goal of limiting the
exploration of risky arms, MARAB takes as arm quality its conditional value at
risk. When the user-supplied risk level goes to 0, the arm quality tends toward
the essential infimum of the arm distribution density, and MARAB tends toward
the MIN multi-armed bandit algorithm, aimed at the arm with maximal minimal
value. As a first contribution, this paper presents a theoretical analysis of
the MIN algorithm under mild assumptions, establishing its robustness
comparatively to UCB. The analysis is supported by extensive experimental
validation of MIN and MARAB, compared to UCB and state-of-the-art risk-aware
MAB algorithms, on artificial and real-world problems. Comment: 16 pages
Functional Bandits
We introduce the functional bandit problem, where the objective is to find an
arm that optimises a known functional of the unknown arm-reward distributions.
These problems arise in many settings such as maximum entropy methods in
natural language processing, and risk-averse decision-making, but current
best-arm identification techniques fail in these domains. We propose a new
approach that combines functional estimation and arm elimination to tackle
this problem. This method achieves provably efficient performance guarantees.
In addition, we illustrate this method on a number of important functionals in
risk management and information theory, and refine our generic theoretical
results in those cases.
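The combination of functional estimation and arm elimination can be sketched generically. The functional here (negative variance, a risk-averse choice) and the fixed confidence radius are illustrative placeholders, not the paper's estimators or concentration bounds: each round, a plug-in estimate of the functional is computed per arm, and arms whose inflated score cannot match the current best are eliminated.

```python
import numpy as np

def plugin_neg_variance(samples):
    # Plug-in estimate of one example functional of the arm's reward law:
    # negative variance, so that maximising it is risk-averse.
    return -float(np.var(samples))

def eliminate(arms_samples, radius):
    """One elimination round: keep only arms whose estimated functional
    value, inflated by a confidence radius, still reaches the best
    estimate. The radius stands in for a real concentration bound."""
    scores = [plugin_neg_variance(s) for s in arms_samples]
    best = max(scores)
    return [i for i, sc in enumerate(scores) if sc + 2 * radius >= best]

# Arm 1 has visibly higher variance and is eliminated.
survivors = eliminate([[0.5] * 10, [0.0, 1.0] * 5], radius=0.05)
print(survivors)  # [0]
```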
Satisficing in multi-armed bandit problems
Satisficing is a relaxation of maximizing and allows for less risky decision
making in the face of uncertainty. We propose two sets of satisficing
objectives for the multi-armed bandit problem, where the objective is to
achieve reward-based decision-making performance above a given threshold. We
show that these new problems are equivalent to various standard multi-armed
bandit problems with maximizing objectives and use the equivalence to find
bounds on performance. The different objectives can result in qualitatively
different behavior; for example, agents explore their options continually in
one case and only a finite number of times in another. For the case of Gaussian
rewards we show an additional equivalence between the two sets of satisficing
objectives that allows algorithms developed for one set to be applied to the
other. We then develop variants of the Upper Credible Limit (UCL) algorithm
that solve the problems with satisficing objectives and show that these
modified UCL algorithms achieve efficient satisficing performance. Comment: To appear in IEEE Transactions on Automatic Control
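One common way to formalise the satisficing objective above (a sketch, not necessarily the paper's exact definition) is to charge regret only for the shortfall below the threshold M. Under that definition the problem is equivalent to a standard maximizing bandit whose arm means are capped at M, which is the kind of equivalence the abstract exploits:

```python
def satisficing_regret(means, pulls, threshold):
    """Expected satisficing regret for reward threshold M: an arm incurs
    regret only for its shortfall below M, so the problem matches a
    standard bandit with transformed means min(mu, M). Any arm with
    mean >= M is 'good enough' and contributes zero regret, which is
    why exploration can stop after finitely many pulls in that regime."""
    return sum(max(threshold - mu, 0.0) * n for mu, n in zip(means, pulls))

# Arm 0 falls 0.3 short of the threshold and was pulled 10 times;
# arm 1 exceeds the threshold, so its 5 pulls are free.
print(satisficing_regret([0.5, 0.9], [10, 5], threshold=0.8))  # 3.0
```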
A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits
In a typical stochastic multi-armed bandit problem, the objective is often to
maximize the expected sum of rewards over some time horizon T. While a
strategy that accomplishes this is optimal when no additional information is
available, this is no longer the case once additional environment-specific
knowledge is provided. In particular, in areas of high volatility like
healthcare or finance, a naive reward maximization approach often does not
accurately capture the complexity of the learning problem and results in
unreliable solutions. To tackle problems of this nature, we propose a framework
of adaptive risk-aware strategies that operate in non-stationary environments.
Our framework incorporates various risk measures prevalent in the literature to
map multiple families of multi-armed bandit algorithms into a risk-sensitive
setting. In addition, we equip the resulting algorithms with the Restarted
Bayesian Online Change-Point Detection (R-BOCPD) algorithm and impose a
(tunable) forced exploration strategy to detect local (per-arm) switches. We
provide finite-time theoretical guarantees and an asymptotic regret bound
stated in terms of the time horizon T and the total number of change-points.
In practice, our framework compares favorably to the
state-of-the-art in both synthetic and real-world environments and manages to
perform efficiently with respect to both risk-sensitivity and non-stationarity.
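The mapping of a mean-based bandit index into a risk-sensitive setting can be illustrated with one of the risk measures the abstract mentions. This is a generic sketch, not the paper's rule: the empirical mean in a UCB-style index is replaced by the mean-variance risk measure (mean minus rho times empirical variance), while the exploration bonus is kept.

```python
import math

def risk_aware_ucb_index(samples, t, rho=1.0, c=2.0):
    """UCB-style index under the mean-variance risk measure: empirical
    mean penalised by rho times the empirical variance, plus the usual
    exploration bonus sqrt(c * log t / n). rho and c are illustrative
    tuning knobs, not values from the paper."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean - rho * var + math.sqrt(c * math.log(t) / n)

# A volatile arm is penalised even when its mean looks attractive.
print(risk_aware_ucb_index([0.0, 2.0], t=1))  # mean 1.0, var 1.0 -> 0.0
```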
Risk-aware linear bandits with convex loss
In decision-making problems such as the multi-armed bandit, an agent learns
sequentially by optimizing a certain feedback. While the mean reward criterion
has been extensively studied, other measures that reflect an aversion to
adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR),
can be of interest for critical applications (healthcare, agriculture).
Algorithms have been proposed for such risk-aware measures under bandit
feedback without contextual information. In this work, we study contextual
bandits where such risk measures can be elicited as linear functions of the
contexts through the minimization of a convex loss. A typical example that fits
within this framework is the expectile measure, which is obtained as the
solution of an asymmetric least-square problem. Using the method of mixtures
for supermartingales, we derive confidence sequences for the estimation of such
risk measures. We then propose an optimistic UCB algorithm to learn optimal
risk-aware actions, with regret guarantees similar to those of generalized
linear bandits. This approach requires solving a convex problem at each round
of the algorithm, which we can relax by allowing only approximate solutions
obtained by online gradient descent, at the cost of slightly higher regret. We
conclude by evaluating the resulting algorithms on numerical experiments.
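The expectile example in the abstract is easy to make concrete. A minimal sketch, in the non-contextual case: the tau-expectile of a sample is the minimiser of the asymmetric least-squares loss mean(w * (x - m)^2) with weight w = tau when x exceeds m and 1 - tau otherwise, which we solve here by plain gradient descent (the step size and iteration count are illustrative choices).

```python
import numpy as np

def expectile(samples, tau, iters=500, lr=0.1):
    """Empirical tau-expectile via gradient descent on the asymmetric
    least-squares loss. tau = 0.5 recovers the ordinary mean."""
    x = np.asarray(samples, dtype=float)
    m = float(x.mean())  # the mean is the exact solution for tau = 0.5
    for _ in range(iters):
        w = np.where(x > m, tau, 1.0 - tau)
        grad = -2.0 * float(np.mean(w * (x - m)))
        m -= lr * grad
    return m

data = [0.0, 1.0, 2.0, 3.0]
print(expectile(data, 0.5))  # 1.5, the sample mean
print(expectile(data, 0.9))  # above the mean: upper tail weighted more
```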
The Fragility of Optimized Bandit Algorithms
Much of the literature on optimal design of bandit algorithms is based on
minimization of expected regret. It is well known that designs that are optimal
over certain exponential families can achieve expected regret that grows
logarithmically in the number of arm plays, at a rate governed by the
Lai-Robbins lower bound. In this paper, we show that when one uses such
optimized designs, the regret distribution of the associated algorithms
necessarily has a very heavy tail, specifically, that of a truncated Cauchy
distribution. Furthermore, for p > 1, the p'th moment of the regret
distribution grows much faster than poly-logarithmically, in particular as a
power of the total number of arm plays. We show that optimized UCB bandit
designs are also fragile in an additional sense, namely when the problem is
even slightly mis-specified, the regret can grow much faster than the
conventional theory suggests. Our arguments are based on standard
change-of-measure ideas, and indicate that the most likely way that regret
becomes larger than expected is when the optimal arm returns below-average
rewards in the first few arm plays, thereby causing the algorithm to believe
that the arm is sub-optimal. To alleviate the fragility issues exposed, we show
that UCB algorithms can be modified so as to ensure a desired degree of
robustness to mis-specification. In doing so, we also provide a sharp trade-off
between the amount of UCB exploration and the tail exponent of the resulting
regret distribution.
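The exploration knob behind the trade-off described above can be sketched with a plain UCB implementation on Bernoulli arms. This is a generic illustration, not the paper's modified algorithm: the constant c scales the confidence bonus, and the paper's remedy amounts, roughly, to exploring more aggressively than the regret-optimal tuning in exchange for a lighter-tailed regret distribution.

```python
import math
import random

def ucb_play(means, horizon, c=2.0, seed=0):
    """Run UCB with exploration constant c on Bernoulli arms and return
    the realised pseudo-regret. Each arm is played once, then the arm
    with the largest index mean + sqrt(c * log t / n) is chosen."""
    rng = random.Random(seed)
    k = len(means)
    n = [0] * k          # pull counts
    s = [0.0] * k        # reward sums
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1    # initialisation: play each arm once
        else:
            a = max(range(k),
                    key=lambda i: s[i] / n[i] + math.sqrt(c * math.log(t) / n[i]))
        reward = 1.0 if rng.random() < means[a] else 0.0
        n[a] += 1
        s[a] += reward
        regret += best - means[a]
    return regret
```

Running many seeds and comparing the empirical regret quantiles for small versus large c is one way to observe the heavy-tail phenomenon the paper analyses.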
Risk-averse multi-armed bandits and game theory
The multi-armed bandit (MAB) and game theory literature is mainly focused on the expected cumulative reward and the expected payoffs in a game, respectively. In contrast, the rewards and the payoffs are often random variables whose expected values only capture a vague idea of the overall distribution. The focus of this dissertation is to study the fundamental limits of the existing bandits and game theory problems in a risk-averse framework and propose new ideas that address the shortcomings. The author believes that human beings are mostly risk-averse, so studying multi-armed bandits and game theory from the point of view of risk aversion, rather than expected reward/payoff, better captures reality.
In this manner, a specific class of multi-armed bandits, called explore-then-commit bandits, and stochastic games are studied in this dissertation, which are based on the notion of Risk-Averse Best Action Decision with Incomplete Information (R-ABADI, Abadi is the maiden name of the author's mother). The goal of the classical multi-armed bandits is to exploit the arm with the maximum score defined as the expected value of the arm reward. Instead, we propose a new definition of score that is derived from the joint distribution of all arm rewards and captures the reward of an arm relative to those of all other arms.
We use a similar idea for games and propose a risk-averse R-ABADI equilibrium in game theory that is possibly different from the Nash equilibrium. The payoff distributions are taken into account to derive the risk-averse equilibrium, while the expected payoffs are used to find the Nash equilibrium. The fundamental properties of games, e.g. pure and mixed risk-averse R-ABADI equilibrium and strict dominance, are studied in the new framework and the results are expanded to finite-time games. Furthermore, the stochastic congestion games are studied from a risk-averse perspective and three classes of equilibria are proposed for such games.
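A score "derived from the joint distribution of all arm rewards" that ranks an arm relative to the others can be sketched empirically. This is one plausible reading, not the dissertation's exact R-ABADI score: estimate, from joint draws of all arm rewards, the probability that each arm yields the highest reward.

```python
import numpy as np

def relative_scores(joint_samples):
    """Score each arm by the empirical probability that it attains the
    highest reward in a joint draw of all arm rewards (ties go to the
    lowest-indexed arm). A sketch in the spirit of a relative,
    joint-distribution-based score, not the R-ABADI definition itself."""
    draws = np.asarray(joint_samples, dtype=float)  # (n_draws, n_arms)
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=draws.shape[1]) / len(draws)

# Arm 0 wins 3 of 4 joint draws, arm 1 wins once.
print(relative_scores([[1, 0], [1, 0], [0, 1], [1, 0]]))  # [0.75 0.25]
```

Unlike a per-arm expected value, this score changes when the other arms' reward distributions change, which is exactly the "relative to all other arms" property the abstract emphasises.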
It is shown by examples that the risk-averse behavior of travelers in a stochastic congestion game can improve the price of anarchy in Pigou and Braess networks. Furthermore, the Braess paradox does not occur to the extent proposed originally when travelers are risk-averse.
We also study an online affinity scheduling problem with no prior knowledge of the task arrival rates and processing rates of different task types on different servers. We propose the Blind GB-PANDAS algorithm that utilizes an exploration-exploitation scheme to load balance incoming tasks on servers in an online fashion. We prove that Blind GB-PANDAS is throughput optimal, i.e. it stabilizes the system as long as the task arrival rates are inside the capacity region. The Blind GB-PANDAS algorithm is compared to FCFS, Max-Weight, and c-mu-rule algorithms in terms of average task completion time through simulations, where the same exploration-exploitation approach as Blind GB-PANDAS is used for Max-Weight and c-mu-rule. The extensive simulations show that the Blind GB-PANDAS algorithm conspicuously outperforms the three other algorithms at high loads.