Bandit Algorithms for Tree Search
Bandit based methods for tree search have recently gained popularity when
applied to huge trees, e.g. in the game of Go (Gelly et al., 2006). The UCT
algorithm (Kocsis and Szepesvari, 2006), a tree search method based on Upper
Confidence Bounds (UCB) (Auer et al., 2002), is believed to adapt locally to
the effective smoothness of the tree. However, we show that UCT is too
"optimistic" in some cases, leading to a regret of O(exp(exp(D))) where D is the
depth of the tree. We propose alternative bandit algorithms for tree search.
First, a modification of UCT using a confidence sequence that scales
exponentially with the horizon depth is proven to have a regret O(2^D
\sqrt{n}), but does not adapt to possible smoothness in the tree. We then
analyze Flat-UCB performed on the leaves and provide a finite regret bound with
high probability. Then, we introduce a UCB-based Bandit Algorithm for Smooth
Trees which takes into account actual smoothness of the rewards for performing
efficient "cuts" of sub-optimal branches with high confidence. Finally, we
present an incremental tree search version which applies when the full tree is
too big (possibly infinite) to be entirely represented and show that with high
probability, essentially only the optimal branches are indefinitely developed.
We illustrate these methods on a global optimization problem of a Lipschitz
function, given noisy data.
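The UCB index that UCT applies at each node can be sketched as follows. This is a minimal illustration of the standard UCB1 rule of Auer et al. (2002), not of the modified algorithms proposed in the abstract above; the function names and the exploration constant `c` are assumptions made for illustration.

```python
import math


def ucb1_index(mean_reward, pulls, total_pulls, c=2.0):
    """Upper confidence bound for one arm (UCB1-style).

    `c` is a hypothetical tunable constant; c = 2 matches the classical
    UCB1 bonus sqrt(2 * ln(n) / n_j).
    """
    if pulls == 0:
        # An arm that was never pulled gets priority.
        return math.inf
    return mean_reward + math.sqrt(c * math.log(total_pulls) / pulls)


def select_arm(means, counts):
    """Return the index of the arm with the largest UCB1 index."""
    total = sum(counts)
    scores = [ucb1_index(m, k, total) for m, k in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

In a tree search, `select_arm` would be called at every internal node on the way down, with each child treated as an arm.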
Bootstrapping Monte Carlo Tree Search with an Imperfect Heuristic
We consider the problem of using a heuristic policy to improve the value
approximation by the Upper Confidence Bound applied in Trees (UCT) algorithm in
non-adversarial settings such as planning with large-state space Markov
Decision Processes. Current improvements to UCT focus on either changing the
action selection formula at the internal nodes or the rollout policy at the
leaf nodes of the search tree. In this work, we propose to add an auxiliary arm
to each of the internal nodes, and always use the heuristic policy to roll out
simulations at the auxiliary arms. The method aims to get fast convergence to
optimal values at states where the heuristic policy is optimal, while retaining
an approximation similar to that of the original UCT in other states. We show that
bootstrapping with the proposed method in the new algorithm, UCT-Aux, performs
better than the original UCT algorithm and its variants in two benchmark
experiment settings. We also examine conditions under which UCT-Aux works well.
Comment: 16 pages, accepted for presentation at ECML'1
Exploration vs Exploitation vs Safety: Risk-averse Multi-Armed Bandits
Motivated by applications in energy management, this paper presents the
Multi-Armed Risk-Aware Bandit (MARAB) algorithm. With the goal of limiting the
exploration of risky arms, MARAB takes as arm quality its conditional value at
risk. When the user-supplied risk level goes to 0, the arm quality tends toward
the essential infimum of the arm distribution density, and MARAB tends toward
the MIN multi-armed bandit algorithm, aimed at the arm with maximal minimal
value. As a first contribution, this paper presents a theoretical analysis of
the MIN algorithm under mild assumptions, establishing its robustness
compared with UCB. The analysis is supported by extensive experimental
validation of MIN and MARAB compared to UCB and state-of-the-art risk-aware MAB
algorithms on artificial and real-world problems.
Comment: 16 page
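The arm quality described above, the conditional value at risk, can be estimated from observed rewards as the mean of the worst fraction of samples. The sketch below shows only that empirical estimate; it is not the MARAB algorithm itself, and the function name and the rounding rule for the number of samples kept are assumptions.

```python
import math


def empirical_cvar(rewards, alpha):
    """Mean of the worst `alpha` fraction of observed rewards.

    With alpha = 1 this is the plain mean; as alpha goes to 0 it tends to
    the minimum observed reward, mirroring the limit toward MIN.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    # Keep at least one sample so the estimate is always defined.
    k = max(1, math.ceil(alpha * len(rewards)))
    worst = sorted(rewards)[:k]
    return sum(worst) / k
```

For example, with rewards `[1, 2, 3, 4]` and `alpha = 0.5`, the two lowest rewards are averaged.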
Regret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem
This paper is devoted to regret lower bounds in the classical model of
stochastic multi-armed bandit. A well-known result of Lai and Robbins, which
was later extended by Burnetas and Katehakis, established the existence
of a logarithmic bound for all consistent policies. We relax the notion of
consistency, and exhibit a generalisation of the logarithmic bound. We also
show the non-existence of a logarithmic bound in the general case of Hannan
consistency. To get these results, we study variants of popular Upper
Confidence Bounds (UCB) policies. As a by-product, we prove that it is
impossible to design an adaptive policy that would select the best of two
algorithms by taking advantage of the properties of the environment.
Practical Open-Loop Optimistic Planning
We consider the problem of online planning in a Markov Decision Process when
given only access to a generative model, restricted to open-loop policies -
i.e. sequences of actions - and under a budget constraint. In this setting, the
Open-Loop Optimistic Planning (OLOP) algorithm enjoys good theoretical
guarantees but is overly conservative in practice, as we show in numerical
experiments. We propose a modified version of the algorithm with tighter
upper-confidence bounds, KLOLOP, that leads to better practical performance
while retaining the sample complexity bound. Finally, we propose an efficient
implementation that significantly improves the time complexity of both
algorithms.
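One common way to tighten an upper-confidence bound for rewards in [0, 1] is to invert a Kullback-Leibler divergence instead of using a Hoeffding-style bonus. The sketch below assumes that the "KL" in KLOLOP refers to bounds of this kind and that rewards are treated as Bernoulli; the function names, the `threshold` parameter and the bisection depth are illustrative assumptions, not the paper's implementation.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_upper_bound(mean, pulls, threshold, iters=50):
    """Largest q >= mean with pulls * KL(mean, q) <= threshold, by bisection.

    `threshold` plays the role of the exploration term (for example a
    logarithm of the budget); its exact form is an assumption here.
    """
    if pulls == 0:
        return 1.0
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= threshold:
            lo = mid  # mid is still plausible, so move the lower end up
        else:
            hi = mid  # mid is too optimistic, so move the upper end down
    return lo
```

By Pinsker's inequality, this bound never exceeds the Hoeffding-style value `mean + sqrt(threshold / (2 * pulls))`, which is one reason such bounds are less conservative in practice.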