Bandit games consist of single-state environments in which an agent must sequentially
choose actions to take, for which rewards are given. Since the objective is to maximise
the cumulative reward, the agent naturally seeks to build a model of the relationship
between actions and rewards. The agent must balance choosing uncertain actions in
order to improve its model (exploration) with choosing actions that the model predicts
will yield high rewards (exploitation). The choice of an action to take is called a play
of an arm of the bandit, and the total number of plays may or may not be known in
advance.
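One classic way of balancing these two pressures is to play the arm with the highest upper confidence bound on its mean reward, a heuristic we return to below. The following is a minimal sketch of such a rule (UCB1), written in Python; the Bernoulli arms and their reward probabilities are illustrative assumptions, not examples taken from this thesis.

```python
import math
import random

def ucb1(arms, n_plays):
    """Minimal UCB1 sketch: `arms` is a list of reward-sampling callables.
    Each arm is played once, then plays go to the arm whose empirical mean
    plus exploration bonus (an upper confidence bound) is largest."""
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)

    # Initialisation: play each arm once.
    for i, arm in enumerate(arms):
        sums[i] += arm()
        counts[i] = 1

    for t in range(len(arms) + 1, n_plays + 1):
        # Bonus shrinks as an arm is played more often, so rarely played
        # (uncertain) arms keep being explored.
        ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(len(arms))]
        i = max(range(len(arms)), key=lambda j: ucb[j])
        sums[i] += arms[i]()
        counts[i] += 1
    return counts

# Example: three Bernoulli arms with unknown means (illustrative values).
arms = [lambda p=p: float(random.random() < p) for p in (0.2, 0.5, 0.8)]
print(ucb1(arms, 1000))  # the best arm should dominate the play counts
```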
Algorithms designed to handle the exploration-exploitation dilemma were initially
motivated by problems with rather small numbers of actions, but the ideas on which
they are based have been extended to cases where the number of actions to choose from
is much larger than the maximum possible number of plays. Several problems fall into
this setting: information retrieval with relevance feedback, where the system must
learn what a user is looking for while serving relevant documents often enough, and
global optimisation, where the search for an optimum proceeds by selecting where
to acquire potentially expensive samples of a target function. What all of these have
in common is the search of large spaces.
In this thesis, we focus on an algorithm that combines the Gaussian Process probabilistic
model, often used in Bayesian optimisation, with the Upper Confidence Bound
action-selection heuristic that is popular in bandit algorithms. In addition to demonstrating
the advantages of this GP-UCB algorithm on an image retrieval problem, we
show how it can be adapted to search tree-structured spaces. We provide an
efficient implementation, theoretical guarantees on the algorithm's performance, and
empirical evidence on synthetic trees that it handles large branching factors better
than previous bandit-based algorithms.
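The selection rule at the heart of GP-UCB can be stated compactly: play the action maximising the posterior mean plus a multiple of the posterior standard deviation under the Gaussian Process model. Below is a minimal sketch of this rule on a discrete one-dimensional candidate set; it is not the implementation developed in this thesis, and the RBF kernel, the constant beta, and the toy objective are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, y, Xs, noise=1e-3):
    """GP posterior mean and standard deviation at test points Xs,
    given observations (X, y). Zero prior mean, RBF kernel."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha
    V = np.linalg.solve(K, Ks)
    # Prior variance k(x, x) is 1 for this kernel.
    var = np.ones(len(Xs)) - np.sum(Ks * V, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb(f, candidates, n_plays, beta=2.0):
    """GP-UCB sketch: each play maximises the posterior mean plus
    sqrt(beta) times the posterior standard deviation."""
    X, y = [candidates[0]], [f(candidates[0])]  # arbitrary first play
    for _ in range(n_plays - 1):
        mu, sd = gp_posterior(np.array(X), np.array(y), candidates)
        x = candidates[np.argmax(mu + np.sqrt(beta) * sd)]
        X.append(x)
        y.append(f(x))
    return np.array(X), np.array(y)

# Example: noisy evaluations of an unknown 1-D function (illustrative).
rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.05 * rng.standard_normal()
candidates = np.linspace(0.0, 2.0, 200)
X, y = gp_ucb(f, candidates, 30)
print(np.round(X[-5:], 3))  # later plays should concentrate near pi/6
```

Larger values of beta favour exploration of uncertain regions, while smaller values favour exploitation of the current posterior mean; the theoretical analyses of GP-UCB choose beta as a slowly growing function of the number of plays.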