The Exploration-Exploitation Trade-Off in Sequential Decision Making Problems
Sequential decision making problems require an agent to repeatedly choose an
action from a set of alternatives. Common to such problems is the exploration-exploitation
trade-off, where the agent must choose between taking the action expected to yield the best
reward (exploitation) and trying an alternative action for potential future benefit (exploration).
The main focus of this thesis is to understand in more detail the role this
trade-off plays in various important sequential decision making problems, in terms
of maximising finite-time reward.
The most common and best studied abstraction of the exploration-exploitation
trade-off is the classic multi-armed bandit problem. In this thesis we study several
important extensions that are better suited to real-world applications than the classic
problem. These extensions include scenarios where the rewards for actions
change over time, or where the presence of other agents must be repeatedly considered. In
these contexts, the exploration-exploitation trade-off has a more complicated role
in terms of maximising finite-time performance. For example, the amount of exploration
required changes constantly in a dynamic decision problem; in multiagent
problems, agents can explore through communication; and in repeated games, the
exploration-exploitation trade-off must be considered jointly with game-theoretic
reasoning.
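As an illustration of the classic problem discussed above (not taken from the thesis itself), a multi-armed bandit can be modelled as a set of arms, each paying a stochastic reward with a fixed, hidden success probability. The class name and arm probabilities below are hypothetical:

```python
import random

class BernoulliBandit:
    """A k-armed bandit whose arms pay reward 1 with fixed hidden probabilities."""

    def __init__(self, probs, seed=None):
        self.probs = list(probs)        # hidden success probability of each arm
        self.rng = random.Random(seed)  # private RNG for reproducibility

    def pull(self, arm):
        """Return a Bernoulli (0/1) reward for the chosen arm."""
        return 1 if self.rng.random() < self.probs[arm] else 0

# Example: three arms; the agent does not know 0.8 is the best arm.
bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=0)
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # empirical mean of arm 2; approaches its hidden probability
```

The exploration-exploitation trade-off arises because the agent only observes the rewards of the arms it actually pulls.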
Existing techniques for balancing exploration and exploitation focus on achieving
desirable asymptotic behaviour and are in general applicable only to basic decision
problems. The most flexible state-of-the-art approaches, ε-greedy and ε-first,
require exploration parameters to be set a priori, and the optimal values of these
parameters depend strongly on the problem faced. To overcome this, we construct a novel
algorithm, ε-ADAPT, which has no exploration parameters and can adapt its exploration
on-line for a wide range of problems. ε-ADAPT builds on newly proven theoretical
properties of the ε-first policy, and we demonstrate that ε-ADAPT can accurately
learn not only how much to explore, but also when and which actions to explore
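The two baseline policies named above can be sketched under their standard definitions: ε-greedy explores uniformly at random with probability ε on every round, while ε-first spends the first ε fraction of the horizon on pure exploration and exploits thereafter. This is an illustrative sketch only (it is not the thesis's ε-ADAPT algorithm), and all function names are hypothetical:

```python
import random

def run_policy(probs, horizon, choose_arm, seed=0):
    """Play a Bernoulli bandit for `horizon` rounds using the given action rule."""
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k    # number of pulls per arm
    values = [0.0] * k  # empirical mean reward per arm
    total = 0
    for t in range(horizon):
        arm = choose_arm(t, counts, values, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total

def eps_greedy(eps):
    """Explore a uniformly random arm with probability eps on every round."""
    def choose(t, counts, values, rng):
        if rng.random() < eps:
            return rng.randrange(len(values))
        return max(range(len(values)), key=values.__getitem__)  # greedy arm
    return choose

def eps_first(eps, horizon):
    """Explore uniformly for the first eps*horizon rounds, then always exploit."""
    cutoff = int(eps * horizon)
    def choose(t, counts, values, rng):
        if t < cutoff:
            return rng.randrange(len(values))
        return max(range(len(values)), key=values.__getitem__)
    return choose

# Hypothetical three-arm problem: compare total reward of the two policies.
probs = [0.2, 0.5, 0.8]
print(run_policy(probs, 1000, eps_greedy(0.1)))
print(run_policy(probs, 1000, eps_first(0.1, 1000)))
```

In both policies the exploration parameter ε is fixed in advance, which is precisely the limitation the abstract points to: the best choice of ε depends on the problem at hand, motivating a parameter-free, adaptive approach.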