Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits, together with upper and lower regret bounds for the problem that match
up to constant factors. Furthermore, we show that with a slight modification
the proposed algorithm can be applied to utility-based dueling bandits. We
obtain an improvement in the additive terms of the regret bound compared to
state-of-the-art algorithms (the additive terms dominate up to time horizons
that are exponential in the number of arms).
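To make the factored action structure concrete, here is a minimal Python sketch; the instance sizes, the reward oracle, and the epsilon-greedy baseline are all illustrative assumptions, not the paper's algorithm. The point it illustrates is that the learner can keep statistics per atomic action in each factor, O(sum |A_l|) of them, rather than one per composite action, O(prod |A_l|).

```python
import random
from collections import defaultdict

# Hypothetical factored instance: three factors with 3, 4 and 2 atomic
# actions; a composite action picks one atomic action per factor.
FACTORS = [3, 4, 2]

def pull(composite):
    """Stand-in reward oracle: noisy scalar reward for a composite action."""
    return sum(composite) + random.gauss(0.0, 1.0)  # toy reward, illustration only

# Statistics are kept per (factor, atomic action), never per composite action.
# This epsilon-greedy loop is only a baseline to show the decomposition;
# the paper's elimination algorithm is more refined.
counts = [defaultdict(int) for _ in FACTORS]
sums = [defaultdict(float) for _ in FACTORS]

def best(l):
    """Greedy atomic action for factor l (unseen actions tried first)."""
    return max(range(FACTORS[l]),
               key=lambda a: sums[l][a] / counts[l][a] if counts[l][a] else float("inf"))

for t in range(1000):
    composite = tuple(
        random.randrange(FACTORS[l]) if random.random() < 0.1 else best(l)
        for l in range(len(FACTORS)))
    r = pull(composite)
    for l, a in enumerate(composite):
        counts[l][a] += 1
        sums[l][a] += r
```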
The Sample-Complexity of General Reinforcement Learning
We present a new algorithm for general reinforcement learning where the true
environment is known to belong to a finite class of N arbitrary models. The
algorithm is shown to be near-optimal for all but O(N log^2 N) time-steps with
high probability. Infinite classes are also considered where we show that
compactness is a key criterion for determining the existence of uniform
sample-complexity bounds. A matching lower bound is given for the finite case.
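As a rough illustration of learning with a finite model class, the sketch below eliminates candidate models with a Hoeffding-style confidence test; the Bernoulli stand-in models and all names are hypothetical, and this generic skeleton is not the algorithm analysed in the paper.

```python
import math
import random

# Toy stand-in for a finite model class: each "model" is reduced to a
# Bernoulli observation probability, and the true environment is one of
# the N candidates (its identity is unknown to the learner).
models = [0.1, 0.3, 0.5, 0.7, 0.9]   # N = 5 hypothetical candidate models
true_p = 0.7                          # hidden truth, used only to simulate

alive = set(range(len(models)))
n_obs, total = 0, 0.0

for t in range(1, 5001):
    total += 1.0 if random.random() < true_p else 0.0
    n_obs += 1
    mean = total / n_obs
    # Hoeffding-style radius shrinking over time; a model whose prediction
    # leaves the confidence interval is eliminated and never readmitted.
    radius = math.sqrt(math.log(2.0 * len(models) * t * t) / (2.0 * n_obs))
    alive = {i for i in alive if abs(models[i] - mean) <= radius}

print("surviving models:", [models[i] for i in sorted(alive)])
```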
Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?
We study learning algorithms for the classical Markovian bandit problem with
discount. We explain how to adapt PSRL [24] and UCRL2 [2] to exploit the
problem structure. These variants are called MB-PSRL and MB-UCRL2. While the
regret bound and runtime of vanilla implementations of PSRL and UCRL2 are
exponential in the number of bandits, we show that the episodic regret of
MB-PSRL and MB-UCRL2 is $\tilde{O}(S\sqrt{nK})$, where $K$ is the number of
episodes, $n$ is the number of bandits and $S$ is the number of states of each
bandit (the exact bound in $S$, $n$ and $K$ is given in the paper). Up to a
factor of $\sqrt{S}$, this matches the lower bound of $\Omega(\sqrt{SnK})$ that
we also
derive in the paper. MB-PSRL is also computationally efficient: its runtime is
linear in the number of bandits. We further show that this linear runtime
cannot be achieved by adapting classical non-Bayesian algorithms such as UCRL2
or UCBVI to Markovian bandit problems. Finally, we perform numerical
experiments that confirm that MB-PSRL outperforms other existing algorithms in
practice, both in terms of regret and computation time.
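The scalability argument can be sketched as follows: because the posterior factorises across arms, each episode only requires $n$ independent samples of small $S$-state models, never a sample of the joint model over the exponential state space $S^n$. The schematic loop below is a sketch under assumed names ($n$, $S$, the count tables), not the paper's pseudocode.

```python
import random

# Schematic MB-PSRL episode loop. Structural point: one independent
# posterior per (arm, state) -- here plain transition counts initialised
# to 1, i.e. a uniform Dirichlet prior -- so sampling is linear in n.
n, S = 4, 3   # hypothetical number of arms and states per arm
counts = [[[1.0] * S for _ in range(S)] for _ in range(n)]

def sample_arm_model(arm):
    """Sample one arm's S x S transition matrix from its Dirichlet posterior."""
    model = []
    for s in range(S):
        # A Dirichlet draw as normalised independent Gamma draws.
        w = [random.gammavariate(c, 1.0) for c in counts[arm][s]]
        z = sum(w)
        model.append([x / z for x in w])
    return model

for episode in range(100):
    # n independent small samples; the joint exponential model is never built.
    sampled = [sample_arm_model(i) for i in range(n)]
    # Planning and the rollout are omitted: for a discounted Markovian
    # bandit the optimal policy for the sampled model is an index policy
    # (one index table per arm), after which each arm's counts are
    # updated from its observed transitions.
```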