1,976 research outputs found
Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with
unknown reward models. At each time, a player selects one arm to play, aiming
to maximize the total expected reward over a horizon of length T. An approach
based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is
developed for constructing sequential arm selection policies. It is shown that
for all light-tailed reward distributions, DSEE achieves the optimal
logarithmic order of the regret, where regret is defined as the total expected
reward loss against the ideal case with known reward models. For heavy-tailed
reward distributions, DSEE achieves O(T^1/p) regret when the moments of the
reward distributions exist up to the pth order for 1<p<=2 and O(T^1/(1+p/2))
for p>2. With the knowledge of an upperbound on a finite moment of the
heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret
order. The proposed DSEE approach complements existing work on MAB by providing
corresponding results for general reward distributions. Furthermore, with a
clearly defined tunable parameter-the cardinality of the exploration sequence,
the DSEE approach is easily extendable to variations of MAB, including MAB with
various objectives, decentralized MAB with multiple players and incomplete
reward observations under collisions, MAB with unknown Markov dynamics, and
combinatorial MAB with dependent arms that often arise in network optimization
problems such as the shortest path, the minimum spanning, and the dominating
set problems under unknown random weights.Comment: 22 pages, 2 figure
Stochastic Online Shortest Path Routing: The Value of Feedback
This paper studies online shortest path routing over multi-hop networks. Link
costs or delays are time-varying and modeled by independent and identically
distributed random processes, whose parameters are initially unknown. The
parameters, and hence the optimal path, can only be estimated by routing
packets through the network and observing the realized delays. Our aim is to
find a routing policy that minimizes the regret (the cumulative difference of
expected delay) between the path chosen by the policy and the unknown optimal
path. We formulate the problem as a combinatorial bandit optimization problem
and consider several scenarios that differ in where routing decisions are made
and in the information available when making the decisions. For each scenario,
we derive a tight asymptotic lower bound on the regret that has to be satisfied
by any online routing policy. These bounds help us to understand the
performance improvements we can expect when (i) taking routing decisions at
each hop rather than at the source only, and (ii) observing per-link delays
rather than end-to-end path delays. In particular, we show that (i) is of no
use while (ii) can have a spectacular impact. Three algorithms, with a
trade-off between computational complexity and performance, are proposed. The
regret upper bounds of these algorithms improve over those of the existing
algorithms, and they significantly outperform state-of-the-art algorithms in
numerical experiments.Comment: 18 page
Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards
In the classic multi-armed bandits problem, the goal is to have a policy for
dynamically operating arms that each yield stochastic rewards with unknown
means. The key metric of interest is regret, defined as the gap between the
expected total reward accumulated by an omniscient player that knows the reward
means for each arm, and the expected total reward accumulated by the given
policy. The policies presented in prior work have storage, computation and
regret all growing linearly with the number of arms, which is not scalable when
the number of arms is large. We consider in this work a broad class of
multi-armed bandits with dependent arms that yield rewards as a linear
combination of a set of unknown parameters. For this general framework, we
present efficient policies that are shown to achieve regret that grows
logarithmically with time, and polynomially in the number of unknown parameters
(even though the number of dependent arms may grow exponentially). Furthermore,
these policies only require storage that grows linearly in the number of
unknown parameters. We show that this generalization is broadly applicable and
useful for many interesting tasks in networks that can be formulated as
tractable combinatorial optimization problems with linear objective functions,
such as maximum weight matching, shortest path, and minimum spanning tree
computations
Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
While designing the state space of an MDP, it is common to include states
that are transient or not reachable by any policy (e.g., in mountain car, the
product space of speed and position contains configurations that are not
physically reachable). This leads to defining weakly-communicating or
multi-chain MDPs. In this paper, we introduce \tucrl, the first algorithm able
to perform efficient exploration-exploitation in any finite Markov Decision
Process (MDP) without requiring any form of prior knowledge. In particular, for
any MDP with communicating states, actions and
possible communicating next states,
we derive a regret bound, where is the diameter
(i.e., the longest shortest path) of the communicating part of the MDP. This is
in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that
suffer linear regret in weakly-communicating MDPs, as well as posterior
sampling or regularised algorithms (e.g., REGAL), which require prior knowledge
on the bias span of the optimal policy to bias the exploration to achieve
sub-linear regret. We also prove that in weakly-communicating MDPs, no
algorithm can ever achieve a logarithmic growth of the regret without first
suffering a linear regret for a number of steps that is exponential in the
parameters of the MDP. Finally, we report numerical simulations supporting our
theoretical findings and showing how TUCRL overcomes the limitations of the
state-of-the-art
- …