1,670 research outputs found
Stochastic Online Shortest Path Routing: The Value of Feedback
This paper studies online shortest path routing over multi-hop networks. Link
costs or delays are time-varying and modeled by independent and identically
distributed random processes, whose parameters are initially unknown. The
parameters, and hence the optimal path, can only be estimated by routing
packets through the network and observing the realized delays. Our aim is to
find a routing policy that minimizes the regret (the cumulative difference of
expected delay) between the path chosen by the policy and the unknown optimal
path. We formulate the problem as a combinatorial bandit optimization problem
and consider several scenarios that differ in where routing decisions are made
and in the information available when making the decisions. For each scenario,
we derive a tight asymptotic lower bound on the regret that has to be satisfied
by any online routing policy. These bounds help us to understand the
performance improvements we can expect when (i) taking routing decisions at
each hop rather than at the source only, and (ii) observing per-link delays
rather than end-to-end path delays. In particular, we show that (i) is of no
use while (ii) can have a spectacular impact. Three algorithms, with a
trade-off between computational complexity and performance, are proposed. The
regret upper bounds of these algorithms improve over those of the existing
algorithms, and they significantly outperform state-of-the-art algorithms in
numerical experiments.Comment: 18 page
Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits and up to constants matching upper and lower regret bounds for the
problem. Furthermore, we show that with a slight modification the proposed
algorithm can be applied to utility based dueling bandits. We obtain an
improvement in the additive terms of the regret bound compared to state of the
art algorithms (the additive terms are dominating up to time horizons which are
exponential in the number of arms)
- …