Learning Stochastic Shortest Path with Linear Function Approximation
We study the stochastic shortest path (SSP) problem in reinforcement learning
with linear function approximation, where the transition kernel is represented
as a linear mixture of unknown models. We call this class of SSP problems
linear mixture SSPs. We propose a novel algorithm with Hoeffding-type
confidence sets for learning the linear mixture SSP, which can attain an
$\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is
the number of episodes, $d$ is the dimension of the feature mapping in the
mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal
policy, and $c_{\min} > 0$ is the lower bound of the cost function. Our algorithm
also applies to the case when $c_{\min} = 0$, and an
$\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our
knowledge, this is the first algorithm with a sublinear regret guarantee for
learning linear mixture SSP. Moreover, we design a refined Bernstein-type
confidence set and propose an improved algorithm, which provably achieves an
$\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. In complement to
the regret upper bounds, we also prove a lower bound of
$\Omega(d B_{\star}\sqrt{K})$. Hence, our improved algorithm matches the lower
bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a
near-optimal regret guarantee. Comment: 46 pages, 1 figure. In ICML 202
Stochastic Online Shortest Path Routing: The Value of Feedback
This paper studies online shortest path routing over multi-hop networks. Link
costs or delays are time-varying and modeled by independent and identically
distributed random processes, whose parameters are initially unknown. The
parameters, and hence the optimal path, can only be estimated by routing
packets through the network and observing the realized delays. Our aim is to
find a routing policy that minimizes the regret (the cumulative difference of
expected delay) between the path chosen by the policy and the unknown optimal
path. We formulate the problem as a combinatorial bandit optimization problem
and consider several scenarios that differ in where routing decisions are made
and in the information available when making the decisions. For each scenario,
we derive a tight asymptotic lower bound on the regret that has to be satisfied
by any online routing policy. These bounds help us to understand the
performance improvements we can expect when (i) taking routing decisions at
each hop rather than at the source only, and (ii) observing per-link delays
rather than end-to-end path delays. In particular, we show that (i) is of no
use while (ii) can have a spectacular impact. Three algorithms, with a
trade-off between computational complexity and performance, are proposed. The
regret upper bounds of these algorithms improve over those of the existing
algorithms, and they significantly outperform state-of-the-art algorithms in
numerical experiments. Comment: 18 page
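The regret notion in the abstract above (the cumulative difference in expected delay between the chosen paths and the optimal path) can be illustrated with a minimal sketch. The link names, mean delays, and helper functions below are hypothetical and are not taken from the paper. The sketch evaluates a fixed sequence of routing choices against known means, which a learning policy would not have access to.

```python
def path_delay(mean_delay, path):
    """Expected end-to-end delay of a path, given per-link mean delays."""
    return sum(mean_delay[link] for link in path)


def cumulative_regret(mean_delay, candidate_paths, chosen):
    """Sum over rounds of (expected delay of chosen path - expected delay of best path)."""
    best = min(path_delay(mean_delay, p) for p in candidate_paths)
    return sum(path_delay(mean_delay, p) - best for p in chosen)


# Hypothetical two-path network: "top" uses links e1 and e2, "bottom" uses e3.
mean_delay = {"e1": 1.0, "e2": 2.0, "e3": 4.0}
top, bottom = ("e1", "e2"), ("e3",)

# A policy that alternates between the two paths for four rounds.
regret = cumulative_regret(mean_delay, [top, bottom], [bottom, top, bottom, top])
```

Here `top` is optimal, so only the two rounds spent on `bottom` contribute to the regret.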
A linear programming based heuristic framework for min-max regret combinatorial optimization problems with interval costs
This work deals with a class of problems under interval data uncertainty,
namely interval robust-hard problems, composed of interval data min-max regret
generalizations of classical NP-hard combinatorial problems modeled as 0-1
integer linear programming problems. These problems are more challenging than
other interval data min-max regret problems, as solely computing the cost of
any feasible solution requires solving an instance of an NP-hard problem. The
state-of-the-art exact algorithms in the literature are based on the generation
of a possibly exponential number of cuts. As each cut separation involves the
resolution of an NP-hard classical optimization problem, the size of the
instances that can be solved efficiently is relatively small. To mitigate this
issue, we present a modeling technique for interval robust-hard problems in the
context of a heuristic framework. The heuristic obtains feasible solutions by
exploring dual information of a linearly relaxed model associated with the
classical optimization problem counterpart. Computational experiments for
interval data min-max regret versions of the restricted shortest path problem
and the set covering problem show that our heuristic is able to find optimal or
near-optimal solutions and also improves the primal bounds obtained by a
state-of-the-art exact algorithm and a 2-approximation procedure for interval
data min-max regret problems.
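As a simplified illustration of the min-max regret criterion with interval costs, the sketch below evaluates the maximum regret of a path in the plain (unrestricted) shortest path problem, not the NP-hard restricted variant studied above. It relies on the standard observation that, for interval data, the worst-case scenario for a given solution puts its own elements at their upper bounds and all others at their lower bounds. The graph, the intervals, and the function names are hypothetical.

```python
import heapq


def shortest_cost(graph, source, target):
    """Dijkstra over {node: {neighbor: cost}}; returns the cheapest source-target cost."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def max_regret(intervals, path):
    """Worst-case regret of a simple `path` when each edge cost lies in [lo, hi]."""
    on_path = set(zip(path, path[1:]))
    # Worst-case scenario: upper bounds on the path's edges, lower bounds elsewhere.
    scenario = {}
    for (u, v), (lo, hi) in intervals.items():
        scenario.setdefault(u, {})[v] = hi if (u, v) in on_path else lo
    path_cost = sum(scenario[u][v] for u, v in on_path)
    return path_cost - shortest_cost(scenario, path[0], path[-1])


# Hypothetical instance: two s-t paths with interval edge costs.
intervals = {
    ("s", "a"): (1, 3),
    ("a", "t"): (1, 3),
    ("s", "b"): (2, 2),
    ("b", "t"): (2, 5),
}
regret_via_a = max_regret(intervals, ["s", "a", "t"])
regret_via_b = max_regret(intervals, ["s", "b", "t"])
```

In this instance the path through `a` has the smaller maximum regret, so it would be the min-max regret solution among the two candidates. For interval robust-hard problems, the inner call to `shortest_cost` would itself be an NP-hard computation, which is what makes evaluating a single solution expensive.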
Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
While designing the state space of an MDP, it is common to include states
that are transient or not reachable by any policy (e.g., in mountain car, the
product space of speed and position contains configurations that are not
physically reachable). This leads to defining weakly-communicating or
multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able
to perform efficient exploration-exploitation in any finite Markov Decision
Process (MDP) without requiring any form of prior knowledge. In particular, for
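any MDP with $S^{\mathtt{C}}$ communicating states, $A$ actions and
$\Gamma^{\mathtt{C}} \leq S^{\mathtt{C}}$ possible communicating next states,
we derive a $\widetilde{O}(D^{\mathtt{C}} \sqrt{\Gamma^{\mathtt{C}} S^{\mathtt{C}} A T})$
regret bound, where $D^{\mathtt{C}}$ is the diameter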
(i.e., the longest shortest path) of the communicating part of the MDP. This is
in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that
suffer linear regret in weakly-communicating MDPs, as well as posterior
sampling or regularised algorithms (e.g., REGAL), which require prior knowledge
on the bias span of the optimal policy to bias the exploration to achieve
sub-linear regret. We also prove that in weakly-communicating MDPs, no
algorithm can ever achieve a logarithmic growth of the regret without first
suffering a linear regret for a number of steps that is exponential in the
parameters of the MDP. Finally, we report numerical simulations supporting our
theoretical findings and showing how TUCRL overcomes the limitations of the
state-of-the-art.
On solving discrete optimization problems with one random element under general regret functions
In this paper, we consider the class of stochastic discrete optimization problems in which the feasibility of a solution does not depend on the particular values the random elements in the problem take. Given a regret function, we introduce the concept of the risk associated with a solution, and define an optimal solution as one having the least possible risk. We show that for discrete optimization problems with one random element and with min-sum objective functions, a least-risk solution for the stochastic problem can be obtained by solving a non-stochastic counterpart, where the latter is constructed by replacing the random element of the former with a suitable parameter. We show that this surrogate parameter is the mean if the stochastic problem has only one symmetrically distributed random element. We obtain bounds for this parameter for certain classes of asymmetric distributions and study the limiting behavior of this parameter in detail under two asymptotic frameworks.