
    Learning Stochastic Shortest Path with Linear Function Approximation

    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which attains an $\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case $c_{\min}=0$, for which an $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSPs. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves an $\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. Complementing the regret upper bounds, we also prove a lower bound of $\Omega(d B_{\star}\sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.
    Comment: 46 pages, 1 figure. In ICML 202
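
    For readers unfamiliar with the setting, the display below spells out the linear mixture assumption and the regret being bounded. The notation (basis models $\mathbb{P}_i$, initial state $s_{\mathrm{init}}$, episode lengths $T_k$) is ours, not the abstract's; it is a sketch of the standard formulation, not a statement of the paper's exact definitions.

        % Linear mixture transition kernel: d known basis models, unknown parameter theta*.
        \[
          \mathbb{P}(s' \mid s, a) \;=\; \sum_{i=1}^{d} \theta^{*}_{i}\, \mathbb{P}_{i}(s' \mid s, a),
          \qquad \theta^{*} \in \mathbb{R}^{d} \ \text{unknown}.
        \]
        % Regret over K episodes; B_star bounds the optimal expected total cost from the initial state.
        \[
          \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \sum_{t=1}^{T_k} c(s_{k,t}, a_{k,t})
          \;-\; K \, V^{\pi^{*}}(s_{\mathrm{init}}),
          \qquad V^{\pi^{*}}(s_{\mathrm{init}}) \le B_{\star}.
        \]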

    Stochastic Online Shortest Path Routing: The Value of Feedback

    This paper studies online shortest path routing over multi-hop networks. Link costs or delays are time-varying and modeled by independent and identically distributed random processes, whose parameters are initially unknown. The parameters, and hence the optimal path, can only be estimated by routing packets through the network and observing the realized delays. Our aim is to find a routing policy that minimizes the regret (the cumulative difference of expected delay) between the path chosen by the policy and the unknown optimal path. We formulate the problem as a combinatorial bandit optimization problem and consider several scenarios that differ in where routing decisions are made and in the information available when making the decisions. For each scenario, we derive a tight asymptotic lower bound on the regret that has to be satisfied by any online routing policy. These bounds help us to understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use while (ii) can have a spectacular impact. Three algorithms, with a trade-off between computational complexity and performance, are proposed. The regret upper bounds of these algorithms improve over those of the existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments.
    Comment: 18 page
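
    To make the per-link-feedback setting concrete, here is a minimal sketch of a generic link-level optimistic index policy with semi-bandit feedback (observed delays on every traversed link). It is a standard CUCB-style baseline, not one of the three algorithms proposed in the paper; the names lcb_routing and sample_delay are illustrative, and sample_delay stands in for the unknown stochastic link-delay process.

        # Sketch of a link-level lower-confidence-bound routing policy with
        # semi-bandit feedback: per-link delays of the chosen path are observed
        # after each round and used to refine the estimates.
        import heapq
        import math
        from collections import defaultdict

        def shortest_path(nodes, adj, weight, source, target):
            """Dijkstra over non-negative edge weights; returns the path as a
            list of (u, v) edges. Assumes target is reachable from source."""
            dist = {v: math.inf for v in nodes}
            prev = {}
            dist[source] = 0.0
            pq = [(0.0, source)]
            while pq:
                d, u = heapq.heappop(pq)
                if d > dist[u]:
                    continue
                for v in adj.get(u, ()):
                    nd = d + weight[(u, v)]
                    if nd < dist[v]:
                        dist[v] = nd
                        prev[v] = u
                        heapq.heappush(pq, (nd, v))
            path, v = [], target
            while v != source:
                path.append((prev[v], v))
                v = prev[v]
            return list(reversed(path))

        def lcb_routing(nodes, adj, sample_delay, source, target, horizon):
            """Each round: route along the shortest path under optimistic
            (lower-confidence-bound) per-link delay estimates, then update the
            estimate of every traversed link."""
            mean = defaultdict(float)   # empirical mean delay per link
            count = defaultdict(int)    # number of observations per link
            for t in range(1, horizon + 1):
                weight = {}
                for u in nodes:
                    for v in adj.get(u, ()):
                        e = (u, v)
                        if count[e] == 0:
                            weight[e] = 0.0  # optimistic value forces exploration of unseen links
                        else:
                            bonus = math.sqrt(1.5 * math.log(t) / count[e])
                            weight[e] = max(0.0, mean[e] - bonus)
                path = shortest_path(nodes, adj, weight, source, target)
                for e in path:
                    d = sample_delay(e)                  # observed per-link delay
                    count[e] += 1
                    mean[e] += (d - mean[e]) / count[e]  # incremental mean update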

    A linear programming based heuristic framework for min-max regret combinatorial optimization problems with interval costs

    This work deals with a class of problems under interval data uncertainty, namely interval robust-hard problems, composed of interval data min-max regret generalizations of classical NP-hard combinatorial problems modeled as 0-1 integer linear programming problems. These problems are more challenging than other interval data min-max regret problems, as solely computing the cost of any feasible solution requires solving an instance of an NP-hard problem. The state-of-the-art exact algorithms in the literature are based on the generation of a possibly exponential number of cuts. As each cut separation involves the resolution of an NP-hard classical optimization problem, the size of the instances that can be solved efficiently is relatively small. To mitigate this issue, we present a modeling technique for interval robust-hard problems in the context of a heuristic framework. The heuristic obtains feasible solutions by exploring dual information of a linearly relaxed model associated with the classical optimization problem counterpart. Computational experiments on interval data min-max regret versions of the restricted shortest path problem and the set covering problem show that our heuristic is able to find optimal or near-optimal solutions and also improves the primal bounds obtained by a state-of-the-art exact algorithm and by a 2-approximation procedure for interval data min-max regret problems.
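
    As a hedged aside (notation ours), the standard interval min-max regret formulation behind this abstract, and the well-known extreme-scenario observation that explains why even evaluating a single solution is NP-hard for robust-hard problems, can be written as follows.

        % Min-max regret over interval costs c_i in [l_i, u_i], feasible set X \subseteq \{0,1\}^n.
        \[
          \min_{x \in X} \;\; \max_{c \in \prod_{i}[l_i,\, u_i]}
          \Big( c^{\top} x \;-\; \min_{y \in X} c^{\top} y \Big)
        \]
        % For a fixed feasible x, the inner maximum is attained at the extreme scenario
        %   c_i(x) = u_i if x_i = 1,  c_i(x) = l_i otherwise,
        % so computing the regret of x already requires solving one instance of the
        % NP-hard classical problem  \min_{y \in X} c(x)^{\top} y.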

    Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

    While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{\texttt{C}}$ communicating states, $A$ actions and $\Gamma^{\texttt{C}} \leq S^{\texttt{C}}$ possible communicating next states, we derive a $\widetilde{O}(D^{\texttt{C}} \sqrt{\Gamma^{\texttt{C}} S^{\texttt{C}} A T})$ regret bound, where $D^{\texttt{C}}$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.
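
    A hedged sketch of the quantities involved, using the standard average-reward notation that the abstract leaves implicit (optimal gain $\rho^{*}$, collected rewards $r_t$, hitting time $\tau_{\pi}$); these are our symbols, not the paper's.

        % Regret after T steps with respect to the optimal average reward (gain).
        \[
          \Delta(T) \;=\; T \rho^{*} \;-\; \sum_{t=1}^{T} r_{t}
        \]
        % Diameter of the communicating part: worst-case expected travel time between
        % any two communicating states under the best policy for reaching the target.
        \[
          D^{\texttt{C}} \;=\; \max_{s \neq s' \in S^{\texttt{C}}} \;\; \min_{\pi} \;
          \mathbb{E}\big[\tau_{\pi}(s \rightarrow s')\big]
        \]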

    On solving discrete optimization problems with one random element under general regret functions

    In this paper we consider the class of stochastic discrete optimization problems in which the feasibility of a solution does not depend on the particular values the random elements in the problem take. Given a regret function, we introduce the concept of the risk associated with a solution, and define an optimal solution as one having the least possible risk. We show that for discrete optimization problems with one random element and with min-sum objective functions, a least-risk solution for the stochastic problem can be obtained by solving a non-stochastic counterpart, where the latter is constructed by replacing the random element of the former with a suitable parameter. We show that this surrogate parameter is the mean if the stochastic problem has only one symmetrically distributed random element. We obtain bounds on this parameter for certain classes of asymmetric distributions and study its limiting behavior in detail under two asymptotic frameworks.
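
    One plausible formalization of the risk notion sketched above; the abstract does not spell it out, so the display is our reading, with a nondecreasing regret function $r$, a single random element $\xi$, objective $f(x,\xi)$, and feasible set $X$.

        % Risk of a feasible solution x under regret function r and random element xi.
        \[
          \mathrm{risk}(x) \;=\; \mathbb{E}_{\xi}\Big[\, r\big( f(x, \xi) \;-\; \min_{y \in X} f(y, \xi) \big) \Big],
          \qquad
          x^{\mathrm{opt}} \in \arg\min_{x \in X} \mathrm{risk}(x).
        \]
        % The abstract's claim: a least-risk solution solves the deterministic counterpart
        % with xi replaced by a suitable parameter xi*; for a single symmetrically
        % distributed xi, one may take xi* = E[xi].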