
    Online Markov decision processes under bandit feedback

    Abstract—We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of $O(T^{2/3} \ln T)$. In this paper, assuming that stationary policies mix uniformly fast, we show that after $T$ time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is $O(\sqrt{T} \ln T)$, giving the first rigorously proven, essentially tight regret bound for the problem.
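
    The regret criterion above can be made concrete with a small numerical sketch. The toy two-state MDP, the uniform-random learner, and all constants below are hypothetical illustrations, not the paper's algorithm (or its modified version); the sketch only shows what competing with the best stationary policy in hindsight means when the transition kernel is known and the reward functions are fixed in advance by an oblivious adversary.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, T = 2, 2, 500

    # Known transition kernel P[s, a, s'] (the setting assumes the agent knows it).
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

    # Oblivious adversary: the whole sequence of reward functions r_t(s, a) in [0, 1]
    # is fixed before the interaction starts.
    rewards = rng.random((T, n_states, n_actions))

    def expected_total_reward(policy):
        """Expected cumulative reward of a stationary deterministic policy,
        obtained by propagating the state distribution through the known kernel."""
        dist = np.zeros(n_states)
        dist[0] = 1.0  # start in state 0
        total = 0.0
        for t in range(T):
            total += sum(dist[s] * rewards[t, s, policy[s]] for s in range(n_states))
            dist = sum(dist[s] * P[s, policy[s]] for s in range(n_states))
        return total

    def uniform_learner_reward():
        """Expected cumulative reward of a naive learner that acts uniformly at random."""
        dist = np.zeros(n_states)
        dist[0] = 1.0
        total = 0.0
        for t in range(T):
            total += dist @ rewards[t].mean(axis=1)
            dist = dist @ P.mean(axis=1)  # kernel averaged over the uniform action choice
        return total

    # Best stationary deterministic policy in hindsight (enumeration works only at toy sizes).
    best = max(expected_total_reward(pi)
               for pi in itertools.product(range(n_actions), repeat=n_states))

    regret = best - uniform_learner_reward()
    print(f"regret of the uniform learner after T={T} steps: {regret:.1f}")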

    Stochastic Online Shortest Path Routing: The Value of Feedback

    This paper studies online shortest path routing over multi-hop networks. Link costs or delays are time-varying and modeled by independent and identically distributed random processes, whose parameters are initially unknown. The parameters, and hence the optimal path, can only be estimated by routing packets through the network and observing the realized delays. Our aim is to find a routing policy that minimizes the regret (the cumulative difference of expected delay) between the path chosen by the policy and the unknown optimal path. We formulate the problem as a combinatorial bandit optimization problem and consider several scenarios that differ in where routing decisions are made and in the information available when making the decisions. For each scenario, we derive a tight asymptotic lower bound on the regret that has to be satisfied by any online routing policy. These bounds help us to understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use while (ii) can have a spectacular impact. Three algorithms, with a trade-off between computational complexity and performance, are proposed. The regret upper bounds of these algorithms improve over those of the existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments.
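
    A minimal sketch of the semi-bandit (per-link) feedback model may help build intuition. The four-node network, the three candidate paths, the exponential delay distributions, and the epsilon-greedy rule below are all hypothetical; this is a generic baseline illustrating how per-link observations feed estimates of each link's mean delay, not one of the three algorithms proposed in the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical network: links with unknown mean delays, three candidate a->c paths.
    true_mean = {"ab": 1.0, "bc": 1.0, "ad": 2.0, "dc": 0.5, "ac": 3.0}
    paths = [["ab", "bc"], ["ad", "dc"], ["ac"]]
    opt_delay = min(sum(true_mean[l] for l in p) for p in paths)

    est_sum = {l: 0.0 for l in true_mean}  # running sums of observed per-link delays
    est_cnt = {l: 0 for l in true_mean}

    T, eps, regret = 5000, 0.1, 0.0
    for t in range(T):
        if rng.random() < eps or min(est_cnt.values()) == 0:
            path = paths[rng.integers(len(paths))]  # explore a random path
        else:
            # Exploit: pick the path with the smallest estimated total delay.
            path = min(paths, key=lambda p: sum(est_sum[l] / est_cnt[l] for l in p))
        # Semi-bandit feedback: the realized delay of every link on the chosen
        # path is observed and used to update that link's estimate.
        for l in path:
            est_sum[l] += rng.exponential(true_mean[l])
            est_cnt[l] += 1
        regret += sum(true_mean[l] for l in path) - opt_delay

    print(f"cumulative expected regret after {T} packets: {regret:.1f}")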

    Learning Adversarial Markov Decision Processes with Delayed Feedback

    Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed with a delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_k d^k$ is the total delay. Under bandit feedback, we prove a similar $\sqrt{K + D}$ regret bound assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
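
    The delayed-feedback protocol can be illustrated with a short sketch. The delay values below are arbitrary, and the buffering learner is a placeholder rather than the paper's policy-optimization algorithms; the point is only that the costs and trajectory of episode $k$ become usable at the end of episode $k + d^k$, and that $D = \sum_k d^k$ enters the bounds quoted above.

    from collections import defaultdict

    K = 10
    delays = [0, 3, 1, 0, 4, 2, 0, 1, 1, 0]  # hypothetical oblivious delays d^k
    D = sum(delays)                          # total delay D = sum_k d^k

    # Map each arrival episode to the episodes whose costs become visible then.
    pending = defaultdict(list)
    for k in range(K):
        pending[k + delays[k]].append(k)

    received = []
    for k in range(K):
        # ... the learner plays episode k using only the feedback received so far ...
        # At the end of episode k, the costs and trajectory of every episode j
        # with j + d^j = k are revealed and can be used to update the policy.
        for j in pending.pop(k, []):
            received.append(j)

    print(f"K = {K}, D = {D}; feedback arrives in order {received}")
    print(f"the full-information bound above scales like sqrt(K + D) = {(K + D) ** 0.5:.2f}")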

    Chasing Ghosts: Competing with Stateful Policies

    We consider sequential decision making in a setting where regret is measured with respect to a set of stateful reference policies, and feedback is limited to observing the rewards of the actions performed (the so-called "bandit" setting). If either the reference policies are stateless rather than stateful, or the feedback includes the rewards of all actions (the so-called "expert" setting), previous work shows that the optimal regret grows like $\Theta(\sqrt{T})$ in terms of the number of decision rounds $T$. The difficulty in our setting is that the decision maker unavoidably loses track of the internal states of the reference policies, and thus cannot reliably attribute rewards observed in a certain round to any of the reference policies. In fact, in this setting it is impossible for the algorithm to estimate which policy gives the highest (or even approximately highest) total reward. Nevertheless, we design an algorithm that achieves expected regret that is sublinear in $T$, of the form $O(T/\log^{1/4} T)$. Our algorithm is based on a certain local repetition lemma that may be of independent interest. We also show that no algorithm can guarantee expected regret better than $O(T/\log^{3/2} T)$.
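
    A toy example of a stateful reference policy may clarify the difficulty described above. The two-state reference policy and the uniform-random learner below are hypothetical and do not reflect the paper's algorithm; they only illustrate that under bandit feedback the learner observes the reward of its own action, while the reference policy's internal state (and hence its reward) stays hidden.

    import numpy as np

    rng = np.random.default_rng(2)
    T, n_actions = 20, 3

    class StatefulReferencePolicy:
        """Hypothetical two-state reference policy: its action depends on an
        internal state that evolves on its own and is invisible to the learner."""
        def __init__(self):
            self.state = 0
        def act(self):
            action = 0 if self.state == 0 else 1
            self.state = 1 - self.state  # internal transition the learner never sees
            return action

    reference = StatefulReferencePolicy()
    rewards = rng.random((T, n_actions))  # oblivious adversary fixes all rewards up front

    learner_total, reference_total = 0.0, 0.0
    for t in range(T):
        learner_action = rng.integers(n_actions)     # placeholder learner (uniform)
        learner_total += rewards[t, learner_action]  # bandit feedback: only this reward is seen
        reference_total += rewards[t, reference.act()]  # never revealed to the learner

    print(f"regret against this single reference policy: {reference_total - learner_total:.2f}")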