Online Resource Allocation in Episodic Markov Decision Processes

Abstract

This paper studies a long-term resource allocation problem over multiple periods, where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve upon the existing decide-then-observe regime; the two regimes differ in how observations and feedback about the reward and resource consumption functions are revealed to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde O(\rho^{-1} H^{3/2} S\sqrt{AT})$, where $\rho\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde O(\rho^{-1} H^{3/2} S\sqrt{AT})$ with high probability. We test the numerical efficiency of our method on a variant of the resource-constrained inventory management problem.
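
To make the high-level description of the dual approach concrete, the following is a minimal, illustrative sketch of the dual-mirror-descent idea for a single budgeted resource, not the authors' exact algorithm. All names and interfaces here (`solve_adjusted_mdp`, the `env` step signature, the step size `eta`, the total budget `rho * H * T`, and the dual cap) are assumptions introduced for this example; in particular, a Euclidean mirror map (projected gradient) is used in place of whatever mirror map the paper analyzes.

```python
def online_dual_mirror_descent(env, T, H, rho, eta, solve_adjusted_mdp):
    """Sketch of online dual mirror descent for a constrained episodic MDP.

    Assumed interfaces (hypothetical, for illustration only):
      - solve_adjusted_mdp(env, lam) -> policy: returns a policy that is
        (near-)optimal for the Lagrangian reward r(s, a) - lam * c(s, a),
        e.g., computed by an optimistic planner when transitions are unknown.
      - env.reset() -> initial state; env.step(a) -> (next_state, reward, consumption).
    """
    lam = 0.0                      # dual variable for the resource constraint
    lam_max = 1.0 / rho            # illustrative cap on the dual variable
    total_reward = 0.0
    remaining_budget = rho * H * T  # assumed total budget over T episodes

    for t in range(T):
        # Primal step: act with a policy for the Lagrangian-adjusted MDP.
        policy = solve_adjusted_mdp(env, lam)
        s = env.reset()
        ep_reward, ep_consumption = 0.0, 0.0
        for h in range(H):
            a = policy(s, h)
            s, r, c = env.step(a)
            ep_reward += r
            ep_consumption += c
        total_reward += ep_reward
        remaining_budget -= ep_consumption

        # Dual step: the (sub)gradient of the dual objective in lam is
        # (rho * H - consumption); with a Euclidean mirror map this is a
        # projected gradient update clipped to [0, lam_max].
        lam = lam - eta * (rho * H - ep_consumption)
        lam = min(max(lam, 0.0), lam_max)

    return total_reward, remaining_budget
```

The key design idea this sketch captures is that the dual variable acts as a per-unit price on resource consumption: episodes that overspend relative to the budget rate `rho * H` push the price up, making the next episode's policy more conservative, while underspending lets the price decay toward zero.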
