Large-Scale Markov Decision Problems via the Linear Programming Dual
We consider the problem of controlling a fully specified Markov decision
process (MDP), also known as the planning problem, when the state space is very
large and calculating the optimal policy is intractable. Instead, we pursue the
more modest goal of optimizing over some small family of policies.
Specifically, we show that the family of policies associated with a
low-dimensional approximation of occupancy measures yields a tractable
optimization. Moreover, we propose an efficient algorithm, scaling with the
size of the subspace but not the state space, that is able to find a policy
with low excess loss relative to the best policy in this class. To the best of
our knowledge, such results did not exist in the literature previously. We
bound excess loss in the average cost and discounted cost cases, which are
treated separately. Preliminary experiments show the effectiveness of the
proposed algorithms in a queueing application.
Comment: 53 pages. arXiv admin note: text overlap with arXiv:1402.676
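
The LP-dual view above reduces planning to a linear program over occupancy measures, and restricting the measure to a subspace, $\mu \approx \Phi\theta$, keeps the program small. Below is a minimal sketch of that idea, assuming a toy tabular MDP, a random nonnegative feature matrix, and an L1 penalty on flow-constraint violations; these are illustrative stand-ins, not the paper's construction.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, d = 20, 3, 5                           # states, actions, subspace dimension
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
c = rng.random((S, A)).ravel()               # per-step costs, flattened over (s, a)
Phi = rng.random((S * A, d))                 # nonnegative features (assumption)
H = 10.0                                     # penalty weight on flow violations

# Flow-constraint matrix: (B mu)[s] = sum_a mu(s,a) - sum_{s',a'} P(s|s',a') mu(s',a')
B = np.zeros((S, S * A))
for s in range(S):
    for a in range(A):
        B[s, s * A + a] += 1.0
        B[:, s * A + a] -= P[s, a]

# Variables x = [theta (free); u (>= 0, one slack per state)].
# Minimize c^T Phi theta + H * sum(u)
# s.t.  Phi theta >= 0,  1^T Phi theta = 1,  |B Phi theta| <= u.
obj = np.concatenate([Phi.T @ c, H * np.ones(S)])
A_ub = np.block([
    [-Phi,       np.zeros((S * A, S))],      # -Phi theta <= 0
    [B @ Phi,    -np.eye(S)],                #  B Phi theta - u <= 0
    [-(B @ Phi), -np.eye(S)],                # -B Phi theta - u <= 0
])
A_eq = np.concatenate([Phi.sum(axis=0), np.zeros(S)])[None, :]
res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(A_ub.shape[0]), A_eq=A_eq,
              b_eq=[1.0], bounds=[(None, None)] * d + [(0, None)] * S,
              method="highs")

# Recover a policy from the approximate occupancy measure.
mu = np.clip(Phi @ res.x[:d], 1e-12, None).reshape(S, A)
pi = mu / mu.sum(axis=1, keepdims=True)      # pi(a|s) proportional to mu(s, a)
print("estimated average cost:", float(c @ mu.ravel() / mu.sum()))
```

The LP has $d + |S|$ variables rather than $|S||A|$, which is the sense in which the method scales with the subspace instead of the state space.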
Large Scale Markov Decision Processes with Changing Rewards
We consider Markov Decision Processes (MDPs) where the rewards are unknown
and may change in an adversarial manner. We provide an algorithm that achieves
state-of-the-art regret bound of $O(\sqrt{\tau(\ln|S| + \ln|A|)T}\,\ln T)$,
where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing
time of the MDP, and $T$ is the number of periods. The algorithm's
computational complexity is polynomial in $|S|$ and $|A|$ per period. We then
consider a setting often encountered in practice, where the state space of the
MDP is too large to allow for exact solutions. By approximating the
state-action occupancy measures with a linear architecture of dimension
$d \ll |S||A|$, we propose a modified algorithm with computational complexity
polynomial in $d$. We also prove a regret bound for this modified algorithm,
which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$
regret bound for large-scale MDPs with changing rewards.
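
The abstract casts the changing-reward MDP as online linear optimization over the polytope of state-action occupancy measures. The sketch below illustrates that reduction using follow-the-perturbed-leader, a simpler no-regret stand-in for the paper's algorithm; the toy MDP, the reward sequence, and the perturbation scale are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
S, A, T = 10, 2, 50
P = rng.dirichlet(np.ones(S), size=(S, A))

# Occupancy-measure polytope: mu >= 0, sum(mu) = 1, and stationarity B mu = 0.
B = np.zeros((S, S * A))
for s in range(S):
    for a in range(A):
        B[s, s * A + a] += 1.0
        B[:, s * A + a] -= P[s, a]
A_eq = np.vstack([B, np.ones((1, S * A))])
b_eq = np.concatenate([np.zeros(S), [1.0]])

def best_mu(reward_vec):
    """Reward-maximizing occupancy measure: a linear program over the polytope."""
    res = linprog(-reward_vec, A_eq=A_eq, b_eq=b_eq, bounds=(0, None),
                  method="highs")
    return res.x

cum_r = np.zeros(S * A)
total, eta = 0.0, float(np.sqrt(T))          # perturbation scale (illustrative)
for t in range(T):
    r_t = rng.random(S * A)                  # adversarial reward, revealed only
                                             # after mu_t is chosen below
    mu_t = best_mu(cum_r + eta * rng.exponential(size=S * A))
    total += float(r_t @ mu_t)
    cum_r += r_t

best = float(cum_r @ best_mu(cum_r))         # best fixed policy in hindsight
print(f"algorithm: {total:.2f}, best in hindsight: {best:.2f}")
```

In the paper's large-scale variant, the per-round optimization would run over the $d$-dimensional coefficients of the linear architecture rather than over the full $|S||A|$-dimensional measure.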
A maximum-entropy approach to off-policy evaluation in average-reward MDPs
This work focuses on off-policy evaluation (OPE) with function approximation
in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs
that are ergodic and linear (i.e. where rewards and dynamics are linear in some
known features), we provide the first finite-sample OPE error bound, extending
existing results beyond the episodic and discounted cases. In a more general
setting, when the feature dynamics are approximately linear and for arbitrary
rewards, we propose a new approach for estimating stationary distributions with
function approximation. We formulate this problem as finding the
maximum-entropy distribution subject to matching feature expectations under
empirical dynamics. We show that this results in an exponential-family
distribution whose sufficient statistics are the features, paralleling
maximum-entropy approaches in supervised learning. We demonstrate the
effectiveness of the proposed OPE approaches in multiple environments.
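
A minimal sketch of the max-entropy construction, assuming a small tabular empirical transition matrix rather than the paper's sampled-transition, function-approximation setting: the stationarity constraint is expressed in feature space, and gradient descent on the dual log-partition objective yields the exponential-family estimate.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(2)
S, k = 30, 6
P = rng.dirichlet(np.ones(S), size=S)        # empirical dynamics under the target policy
Phi = rng.normal(size=(S, k))                # state features (illustrative)

# Sufficient statistics: the feature "flow" gap f(s) = phi(s) - E[phi(s') | s].
# Matching E_d[f] = 0 is the stationarity constraint in feature space.
F = Phi - P @ Phi

# Dual objective: log Z(lam) = log sum_s exp(lam . f(s)); its gradient equals
# E_{d_lam}[f], so gradient descent drives the feature-stationarity gap to zero.
lam = np.zeros(k)
for _ in range(5000):
    d = softmax(F @ lam)                     # exponential family in the features
    lam -= 0.1 * (d @ F)                     # gradient step on log Z

d_hat = softmax(F @ lam)

# Compare with the true stationary distribution of P (left eigenvector at 1).
evals, evecs = np.linalg.eig(P.T)
d_true = np.real(evecs[:, np.argmax(np.real(evals))])
d_true = d_true / d_true.sum()
print("feature gap:", float(np.abs(d_hat @ F).max()),
      "L1 error:", float(np.abs(d_hat - d_true).sum()))
```

With one-hot features the constraints pin down the stationary distribution exactly; with fewer features the estimate is the maximum-entropy member of the constraint set, which is the estimator the abstract describes.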
Efficient Planning in Large MDPs with Weak Linear Function Approximation
Large-scale Markov decision processes (MDPs) require planning algorithms with
runtime independent of the number of states of the MDP. We consider the
planning problem in MDPs using linear value function approximation with only
weak requirements: low approximation error for the optimal value function, and
a small set of "core" states whose features span those of other states. In
particular, we make no assumptions about the representability of policies or
value functions of non-optimal policies. Our algorithm produces almost-optimal
actions for any state using a generative oracle (simulator) for the MDP, while
its computation time scales polynomially with the number of features, core
states, and actions, and with the effective horizon.
Comment: 12 pages and appendix (10 pages). Submitted to the 34th Conference on
Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
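
A bare approximate-value-iteration sketch of the core-state idea, assuming a generative oracle and a feature construction that satisfies the span condition by design (core features are the identity; every other state's features are a convex combination of them). The paper's actual algorithm and guarantees involve more careful error handling, and all sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, k, gamma = 100, 4, 8, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # hidden dynamics behind the oracle
R = rng.random((S, A))                       # per-step rewards
core = np.arange(k)                          # core states: identity features; all
                                             # other features lie in their convex hull
Phi = np.vstack([np.eye(k), rng.dirichlet(np.ones(k), size=S - k)])

def sample_next(s, a, n=30):                 # generative oracle (simulator)
    return rng.choice(S, size=n, p=P[s, a])

# Approximate value iteration carried out on the core states only; the value of
# any sampled next state is read off through its features, V(s') = phi(s')^T w,
# so the per-iteration work is polynomial in core states, actions, and features.
w = np.zeros(k)
for _ in range(60):
    q = np.array([[R[s, a] + gamma * np.mean(Phi[sample_next(s, a)] @ w)
                   for a in range(A)] for s in core])
    v_core = q.max(axis=1)
    w, *_ = np.linalg.lstsq(Phi[core], v_core, rcond=None)  # exact fit: Phi[core] = I

def act(s):                                  # near-greedy action at an arbitrary state
    backups = [R[s, a] + gamma * np.mean(Phi[sample_next(s, a)] @ w)
               for a in range(A)]
    return int(np.argmax(backups))

print("value weights:", np.round(w, 2), " action at state 50:", act(50))
```

The convex-hull structure makes each backup a $\gamma$-contraction on the core values, which is the stabilizing role the "core" states play in the abstract's setting.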