1,498 research outputs found
Q-learning with Nearest Neighbors
We consider model-free reinforcement learning for infinite-horizon discounted
Markov Decision Processes (MDPs) with a continuous state space and unknown
transition kernel, when only a single sample path under an arbitrary policy of
the system is available. We consider the Nearest Neighbor Q-Learning (NNQL)
algorithm to learn the optimal Q function using nearest neighbor regression
method. As the main contribution, we provide tight finite sample analysis of
the convergence rate. In particular, for MDPs with a -dimensional state
space and the discounted factor , given an arbitrary sample
path with "covering time" , we establish that the algorithm is guaranteed
to output an -accurate estimate of the optimal Q-function using
samples. For instance, for a
well-behaved MDP, the covering time of the sample path under the purely random
policy scales as so the sample
complexity scales as Indeed, we
establish a lower bound that argues that the dependence of is necessary.Comment: Accepted to NIPS 201
Solving Factored MDPs with Hybrid State and Action Variables
Efficient representations and solutions for large decision problems with
continuous and discrete variables are among the most important challenges faced
by the designers of automated decision support systems. In this paper, we
describe a novel hybrid factored Markov decision process (MDP) model that
allows for a compact representation of these problems, and a new hybrid
approximate linear programming (HALP) framework that permits their efficient
solutions. The central idea of HALP is to approximate the optimal value
function by a linear combination of basis functions and optimize its weights by
linear programming. We analyze both theoretical and computational aspects of
this approach, and demonstrate its scale-up potential on several hybrid
optimization problems
Multi-Objective Model Checking of Markov Decision Processes
We study and provide efficient algorithms for multi-objective model checking
problems for Markov Decision Processes (MDPs). Given an MDP, M, and given
multiple linear-time (\omega -regular or LTL) properties \varphi\_i, and
probabilities r\_i \epsilon [0,1], i=1,...,k, we ask whether there exists a
strategy \sigma for the controller such that, for all i, the probability that a
trajectory of M controlled by \sigma satisfies \varphi\_i is at least r\_i. We
provide an algorithm that decides whether there exists such a strategy and if
so produces it, and which runs in time polynomial in the size of the MDP. Such
a strategy may require the use of both randomization and memory. We also
consider more general multi-objective \omega -regular queries, which we
motivate with an application to assume-guarantee compositional reasoning for
probabilistic systems.
Note that there can be trade-offs between different properties: satisfying
property \varphi\_1 with high probability may necessitate satisfying \varphi\_2
with low probability. Viewing this as a multi-objective optimization problem,
we want information about the "trade-off curve" or Pareto curve for maximizing
the probabilities of different properties. We show that one can compute an
approximate Pareto curve with respect to a set of \omega -regular properties in
time polynomial in the size of the MDP.
Our quantitative upper bounds use LP methods. We also study qualitative
multi-objective model checking problems, and we show that these can be analysed
by purely graph-theoretic methods, even though the strategies may still require
both randomization and memory.Comment: 21 pages, 2 figure
- …