Monte Carlo Matrix Inversion Policy Evaluation
Forsythe and Leibler (1950) introduced a statistical technique for
finding the inverse of a matrix by characterizing the elements of the matrix
inverse as expected values of a sequence of random walks. Barto and Duff (1994)
subsequently showed relations between this technique and standard dynamic
programming and temporal differencing methods. The advantage of the Monte Carlo
matrix inversion (MCMI) approach is that it scales better with respect to
state-space size than alternative techniques. In this paper, we introduce an
algorithm for performing reinforcement learning policy evaluation using MCMI.
We demonstrate that MCMI improves on runtime over a maximum likelihood
model-based policy evaluation approach and on both runtime and accuracy over
the temporal differencing (TD) policy evaluation approach. We further improve
on MCMI policy evaluation by adding an importance sampling technique to our
algorithm to reduce the variance of our estimator. Lastly, we illustrate
techniques for scaling up MCMI to large state spaces in order to perform policy
improvement.
Comment: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003).
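To make the characterization concrete, the following is a minimal sketch (an illustration under assumed names, not the authors' implementation) of estimating V = (I - gamma*P)^{-1} r for a fixed policy by averaging discounted reward sums along random walks drawn from the policy's transition matrix, i.e. a truncated Neumann-series estimate of the matrix inverse:

import numpy as np

def mcmi_policy_evaluation(P, r, gamma=0.9, walks_per_state=500, max_len=200,
                           seed=None):
    """Estimate V = (I - gamma * P)^{-1} r by simulating random walks."""
    rng = np.random.default_rng(seed)
    n = len(r)
    V = np.zeros(n)
    for s in range(n):
        total = 0.0
        for _ in range(walks_per_state):
            state, discount, ret = s, 1.0, 0.0
            for _ in range(max_len):              # truncate the Neumann series
                ret += discount * r[state]
                discount *= gamma
                state = rng.choice(n, p=P[state])
            total += ret
        V[s] = total / walks_per_state
    return V

if __name__ == "__main__":
    # Tiny 3-state chain: the Monte Carlo estimate should approach the direct solve.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])
    r = np.array([0.0, 1.0, 2.0])
    print(mcmi_policy_evaluation(P, r, seed=0))
    print(np.linalg.solve(np.eye(3) - 0.9 * P, r))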
Self Training Autonomous Driving Agent
Intrinsically, driving is a Markov Decision Process, which makes it well suited to the reinforcement learning paradigm. In this paper, we propose a novel agent which learns to drive a vehicle without any human assistance. We use the concepts of reinforcement learning and evolutionary strategies to train our agent in a 2D simulation environment. Our model's architecture goes beyond the World Models architecture by introducing difference images into the autoencoder. This novel use of difference images gives a better representation of the latent space with respect to the motion of the vehicle and helps the autonomous agent learn how to drive more efficiently. Results show that our method requires 96% fewer total agents, 87.5% fewer agents per generation, 70% fewer generations, and 90% fewer rollouts than the original architecture while achieving the same accuracy.
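As a concrete illustration of the difference-image idea (frame shapes, scaling, and function names below are assumptions, not the paper's code), the encoder input can be formed by stacking the current frame with its pixel-wise difference from the previous frame, so motion is explicit in what the autoencoder must compress:

import numpy as np

def encoder_input(prev_frame, curr_frame):
    """Stack the current grayscale frame with its temporal difference."""
    diff = curr_frame.astype(np.float32) - prev_frame.astype(np.float32)
    diff = (diff + 255.0) / 510.0                # map the difference into [0, 1]
    curr = curr_frame.astype(np.float32) / 255.0
    return np.stack([curr, diff], axis=0)        # (channels, height, width)

if __name__ == "__main__":
    prev = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    curr = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    print(encoder_input(prev, curr).shape)       # (2, 64, 64)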
A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning
One of the main obstacles to broad application of reinforcement learning
methods is the parameter sensitivity of our core learning algorithms. In many
large-scale applications, online computation and function approximation
represent key strategies in scaling up reinforcement learning algorithms. In
this setting, we have effective and reasonably well understood algorithms for
adapting the learning-rate parameter, online during learning. Such
meta-learning approaches can improve robustness of learning and enable
specialization to the current task, improving learning speed. For temporal-difference learning algorithms, which we study here, there is yet another parameter, λ, that similarly impacts learning speed and stability in practice. Unfortunately, unlike the learning-rate parameter, λ parametrizes the objective function that temporal-difference methods optimize. Different choices of λ produce different fixed-point solutions, and thus adapting λ online and characterizing the optimization is substantially more complex than adapting the learning-rate parameter. There is no meta-learning method for λ that can achieve (1) incremental updating, (2) compatibility with function approximation, and (3) stability of learning under both on- and off-policy sampling. In this paper we contribute a novel objective function for optimizing λ as a function of state rather than time. We derive a new incremental, linear-complexity λ-adaptation algorithm that does not require offline batch updating or access to a model of the world, and present a suite of experiments illustrating the practicality of our new algorithm in three different settings. Taken together, our contributions represent a concrete step towards black-box application of temporal-difference learning methods in real-world problems.
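The following sketch shows what "λ as a function of state rather than time" means operationally for linear TD(λ) with accumulating traces; the feature map, step size, and the convention of decaying the trace with the current state's λ are illustrative assumptions, not the paper's algorithm:

import numpy as np

def td_lambda_state_dependent(episodes, phi, lambda_fn, num_features,
                              alpha=0.05, gamma=0.99):
    """episodes: iterable of episodes, each a list of (s, r, s_next, done)."""
    w = np.zeros(num_features)
    for episode in episodes:
        z = np.zeros(num_features)                 # eligibility trace
        for s, r, s_next, done in episode:
            x, x_next = phi(s), phi(s_next)
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x     # TD error
            z = gamma * lambda_fn(s) * z + x       # state-dependent trace decay
            w += alpha * delta * z
    return w

if __name__ == "__main__":
    # 4-state chain, one-hot features, and a (here constant) lambda schedule.
    phi = lambda s: np.eye(4)[s]
    episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]
    print(td_lambda_state_dependent([episode] * 300, phi, lambda s: 0.8, 4))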
Non-stationary Stochastic Optimization under L_{p,q}-Variation Measures
We consider a non-stationary sequential stochastic optimization problem, in
which the underlying cost functions change over time under a variation budget
constraint. We propose an L_{p,q}-variation functional to quantify the
change, which yields less variation for dynamic function sequences whose
changes are constrained to short time periods or small subsets of input domain.
Under the L_{p,q}-variation constraint, we derive both upper and matching
lower regret bounds for smooth and strongly convex function sequences, which
generalize previous results in Besbes et al. (2015). Furthermore, we provide an
upper bound for general convex function sequences with noisy gradient feedback,
which matches the optimal rate as p → ∞. Our results reveal some
surprising phenomena under this general variation functional, such as the curse
of dimensionality of the function domain. The key technical novelties in our
analysis include affinity lemmas that characterize the distance of the
minimizers of two convex functions with bounded L_p difference, and a cubic spline-based construction that attains matching lower bounds.
Comment: 38 pages, 3 figures. Revised version.
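For intuition only, one plausible discretization of such a variation functional takes a q-norm of each per-step change over a grid of the domain and a p-norm of those changes over time; the exact functional and normalizations in the paper may differ, so treat this as an assumption-laden sketch:

import numpy as np

def lpq_variation(function_values, p=2, q=2):
    """function_values: array of shape (T, n_grid) with f_t sampled on a grid."""
    step_changes = np.diff(function_values, axis=0)            # f_{t+1} - f_t
    per_step = np.linalg.norm(step_changes, ord=q, axis=1)     # q-norm in space
    return np.linalg.norm(per_step, ord=p)                     # p-norm in time

if __name__ == "__main__":
    grid = np.linspace(0.0, 1.0, 200)
    # A slowly drifting quadratic: each step changes little, so the sequence
    # has low variation under this measure.
    fs = np.array([(grid - 0.5 - 0.001 * t) ** 2 for t in range(50)])
    print(lpq_variation(fs, p=1, q=2))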
Context-Dependent Upper-Confidence Bounds for Directed Exploration
Directed exploration strategies for reinforcement learning are critical for
learning an optimal policy in a minimal number of interactions with the
environment. Many algorithms use optimism to direct exploration, either through
visitation estimates or upper confidence bounds, as opposed to data-inefficient
strategies like ε-greedy that use random, undirected exploration. Most
data-efficient exploration methods require significant computation, typically
relying on a learned model to guide exploration. Least-squares methods have the
potential to provide some of the data-efficiency benefits of model-based
approaches -- because they summarize past interactions -- with the computation
closer to that of model-free approaches. In this work, we provide a novel,
computationally efficient, incremental exploration strategy, leveraging this
property of least-squares temporal difference learning (LSTD). We derive upper
confidence bounds on the action-values learned by LSTD, with context-dependent
(or state-dependent) noise variance. Such context-dependent noise focuses
exploration on a subset of variable states, and allows for reduced exploration
in other states. We empirically demonstrate that our algorithm can converge
more quickly than other incremental exploration strategies using confidence
estimates on action-values.
Comment: Neural Information Processing Systems 2018.
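The following is a simplified, hedged sketch of the general idea: maintain least-squares estimates of action-values and act greedily with respect to an upper confidence bound whose width comes from the inverse feature-covariance matrix. The paper derives its bounds for LSTD with context-dependent noise variance; the LinUCB-style bonus and all names below are assumptions:

import numpy as np

class LeastSquaresUCBAgent:
    def __init__(self, num_features, num_actions, beta=1.0, reg=1.0):
        self.beta = beta
        self.A_inv = [np.eye(num_features) / reg for _ in range(num_actions)]
        self.b = [np.zeros(num_features) for _ in range(num_actions)]

    def act(self, x):
        """Pick the action with the largest upper-confidence action-value."""
        scores = []
        for A_inv, b in zip(self.A_inv, self.b):
            mean = (A_inv @ b) @ x
            width = self.beta * np.sqrt(x @ A_inv @ x)   # confidence width
            scores.append(mean + width)
        return int(np.argmax(scores))

    def update(self, x, action, target):
        """Sherman-Morrison rank-one update of the inverse covariance."""
        A_inv = self.A_inv[action]
        Ax = A_inv @ x
        self.A_inv[action] = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b[action] += target * x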
Finite Sample Analyses for TD(0) with Function Approximation
TD(0) is one of the most commonly used algorithms in reinforcement learning.
Despite this, there is no existing finite sample analysis for TD(0) with
function approximation, even for the linear case. Our work is the first to
provide such results. Existing convergence rates for Temporal Difference (TD)
methods apply only to somewhat modified versions, e.g., projected variants or
ones where stepsizes depend on unknown problem parameters. Our analyses obviate
these artificial alterations by exploiting strong properties of TD(0). We
provide convergence rates both in expectation and with high-probability. The
two are obtained via different approaches that use relatively unknown, recently
developed stochastic approximation techniques.
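For reference, this is the unmodified algorithm such analyses target: plain linear TD(0) with no projection step and a step size that does not depend on unknown problem parameters (the environment interface and feature map below are illustrative assumptions):

import numpy as np

def linear_td0(env_step, phi, num_features, start_state=0,
               num_steps=10_000, alpha=0.01, gamma=0.99, seed=None):
    """env_step(state, rng) -> (reward, next_state, done)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(num_features)
    s = start_state
    for _ in range(num_steps):
        r, s_next, done = env_step(s, rng)
        v_next = 0.0 if done else w @ phi(s_next)
        w += alpha * (r + gamma * v_next - w @ phi(s)) * phi(s)   # TD(0) update
        s = start_state if done else s_next
    return w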
Safe and Efficient Off-Policy Reinforcement Learning
In this work, we take a fresh look at some old and new algorithms for
off-policy, return-based reinforcement learning. Expressing these in a common
form, we derive a novel algorithm, Retrace(λ), with three desired
properties: (1) it has low variance; (2) it safely uses samples collected from
any behaviour policy, whatever its degree of "off-policyness"; and (3) it is
efficient as it makes the best use of samples collected from near on-policy
behaviour policies. We analyze the contractive nature of the related operator
under both off-policy policy evaluation and control settings and derive online
sample-based algorithms. We believe this is the first return-based off-policy
control algorithm converging a.s. to Q* without the GLIE assumption (Greedy
in the Limit with Infinite Exploration). As a corollary, we prove the
convergence of Watkins' Q(λ), which had been an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.
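To spell out the operator in the tabular case, the sketch below computes the Retrace(λ) correction for the first state-action pair of one off-policy trajectory, using the truncated per-step coefficients c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)); variable names and the tabular simplification are assumptions:

import numpy as np

def retrace_correction(Q, trajectory, pi, mu, lam=0.9, gamma=0.99):
    """Retrace correction for the first (state, action) of a trajectory.

    trajectory: list of (state, action, reward, next_state) tuples.
    pi, mu: arrays of shape (num_states, num_actions) with policy probabilities.
    """
    coeff, correction = 1.0, 0.0
    for t, (x, a, r, x_next) in enumerate(trajectory):
        if t > 0:
            # Truncated importance weight: keeps variance low while staying
            # safe for arbitrary behaviour policies.
            coeff *= gamma * lam * min(1.0, pi[x, a] / mu[x, a])
        expected_q = pi[x_next] @ Q[x_next]          # E_{a' ~ pi} Q(x', a')
        delta = r + gamma * expected_q - Q[x, a]     # off-policy TD error
        correction += coeff * delta
    return correction

# Usage (tabular): Q[x0, a0] += step_size * retrace_correction(Q, traj, pi, mu)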
Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version
This work tackles the problem of robust zero-shot planning in non-stationary
stochastic environments. We study Markov Decision Processes (MDPs) evolving
over time and consider Model-Based Reinforcement Learning algorithms in this
setting. We make two hypotheses: 1) the environment evolves continuously with a
bounded evolution rate; 2) a current model is known at each decision epoch but
not its evolution. Our contribution can be presented in four points. 1) we
define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We
introduce the notion of regular evolution by making a hypothesis of Lipschitz continuity of the transition and reward functions w.r.t. time; 2) we
consider a planning agent using the current model of the environment but
unaware of its future evolution. This leads us to consider a worst-case method
where the environment is seen as an adversarial agent; 3) following this
approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot
Model-Based method similar to Minimax search; 4) we illustrate the benefits
brought by RATS empirically and compare its performance with reference
Model-Based algorithms.
Comment: Published at NeurIPS 2019, 17 pages, 3 figures.
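A heavily simplified, hedged sketch of the worst-case planning step: a depth-limited search in which the agent maximizes over actions while an adversary picks, at every node, the model that is worst for the agent from an admissible set (standing in for the Lipschitz ball around the current snapshot). Deterministic transitions and all names are assumptions, not the published RATS algorithm:

def worst_case_value(state, depth, actions, candidate_models, gamma=0.95):
    """candidate_models: list of (step_fn, reward_fn) pairs, where
    step_fn(state, action) -> next_state and reward_fn(state, action) -> float.
    """
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        # The adversary evaluates the action under every admissible model
        # and keeps the worst outcome for the agent.
        worst = float("inf")
        for step_fn, reward_fn in candidate_models:
            v = reward_fn(state, a) + gamma * worst_case_value(
                step_fn(state, a), depth - 1, actions, candidate_models, gamma)
            worst = min(worst, v)
        best = max(best, worst)
    return best

def worst_case_action(state, depth, actions, candidate_models, gamma=0.95):
    """Act greedily with respect to the worst-case depth-limited value."""
    return max(actions, key=lambda a: min(
        reward_fn(state, a) + gamma * worst_case_value(
            step_fn(state, a), depth - 1, actions, candidate_models, gamma)
        for step_fn, reward_fn in candidate_models))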
Dyna-H: a heuristic planning reinforcement learning algorithm applied to role-playing-game strategy decision systems
In a Role-Playing Game, finding optimal trajectories is one of the most
important tasks. In fact, the strategy decision system becomes a key component
of a game engine. Determining the way in which decisions are taken (online,
batch or simulated) and the consumed resources in decision making (e.g.
execution time, memory) will influence, to a major degree, the game performance.
When classical search algorithms such as A* can be used, they are the very
first option. Nevertheless, such methods rely on precise and complete models of
the search space, and there are many interesting scenarios where their
application is not possible. Then, model free methods for sequential decision
making under uncertainty are the best choice. In this paper, we propose a
heuristic planning strategy to incorporate the ability of heuristic-search in
path-finding into a Dyna agent. The proposed Dyna-H algorithm, as A* does, selects the branches that are most likely to produce good outcomes ahead of other branches. In addition, it has the advantage of being a model-free online reinforcement learning algorithm. The proposal was evaluated against the one-step Q-Learning and Dyna-Q algorithms, obtaining excellent experimental results: Dyna-H significantly outperforms both methods in all experiments. We also suggest a functional analogy between the proposed heuristic of sampling from the worst trajectories and the role of dreams (e.g. nightmares) in human behavior.
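In the spirit of the proposal (though not its published pseudocode), the sketch below shows a Dyna-style planning step whose simulated updates are biased by a heuristic, e.g. Euclidean distance to the goal: among a few remembered transitions, it replays the one whose successor looks best under the heuristic. All names and the candidate-sampling scheme are assumptions:

import random
from collections import defaultdict

def dyna_h_planning(Q, model, heuristic, actions, n_updates=10,
                    n_candidates=5, alpha=0.1, gamma=0.95):
    """Q: defaultdict(float) keyed by (state, action).
    model: dict mapping (state, action) -> (reward, next_state)."""
    if not model:
        return
    for _ in range(n_updates):
        # Draw a few remembered transitions; replay the one whose successor
        # scores best (lowest heuristic cost, e.g. distance to the goal).
        candidates = random.sample(list(model.items()),
                                   k=min(n_candidates, len(model)))
        (s, a), (r, s_next) = min(candidates,
                                  key=lambda item: heuristic(item[1][1]))
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage sketch: Q = defaultdict(float); after each real environment step,
# store the transition in `model` and call dyna_h_planning(Q, model, h, actions).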
Contextual Bandits with Similarity Information
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence
of choices. In each round it chooses from a time-invariant set of alternatives
and receives the payoff associated with this alternative. While the case of
small strategy sets is by now well-understood, a lot of recent work has focused
on MAB problems with exponentially or infinitely large strategy sets, where one
needs to assume extra structure in order to make the problem tractable. In
particular, recent literature considered information on similarity between
arms.
We consider similarity information in the setting of "contextual bandits", a
natural extension of the basic MAB problem where before each round an algorithm
is given the "context" -- a hint about the payoffs in this round. Contextual
bandits are directly motivated by placing advertisements on webpages, one of
the crucial problems in sponsored search. A particularly simple way to
represent similarity information in the contextual bandit setting is via a
"similarity distance" between the context-arm pairs which gives an upper bound
on the difference between the respective expected payoffs.
Prior work on contextual bandits with similarity uses "uniform" partitions of
the similarity space, which is potentially wasteful. We design more efficient
algorithms that are based on adaptive partitions adjusted to "popular" contexts and "high-payoff" arms.
Comment: This is the full version of a conference paper in COLT 2011, to
appear in JMLR in 2014. A preliminary version of this manuscript (with all
the results) has been posted to arXiv in February 2011. An earlier version on
arXiv, which does not include the results in Section 6, dates back to July
2009. The present revision addresses various presentation issues pointed out
by the journal referees.
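For concreteness, here is a hedged sketch of the uniform-partition baseline the abstract contrasts against: the context space (assumed here to be the interval [0, 1]) is split into equal cells and an independent UCB1 learner runs in each cell, so similar contexts share statistics. The paper's contribution, adaptive partitions refined around popular contexts and high-payoff arms, is not shown:

import math

class UniformPartitionUCB:
    def __init__(self, num_arms, num_cells=10):
        self.num_arms, self.num_cells = num_arms, num_cells
        self.counts = [[0] * num_arms for _ in range(num_cells)]
        self.sums = [[0.0] * num_arms for _ in range(num_cells)]
        self.t = 0

    def _cell(self, context):
        return min(int(context * self.num_cells), self.num_cells - 1)

    def act(self, context):
        self.t += 1
        c = self._cell(context)
        best_arm, best_score = 0, float("-inf")
        for a in range(self.num_arms):
            if self.counts[c][a] == 0:
                return a                      # play every arm in a cell once
            mean = self.sums[c][a] / self.counts[c][a]
            bonus = math.sqrt(2.0 * math.log(self.t) / self.counts[c][a])
            if mean + bonus > best_score:
                best_arm, best_score = a, mean + bonus
        return best_arm

    def update(self, context, arm, payoff):
        c = self._cell(context)
        self.counts[c][arm] += 1
        self.sums[c][arm] += payoff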