Policy Gradient for Coherent Risk Measures
Several authors have recently developed risk-sensitive policy gradient
methods that augment the standard expected cost minimization problem with a
measure of variability in cost. These studies have focused on specific
risk-measures, such as the variance or conditional value at risk (CVaR). In
this work, we extend the policy gradient method to the whole class of coherent
risk measures, which is widely accepted in finance and operations research,
among other fields. We consider both static and time-consistent dynamic risk
measures. For static risk measures, our approach is in the spirit of policy
gradient algorithms and combines a standard sampling approach with convex
programming. For dynamic risk measures, our approach is actor-critic style and
involves explicit approximation of the value function. Most importantly, our
contribution presents a unified approach to risk-sensitive reinforcement
learning that generalizes and extends previous results.
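Illustrative sketch (not taken from the paper): CVaR, one of the risk measures named in the abstract above and a standard example of a coherent risk measure, can be estimated from sampled costs by averaging the worst tail. The function name `empirical_cvar` and the convention that `alpha` is the confidence level (so the tail has mass 1 - alpha) are assumptions made for this example.

```python
import math

def empirical_cvar(costs, alpha):
    """Average of the worst (1 - alpha) fraction of the sampled costs.

    Hypothetical helper for illustration; `alpha` is assumed to be the
    confidence level, so alpha = 0 reduces to the plain sample mean.
    """
    ordered = sorted(costs)
    # Number of tail samples; keep at least one so the estimate is defined.
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)

sample_costs = [1.0, 2.0, 3.0, 10.0]
print(empirical_cvar(sample_costs, alpha=0.5))  # mean of the two largest costs
print(empirical_cvar(sample_costs, alpha=0.0))  # ordinary sample mean
```

A risk-sensitive policy gradient method would differentiate an estimate of this kind with respect to the policy parameters; that step is omitted here.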
Actor-Critic Policy Optimization in Partially Observable Multiagent Environments
Optimization of parameterized policies for reinforcement learning (RL) is an
important and challenging problem in artificial intelligence. Among the most
common approaches are algorithms based on gradient ascent of a score function
representing discounted return. In this paper, we examine the role of these
policy gradient and actor-critic algorithms in partially-observable multiagent
environments. We show several candidate policy update rules and relate them to
a foundation of regret minimization and multiagent learning techniques for the
one-shot and tabular cases, leading to previously unknown convergence
guarantees. We apply our method to model-free multiagent reinforcement learning
in adversarial sequential decision problems (zero-sum imperfect information
games), using RL-style function approximation. We evaluate on commonly used
benchmark Poker domains, showing performance against fixed policies and
empirical convergence to approximate Nash equilibria in self-play with rates
similar to or better than a baseline model-free algorithm for zero-sum games,
without any domain-specific state space reductions.
Comment: NeurIPS 201
DSAC: Distributional Soft Actor Critic for Risk-Sensitive Reinforcement Learning
In this paper, we present a new reinforcement learning (RL) algorithm called
Distributional Soft Actor Critic (DSAC), which exploits the distributional
information of accumulated rewards to achieve better performance. Seamlessly
integrating SAC (which uses entropy to encourage exploration) with a principled
distributional view of the underlying objective, DSAC takes into consideration
the randomness in both action and rewards, and beats the state-of-the-art
baselines in several continuous control benchmarks. Moreover, with the
distributional information of rewards, we propose a unified framework for
risk-sensitive learning, one that goes beyond maximizing only expected
accumulated rewards. Under this framework we discuss three specific
risk-related metrics: percentile, mean-variance and distorted expectation. Our
extensive experiments demonstrate that with distribution modeling in RL, the
agent performs better for both risk-averse and risk-seeking control tasks.
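Illustrative sketch (not the DSAC implementation): given equally weighted samples of the return distribution, the three kinds of risk-related metric named above can be computed roughly as follows. The exact formulas, the function name `risk_metrics`, the parameters `alpha` and `beta`, and the choice of a CVaR-style distortion function are assumptions made for this example; the abstract does not specify them.

```python
import math

def risk_metrics(returns, alpha=0.5, beta=0.5):
    """Three example risk metrics from equally weighted return samples.

    All definitions here are assumed, common forms, not the paper's.
    """
    z = sorted(returns)
    n = len(z)
    mean = sum(z) / n
    variance = sum((x - mean) ** 2 for x in z) / n

    # Percentile: the alpha-quantile, taken as an order statistic.
    percentile = z[max(0, math.ceil(alpha * n) - 1)]

    # Mean-variance: penalize the mean by beta times the variance.
    mean_variance = mean - beta * variance

    # Distorted expectation: reweight sorted samples by increments of a
    # distortion function g on [0, 1]. This g puts all weight on the
    # lowest alpha fraction of returns (a risk-averse choice).
    def g(t):
        return min(t / alpha, 1.0)

    distorted = sum((g((i + 1) / n) - g(i / n)) * x for i, x in enumerate(z))

    return {
        "percentile": percentile,
        "mean_variance": mean_variance,
        "distorted_expectation": distorted,
    }

print(risk_metrics([0.0, 1.0, 2.0, 3.0]))
```

Choosing a distortion function that instead emphasizes the upper quantiles would give a risk-seeking objective.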
Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces
In this paper, we set forth a new vision of reinforcement learning developed
by us over the past few years, one that yields mathematically rigorous
solutions to longstanding important questions that have remained unresolved:
(i) how to design reliable, convergent, and robust reinforcement learning
algorithms; (ii) how to guarantee that reinforcement learning satisfies
pre-specified "safety" guarantees, and remains in a stable region of the
parameter space; (iii) how to design "off-policy" temporal difference learning
algorithms in a reliable and stable manner; and finally (iv) how to integrate
the study of reinforcement learning into the rich theory of stochastic
optimization. In this paper, we provide detailed answers to all these questions
using the powerful framework of proximal operators.
The key idea that emerges is the use of primal dual spaces connected through
the use of a Legendre transform. This allows temporal difference updates to
occur in dual spaces, which brings a variety of technical advantages. The
Legendre transform elegantly generalizes past algorithms for solving
reinforcement learning problems, such as natural gradient methods, which we
show relate closely to the previously unconnected framework of mirror descent
methods. Equally importantly, proximal operator theory enables the systematic
development of operator splitting methods that show how to safely and reliably
decompose complex products of gradients that occur in recent variants of
gradient-based temporal difference learning. This key technical innovation
makes it possible to finally design "true" stochastic gradient methods for
reinforcement learning. Finally, Legendre transforms enable a variety of other
benefits, including modeling sparsity and domain geometry. Our work builds
extensively on recent work on the convergence of saddle-point algorithms, and
on the theory of monotone operators.
Comment: 121 page
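Illustrative sketch (not from the paper): the simplest example of a proximal operator that models sparsity is soft-thresholding, the proximal operator of the scaled L1 norm. The function name `soft_threshold` and the parameter name `lam` are chosen for this example only.

```python
import math

def soft_threshold(values, lam):
    """Proximal operator of lam * ||x||_1, applied elementwise.

    Each entry is shrunk toward zero by lam and set to zero if its
    magnitude is at most lam, which is what produces sparse solutions.
    """
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in values]

weights = [3.0, -0.5, -2.0]
print(soft_threshold(weights, lam=1.0))  # small entries are zeroed out
```

In a proximal gradient method this step would follow an ordinary gradient step on the smooth part of the objective.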
An Actor-Critic Algorithm for Sequence Prediction
We present an approach to training neural networks to generate sequences
using actor-critic methods from reinforcement learning (RL). Current
log-likelihood training methods are limited by the discrepancy between their
training and testing modes, as models must generate tokens conditioned on their
previous guesses rather than the ground-truth tokens. We address this problem
by introducing a critic network that is trained to predict the value
of an output token, given the policy of an actor network. This results
in a training procedure that is much closer to the test phase, and allows us to
directly optimize for a task-specific score such as BLEU. Crucially, since we
leverage these techniques in the supervised learning setting rather than the
traditional RL setting, we condition the critic network on the ground-truth
output. We show that our method leads to improved performance on both a
synthetic task and German-English machine translation. Our analysis paves
the way for such methods to be applied in natural language generation tasks,
such as machine translation, caption generation, and dialogue modelling.
Reinforcement Learning
Reinforcement learning (RL) is a general framework for adaptive control,
which has proven to be efficient in many domains, e.g., board games, video
games or autonomous vehicles. In such problems, an agent faces a sequential
decision-making problem where, at every time step, it observes its state,
performs an action, receives a reward and moves to a new state. An RL agent
learns by trial and error a good policy (or controller) based on observations
and numeric reward feedback on the previously performed action. In this
chapter, we present the basic framework of RL and recall the two main families
of approaches that have been developed to learn a good policy. The first one,
which is value-based, consists in estimating the value of an optimal policy,
from which a policy can then be recovered, while the other, called policy
search, works directly in a policy space. Actor-critic methods can be seen as a
policy search technique where the policy value that is learned guides the
policy improvement. Besides, we give an overview of some extensions of the
standard RL framework, notably when risk-averse behavior needs to be taken into
account or when rewards are not available or not known.
Comment: Chapter in "A Guided Tour of Artificial Intelligence Research",
Springe
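Illustrative sketch (not from the chapter): the value-based family described above can be illustrated with the tabular Q-learning update, followed by recovering a greedy policy from the learned values. The function names, the learning rate `lr`, the discount `gamma`, and the toy two-state table are assumptions made for this example.

```python
def q_learning_update(q, state, action, reward, next_state, lr=0.5, gamma=0.9):
    """One tabular Q-learning step on a list-of-lists table q[state][action]."""
    # Bootstrapped target: observed reward plus discounted best next value.
    target = reward + gamma * max(q[next_state])
    q[state][action] += lr * (target - q[state][action])
    return q

def greedy_policy(q):
    """Recover a policy from the values: best action index in each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

# Toy table with 2 states and 2 actions, initialized to zero.
q_table = [[0.0, 0.0], [0.0, 0.0]]
q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=1)
print(q_table)
print(greedy_policy(q_table))
```

A policy search method would instead adjust the parameters of the policy directly, without going through a value table.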
Regret Minimization for Partially Observable Deep Reinforcement Learning
Deep reinforcement learning algorithms that estimate state and state-action
value functions have been shown to be effective in a variety of challenging
domains, including learning control strategies from raw image pixels. However,
algorithms that estimate state and state-action value functions typically
assume a fully observed state and must compensate for partial observations by
using finite length observation histories or recurrent networks. In this work,
we propose a new deep reinforcement learning algorithm based on counterfactual
regret minimization that iteratively updates an approximation to an
advantage-like function and is robust to partially observed state. We
demonstrate that this new algorithm can substantially outperform strong
baseline methods on several partially observed reinforcement learning tasks:
learning first-person 3D navigation in Doom and Minecraft, and acting in the
presence of partially observed objects in Doom and Pong.
Comment: ICML 201
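Illustrative sketch (not the paper's deep algorithm): counterfactual regret minimization is built on regret matching, which plays each action with probability proportional to its positive cumulative regret. The function name `regret_matching` and the example regret values are assumptions made for this example.

```python
def regret_matching(cumulative_regret):
    """Map cumulative regrets to a strategy (probabilities over actions).

    Actions with positive regret get probability proportional to that
    regret; if no action has positive regret, fall back to uniform.
    """
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    n = len(cumulative_regret)
    if total > 0.0:
        return [p / total for p in positive]
    return [1.0 / n] * n

print(regret_matching([2.0, -1.0, 2.0]))  # weight only on actions 0 and 2
print(regret_matching([-1.0, -2.0]))      # no positive regret: uniform
```

The abstract above describes replacing the tabular regret values with a learned approximation of an advantage-like function; that part is not shown here.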
A Survey and Critique of Multiagent Deep Reinforcement Learning
Deep reinforcement learning (RL) has achieved outstanding results in recent
years. This has led to a dramatic increase in the number of applications and
methods. Recent works have explored learning beyond single-agent scenarios and
have considered multiagent learning (MAL) scenarios. Initial results report
successes in complex multiagent domains, although there are several challenges
to be addressed. The primary goal of this article is to provide a clear
overview of current multiagent deep reinforcement learning (MDRL) literature.
Additionally, we complement the overview with a broader analysis: (i) we
revisit previous key components, originally presented in MAL and RL, and
highlight how they have been adapted to multiagent deep reinforcement learning
settings. (ii) We provide general guidelines to new practitioners in the area:
describing lessons learned from MDRL works, pointing to recent benchmarks, and
outlining open avenues of research. (iii) We take a more critical tone, raising
practical challenges of MDRL (e.g., implementation and computational demands).
We expect this article will help unify and motivate future research to take
advantage of the abundant literature that exists (e.g., RL and MAL) in a joint
effort to promote fruitful research in the multiagent community.
Comment: Under review since Oct 2018. Earlier versions of this work had the
title: "Is multiagent deep reinforcement learning the answer or the question?
A brief survey
Bridging the Gap Between Value and Policy Based Reinforcement Learning
We establish a new connection between value and policy based reinforcement
learning (RL) based on a relationship between softmax temporal value
consistency and policy optimality under entropy regularization. Specifically,
we show that softmax consistent action values correspond to optimal entropy
regularized policy probabilities along any action sequence, regardless of
provenance. From this observation, we develop a new RL algorithm, Path
Consistency Learning (PCL), that minimizes a notion of soft consistency error
along multi-step action sequences extracted from both on- and off-policy
traces. We examine the behavior of PCL in different scenarios and show that PCL
can be interpreted as generalizing both actor-critic and Q-learning algorithms.
We subsequently deepen the relationship by showing how a single model can be
used to represent both a policy and the corresponding softmax state values,
eliminating the need for a separate critic. The experimental evaluation
demonstrates that PCL significantly outperforms strong actor-critic and
Q-learning baselines across several benchmarks.
Comment: NIPS 201
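Illustrative sketch (not the PCL algorithm itself): in a single state, the link between softmax values and entropy-regularized policies can be checked numerically. With temperature `tau`, the soft value is V = tau * log sum_a exp(Q(a) / tau), the optimal regularized policy is pi(a) = exp((Q(a) - V) / tau), and so Q(a) - tau * log pi(a) = V for every action. The function name and the example action values are assumptions made for this example.

```python
import math

def soft_value_and_policy(q_values, tau):
    """Softmax (log-sum-exp) state value and the matching softmax policy."""
    m = max(q_values)  # subtract the max for numerical stability
    v = m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))
    policy = [math.exp((q - v) / tau) for q in q_values]
    return v, policy

q_values = [1.0, 2.0, 0.5]
tau = 0.5
v, policy = soft_value_and_policy(q_values, tau)

# Consistency check: the same value V is recovered from every action.
for q, p in zip(q_values, policy):
    print(q - tau * math.log(p), v)
```

PCL, as described above, minimizes a multi-step version of this consistency error along sampled action sequences.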
Learning to Evolve
Evolution and learning are two of the fundamental mechanisms by which life
adapts in order to survive and to transcend limitations. These biological
phenomena inspired successful computational methods such as evolutionary
algorithms and deep learning. Evolution relies on random mutations and on
random genetic recombination. Here we show that learning to evolve, i.e.
learning to mutate and recombine better than at random, improves the result of
evolution in terms of fitness increase per generation and even in terms of
attainable fitness. We use deep reinforcement learning to learn to dynamically
adjust the strategy of evolutionary algorithms to varying circumstances. Our
methods outperform classical evolutionary algorithms on combinatorial and
continuous optimization problems.