Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach
Reinforcement learning (RL) agents have traditionally been tasked with
maximizing the value function of a Markov decision process (MDP), either in
continuous settings, with fixed discount factor $\gamma < 1$, or in episodic
settings, with $\gamma = 1$. While this has proven effective for specific tasks
with well-defined objectives (e.g., games), it has never been established that
fixed discounting is suitable for general purpose use (e.g., as a model of
human preferences). This paper characterizes rationality in sequential decision
making using a set of seven axioms and arrives at a form of discounting that
generalizes traditional fixed discounting. In particular, our framework admits
a state-action dependent "discount" factor that is not constrained to be less
than 1, so long as there is eventual long run discounting. Although this
broadens the range of possible preference structures in continuous settings, we
show that there exists a unique "optimizing MDP" with fixed $\gamma < 1$ whose
optimal value function matches the true utility of the optimal policy, and we
quantify the difference between value and utility for suboptimal policies. Our
work can be seen as providing a normative justification for (a slight
generalization of) Martha White's RL task formalism (2017) and other recent
departures from traditional RL, and is relevant to task specification in
RL, inverse RL and preference-based RL.
Comment: 8 pages + 1 page supplement. In proceedings of AAAI 2019. Slides,
poster and bibtex available at
https://silviupitis.com/#rethinking-the-discount-factor-in-reinforcement-learning-a-decision-theoretic-approac
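
For concreteness, the discounting generalization can be written out as follows (a sketch in assumed notation, not quoted from the paper). With a fixed discount, $V^\pi(s) = \mathbb{E}\big[\sum_{t \ge 0} \gamma^t r_t\big]$; with a state-action dependent factor $\gamma(s, a)$, the constant rate is replaced by an accumulated product,

$$V^\pi(s) \;=\; \mathbb{E}\Big[\sum_{t \ge 0} \Big(\prod_{k=0}^{t-1} \gamma(s_k, a_k)\Big)\, r_t\Big],$$

where individual factors may exceed 1 so long as the accumulated product eventually decays, which is what keeps the sum well defined.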
Trajectory-Based Off-Policy Deep Reinforcement Learning
Policy gradient methods are powerful reinforcement learning algorithms and
have been demonstrated to solve many complex tasks. However, these methods are
also data-inefficient, afflicted with high variance gradient estimates, and
frequently get stuck in local optima. This work addresses these weaknesses by
combining recent improvements in the reuse of off-policy data and exploration
in parameter space with deterministic behavioral policies. The resulting
objective is amenable to standard neural network optimization strategies like
stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo.
Incorporation of previous rollouts via importance sampling greatly improves
data-efficiency, whilst stochastic optimization schemes facilitate the escape
from local optima. We evaluate the proposed approach on a series of continuous
control benchmark tasks. The results show that the proposed algorithm is able
to successfully and reliably learn solutions using fewer system interactions
than standard policy gradient methods.
Comment: Includes appendix. Accepted for ICML 2019.
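
The core combination the abstract describes, deterministic policies with parameter-space exploration plus importance-sampled reuse of old rollouts, can be sketched as follows. This is a minimal illustration under assumed names (log_gaussian, is_weighted_objective) and a diagonal-Gaussian parameter distribution, not the paper's actual implementation:

# Hedged sketch: trajectory-level importance sampling with
# parameter-space exploration (names and shapes are illustrative).
import numpy as np

def log_gaussian(theta, mean, std):
    """Log-density of a diagonal Gaussian over policy parameters."""
    return -0.5 * np.sum(((theta - mean) / std) ** 2
                         + 2 * np.log(std) + np.log(2 * np.pi))

def is_weighted_objective(rollouts, mean, std):
    """Importance-weighted return estimate.

    rollouts: list of (theta, return, old_mean, old_std) tuples, where
    theta was drawn from the behavioral distribution (old_mean, old_std)
    and used as the parameters of a deterministic policy.
    """
    weighted, total_w = 0.0, 0.0
    for theta, ret, old_mean, old_std in rollouts:
        # Trajectory-level weight: density ratio of the sampled
        # parameters under the current vs. behavioral distribution.
        logw = (log_gaussian(theta, mean, std)
                - log_gaussian(theta, old_mean, old_std))
        w = np.exp(logw)
        weighted += w * ret
        total_w += w
    # Self-normalized estimator trades a little bias for lower variance.
    return weighted / max(total_w, 1e-8)

The self-normalized weights keep the estimate stable when the current distribution drifts away from the behavioral one; the stochastic optimizers named in the abstract (SGD, stochastic gradient HMC) would then ascend this objective over (mean, std).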
Sequential Design for Ranking Response Surfaces
We propose and analyze sequential design methods for the problem of ranking
several response surfaces. Namely, given $L$ response surfaces over a
continuous input space $\mathcal{X}$, the aim is to efficiently find the index of
the minimal response across the entire $\mathcal{X}$. The response surfaces are not
known and have to be noisily sampled one-at-a-time. This setting is motivated
by stochastic control applications and requires joint experimental design both
in space and response-index dimensions. To generate sequential design
heuristics we investigate stepwise uncertainty reduction approaches, as well as
sampling based on posterior classification complexity. We also make connections
between our continuous-input formulation and the discrete framework of pure
regret in multi-armed bandits. To model the response surfaces we utilize
kriging surrogates. Several numerical examples using both synthetic data and an
epidemics control problem are provided to illustrate our approach and the
efficacy of the respective adaptive designs.
Comment: 26 pages, 7 figures (updated several sections and figures)
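
To make the sequential design loop concrete, here is a minimal sketch. Assumptions are labeled: a squared-exponential kernel with unit prior variance, a simple uncertainty-gap acquisition rule standing in for the paper's stepwise uncertainty reduction and posterior classification criteria, and hypothetical helper names throughout:

# Hedged sketch: sequential design for ranking response surfaces with
# GP (kriging) surrogates; the acquisition rule is illustrative only.
import numpy as np

def rbf(A, B, ls=0.2):
    """Squared-exponential kernel matrix between input sets A and B."""
    d = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1) / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """GP posterior mean and standard deviation at test inputs Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.clip(1.0 - np.sum(Ks * sol, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def sequential_ranking(samplers, Xs, n_steps=100,
                       rng=np.random.default_rng(0)):
    """samplers[i](x) returns one noisy observation of surface i at x."""
    data = [([], []) for _ in samplers]
    # Initialize each surrogate with a few random samples.
    for i, f in enumerate(samplers):
        for _ in range(3):
            x = Xs[rng.integers(len(Xs))]
            data[i][0].append(x); data[i][1].append(f(x))
    for _ in range(n_steps):
        post = [gp_posterior(np.array(X), np.array(y), Xs)
                for X, y in data]
        mus = np.stack([m for m, _ in post])   # (n_surfaces, n_test)
        sds = np.stack([s for _, s in post])
        # Joint design over input and response index: find where the
        # two lowest posterior means are hardest to tell apart.
        order = np.argsort(mus, axis=0)
        i1, i2 = order[0], order[1]
        cols = np.arange(Xs.shape[0])
        gap = (mus[i2, cols] - mus[i1, cols]) / np.sqrt(
            sds[i1, cols] ** 2 + sds[i2, cols] ** 2)
        j = int(np.argmin(gap))   # most ambiguous input location
        i = i1[j]                 # sample the current leader there
        x = Xs[j]
        data[i][0].append(x); data[i][1].append(samplers[i](x))
    mus = np.stack([gp_posterior(np.array(X), np.array(y), Xs)[0]
                    for X, y in data])
    return np.argmin(mus, axis=0)  # estimated minimal index over Xs

The sketch mirrors the joint experimental design the abstract emphasizes: every step chooses both an input location and a response index, and the final classification (which surface is minimal where) is read off the posterior means.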