Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes
Policy-based algorithms are among the most widely adopted techniques in
model-free RL, thanks to their strong theoretical grounding and favorable
properties in continuous action spaces. Unfortunately, these methods require
precise and problem-specific hyperparameter tuning to achieve good performance,
and tend to struggle when asked to accomplish a series of heterogeneous tasks.
In particular, the selection of the step size has a crucial impact on their
ability to learn a high-performing policy, affecting the speed and the
stability of the training process, and often being the main culprit for poor
results. In this paper, we tackle these issues with a Meta Reinforcement
Learning approach, by introducing a new formulation, known as meta-MDP, that
can be used to solve any hyperparameter selection problem in RL with contextual
processes. After deriving a theoretical Lipschitz bound on the performance
difference across tasks, we adopt the proposed framework to train a batch RL
algorithm that dynamically recommends the most suitable step size for
different policies and tasks. Finally, we present an experimental campaign
showing the advantages of selecting an adaptive learning rate in heterogeneous
environments.
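The abstract above describes a meta-MDP whose action is the step size of an inner policy-gradient learner. The minimal Python sketch below illustrates that idea under strong simplifying assumptions: a toy one-dimensional task family, a REINFORCE learner, and a greedy one-step recommender (`recommend_step_size`) standing in for the batch-RL meta-policy the paper actually trains. All names and the toy task are hypothetical, not taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of a meta-MDP for step-size
# selection: an outer controller picks the policy-gradient step size for an
# inner learner, conditioned on the task context and the current policy.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task is identified by a context c; the reward of action a is -(a - c)^2."""
    return rng.uniform(-2.0, 2.0)

def reinforce_gradient(theta, c, n_samples=64, sigma=0.5):
    """REINFORCE estimate of d E[r] / d theta for a Gaussian policy N(theta, sigma^2)."""
    a = theta + sigma * rng.standard_normal(n_samples)
    r = -(a - c) ** 2
    # score function: d log pi / d theta = (a - theta) / sigma^2, with mean-reward baseline
    return np.mean((r - r.mean()) * (a - theta) / sigma**2)

def recommend_step_size(theta, c, candidates=(0.01, 0.05, 0.2)):
    """Stand-in for the learned meta-policy: pick the candidate step size that
    gives the best one-step improvement on the current task (noise-free proxy)."""
    g = reinforce_gradient(theta, c)
    best_eta, best_val = candidates[0], -np.inf
    for eta in candidates:
        val = -(theta + eta * g - c) ** 2
        if val > best_val:
            best_eta, best_val = eta, val
    return best_eta

theta = 0.0
for episode in range(20):
    c = sample_task()                      # new task drawn from the contextual process
    for _ in range(10):                    # inner policy-gradient steps on this task
        eta = recommend_step_size(theta, c)
        theta += eta * reinforce_gradient(theta, c)
    print(f"context={c:+.2f}  theta={theta:+.2f}")
```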
AACC: Asymmetric Actor-Critic in Contextual Reinforcement Learning
Reinforcement Learning (RL) techniques have drawn great attention in many
challenging tasks, but their performance deteriorates dramatically when applied
to real-world problems. Various methods, such as domain randomization, have
been proposed to deal with such situations by training agents under different
environmental setups so that they can generalize to different
environments during deployment. However, they usually do not properly
incorporate information about the underlying environmental factors the agents
interact with, and thus can be overly conservative when facing changes in the
surroundings. In this paper, we first formalize the task of adapting to
changing environmental dynamics in RL as a generalization problem using
Contextual Markov Decision Processes (CMDPs). We then propose the Asymmetric
Actor-Critic in Contextual RL (AACC) as an end-to-end actor-critic method to
deal with such generalization tasks. We experimentally demonstrate significant
improvements in the performance of AACC over existing baselines in a range of
simulated environments.
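As a concrete illustration of the asymmetry described above, here is a minimal PyTorch sketch in which the critic conditions on the privileged environment context during training while the actor sees only the observation, so deployment does not require the context. Network sizes, the Gaussian policy head, and all names (`AsymmetricActorCritic`, `act`, `value`) are illustrative assumptions, not the paper's architecture.

```python
# Architectural sketch of an asymmetric actor-critic for contextual RL
# (illustrative, not the paper's exact design).
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    def __init__(self, obs_dim, ctx_dim, act_dim, hidden=64):
        super().__init__()
        # Actor: observation only (context-free, usable at deployment).
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Critic: observation + context (privileged information, training only).
        self.critic = nn.Sequential(
            nn.Linear(obs_dim + ctx_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def act(self, obs):
        mean = self.actor(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1)

    def value(self, obs, ctx):
        return self.critic(torch.cat([obs, ctx], dim=-1)).squeeze(-1)

# Usage: the value (and hence the advantage used in the policy-gradient update)
# is computed with the context, but action selection never needs it.
model = AsymmetricActorCritic(obs_dim=8, ctx_dim=3, act_dim=2)
obs, ctx = torch.randn(8), torch.randn(3)
action, logp = model.act(obs)
v = model.value(obs, ctx)
```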
Sample Complexity Characterization for Linear Contextual MDPs
Contextual Markov decision processes (CMDPs) describe a class of
reinforcement learning problems in which the transition kernels and reward
functions can change over time with different MDPs indexed by a context
variable. While CMDPs serve as an important framework to model many real-world
applications with time-varying environments, they are largely unexplored from
a theoretical perspective. In this paper, we study CMDPs under two linear
function approximation models: Model I with context-varying representations and
common linear weights for all contexts; and Model II with common
representations for all contexts and context-varying linear weights. For both
models, we propose novel model-based algorithms and show that they enjoy
a guaranteed $\epsilon$-suboptimality gap with the desired polynomial sample
complexity. In particular, instantiating our result for the first model to the
tabular CMDP improves the existing result by removing the reachability
assumption. Our result for the second model is the first known result for this
type of function approximation model. Comparison between our results for the
two models further indicates that having context-varying features leads to much
better sample efficiency than having common representations for all contexts
under linear CMDPs.
Comment: accepted to AIstats202
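A possible way to write down the two function approximation models described above, with notation assumed for illustration rather than quoted from the paper ($c$ the context, $(s,a)$ a state-action pair, $h$ the step, $\phi$ feature maps, $\theta$ and $\mu$ linear parameters):

```latex
% Illustrative formalization only; the symbols below are assumptions,
% not the paper's notation.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
\begin{align*}
&\text{Model I (context-varying features, shared weights):}\\
&\qquad r_h(s,a;c) = \phi_c(s,a)^{\top}\theta_h,
  \qquad \mathbb{P}_h(\cdot \mid s,a;c) = \phi_c(s,a)^{\top}\mu_h(\cdot),\\[4pt]
&\text{Model II (shared features, context-varying weights):}\\
&\qquad r_h(s,a;c) = \phi(s,a)^{\top}\theta_{h,c},
  \qquad \mathbb{P}_h(\cdot \mid s,a;c) = \phi(s,a)^{\top}\mu_{h,c}(\cdot).
\end{align*}
\end{document}
```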