3,678 research outputs found
Reinforcement Learning with Non-Markovian Rewards
The standard RL world model is that of a Markov Decision Process (MDP). A
basic premise of MDPs is that the rewards depend on the last state and action
only. Yet, many real-world rewards are non-Markovian. For example, a reward for
bringing coffee only if it was requested earlier and not yet served is
non-Markovian if the state records only current requests and deliveries. Past
work considered
the problem of modeling and solving MDPs with non-Markovian rewards (NMR), but
we know of no principled approaches for RL with NMR. Here, we address the
problem of policy learning from experience with such rewards. We describe and
evaluate empirically four combinations of the classical RL algorithms Q-learning
and R-max with automata learning algorithms to obtain new RL algorithms for
domains with NMR. We also prove that some of these variants converge to an
optimal policy in the limit. Comment: To appear in AAAI 2020
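To make the combination concrete, below is a minimal sketch of the idea these variants build on: once an automaton that tracks the reward-relevant history is available (hand-coded here, whereas the paper learns it), Q-learning can be run on the product of environment state and automaton state, where the reward is Markovian again. The coffee automaton and the environment interface (reset, step returning an event label, actions) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: Q-learning on the product of the environment state
# and the state of an automaton tracking reward-relevant history
# ("coffee requested and not yet served"). The automaton is hand-coded here;
# in the paper it is learned from experience.
import random
from collections import defaultdict

# Hypothetical 3-state automaton: 0 = no pending request, 1 = coffee requested,
# 2 = coffee delivered after a request (the rewarding situation).
def automaton_step(q, event):
    if q == 0 and event == "request":
        return 1
    if q == 1 and event == "deliver":
        return 2
    return q

def automaton_reward(q_next):
    return 1.0 if q_next == 2 else 0.0

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    # Q-values are indexed by the product state (env_state, automaton_state).
    Q = defaultdict(float)
    for _ in range(episodes):
        s, q, done = env.reset(), 0, False
        while not done:
            a = (random.choice(env.actions) if random.random() < eps
                 else max(env.actions, key=lambda a: Q[((s, q), a)]))
            s2, event, done = env.step(a)   # assumed env API: emits an event label
            q2 = automaton_step(q, event)   # advance the history-tracking automaton
            r = automaton_reward(q2)        # reward is Markovian in (s, q)
            best_next = max(Q[((s2, q2), a2)] for a2 in env.actions)
            Q[((s, q), a)] += alpha * (r + gamma * best_next - Q[((s, q), a)])
            s, q = s2, q2
    return Q
```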
Temporal Logic Monitoring Rewards via Transducers
In Markov Decision Processes (MDPs), rewards are assigned according to a function of the last state and action. This is often limiting when the considered domain is not naturally Markovian but becomes so only after careful engineering of an extended state space. The extended states record information from the past that is sufficient to assign rewards by looking just at the last state and action. Non-Markovian Reward Decision Processes (NMRDPs) extend MDPs by allowing for non-Markovian rewards, which depend on the history of states and actions. Non-Markovian rewards can be specified in temporal logics on finite traces such as LTLf/LDLf, with the great advantage of higher abstraction and succinctness; they can then be automatically compiled into an MDP with an extended state space. We contribute both to the techniques for handling temporal rewards and to the methods for engineering them. We first present an approach to compiling temporal rewards which merges the formula automata into a single transducer, sometimes saving up to an exponential number of states. We then define monitoring rewards, which add a further level of abstraction to temporal rewards by adopting the four-valued conditions of runtime monitoring; we argue that our compilation technique allows for an efficient handling of monitoring rewards. Finally, we discuss applications to reinforcement learning.
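As a rough illustration of the target of such a compilation, the sketch below shows a reward transducer as a Mealy machine: it reads the label of the last state and action at each step and emits a reward, so that extending the MDP state with the transducer state makes rewards Markovian. The transition table and labels are assumptions for illustration, not the paper's construction.

```python
# Illustrative sketch of a reward transducer (Mealy machine): it reads one label
# per step and emits a reward, so that rewards become Markovian on the product
# (MDP state, transducer state).
from dataclasses import dataclass
from typing import Dict, Hashable, Tuple

@dataclass
class RewardTransducer:
    initial: Hashable
    # delta maps (transducer_state, label) -> (next_transducer_state, reward)
    delta: Dict[Tuple[Hashable, Hashable], Tuple[Hashable, float]]
    state: Hashable = None

    def reset(self):
        self.state = self.initial
        return self.state

    def step(self, label):
        # Unlisted pairs default to a self-loop with zero reward.
        self.state, reward = self.delta.get((self.state, label), (self.state, 0.0))
        return self.state, reward

# Hypothetical transducer for "reward 1 the first time b is seen after a".
transducer = RewardTransducer(
    initial="q0",
    delta={("q0", "a"): ("q1", 0.0),
           ("q1", "a"): ("q1", 0.0),
           ("q1", "b"): ("done", 1.0)},
)
transducer.reset()
print(transducer.step("a"), transducer.step("b"))  # ('q1', 0.0) ('done', 1.0)
```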
Using Experience Classification for Training Non-Markovian Tasks
Unlike the standard Reinforcement Learning (RL) model, many real-world tasks
are non-Markovian: their rewards are predicated on state history rather than
solely on the current state. Solving non-Markovian tasks, which arise frequently
in practical applications such as autonomous driving, financial trading, and
medical diagnosis, can be quite challenging. We propose a novel RL approach to
achieve non-Markovian rewards expressed in the temporal logic LTLf (Linear
Temporal Logic over Finite Traces). To this end, an encoding of linear
complexity from LTLf into MDPs (Markov Decision Processes) is introduced to
take advantage of advanced RL algorithms. Then, a prioritized experience replay
technique based on the automaton structure (semantically equivalent to the LTLf
specification) is utilized to improve the training process. We empirically
evaluate several benchmark problems augmented with non-Markovian tasks to
demonstrate the feasibility and effectiveness of our approach.
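One possible shape of such a prioritization is sketched below, under the assumption that priorities are boosted for transitions on which the tracking automaton changes state (i.e., progress or regress on the LTLf task); the weighting scheme is an illustration, not the paper's exact rule.

```python
# Illustrative sketch: a replay buffer that up-weights transitions on which the
# task automaton changed state, so they are sampled more often during training.
import random

class AutomatonPrioritizedReplay:
    def __init__(self, capacity=10000, boost=5.0):
        self.capacity = capacity
        self.boost = boost              # extra weight when the automaton moves
        self.buffer, self.priorities = [], []

    def add(self, transition, q_before, q_after):
        priority = self.boost if q_after != q_before else 1.0
        if len(self.buffer) >= self.capacity:   # drop the oldest experience
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        k = min(batch_size, len(self.buffer))
        return random.choices(self.buffer, weights=self.priorities, k=k)
```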
Learning Task Specifications from Demonstrations
Real-world applications often naturally decompose into several sub-tasks. In
many settings (e.g., robotics) demonstrations provide a natural way to specify
the sub-tasks. However, most methods for learning from demonstrations either do
not provide guarantees that the artifacts learned for the sub-tasks can be
safely recombined or limit the types of composition available. Motivated by
this deficit, we consider the problem of inferring Boolean non-Markovian
rewards (also known as logical trace properties or specifications) from
demonstrations provided by an agent operating in an uncertain, stochastic
environment. Crucially, specifications admit well-defined composition rules
that are typically easy to interpret. In this paper, we formulate the
specification inference task as a maximum a posteriori (MAP) probability
inference problem, apply the principle of maximum entropy to derive an analytic
demonstration likelihood model and give an efficient approach to search for the
most likely specification in a large candidate pool of specifications. In our
experiments, we demonstrate how learning specifications can help avoid common
problems that often arise due to ad-hoc reward composition. Comment: NIPS 2018
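A highly simplified sketch of what such MAP inference over a candidate pool can look like: each candidate specification is scored by a maximum-entropy-style log-likelihood (demonstrations satisfying a specification that is unlikely to hold by chance are strong evidence for it) plus a log-prior, and the best-scoring candidate is returned. The scoring formula and inputs are assumptions for illustration; the paper's likelihood model and search are more refined.

```python
# Simplified, hypothetical MAP scoring over a pool of candidate specifications.
import math

def map_specification(demos, candidates, log_prior, chance_sat_prob, beta=5.0):
    """demos: list of traces; candidates: list of functions trace -> bool;
    log_prior[i], chance_sat_prob[i]: prior and chance satisfaction probability
    of candidate i (e.g. estimated from random rollouts)."""
    best_i, best_score = None, -math.inf
    for i, spec in enumerate(candidates):
        sat_rate = sum(spec(trace) for trace in demos) / len(demos)
        # MaxEnt-style log-likelihood: reward satisfaction beyond chance level.
        log_lik = beta * len(demos) * (sat_rate - chance_sat_prob[i])
        score = log_lik + log_prior[i]
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score
```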
Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning
We consider an agent interacting with an environment in a single stream of
actions, observations, and rewards, with no reset. This process is not assumed
to be a Markov Decision Process (MDP). Rather, the agent has several
representations (mapping histories of past interactions to a discrete state
space) of the environment with unknown dynamics, only some of which result in
an MDP. The goal is to minimize the average regret criterion against an agent
who knows an MDP representation giving the highest optimal reward, and acts
optimally in it. Recent regret bounds for this setting are of order O(T^(2/3))
with an additive term constant yet exponential in some characteristics of the
optimal MDP. We propose an algorithm whose regret after T time steps is
O(√T), with all constants reasonably small. This is optimal in T, since
O(√T) is the optimal regret in the setting of learning in a (single
discrete) MDP.
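To make the setting concrete, the sketch below illustrates what it means for a candidate representation (a mapping from histories to discrete states) to "result in an MDP": empirically, conditioning on the full history should add nothing beyond conditioning on (mapped state, action). This naive check is only an illustration of the problem, not the paper's algorithm, whose representation selection comes with regret guarantees.

```python
# Illustrative check of how "Markovian" a candidate representation phi looks:
# compare next-observation distributions conditioned on the full history with
# those conditioned only on (phi(history), action).
from collections import defaultdict

def markov_violation(transitions, phi):
    """transitions: list of (history, action, next_observation) triples.
    Returns the largest total-variation gap found; values near 0 suggest that
    phi induces (approximately) Markovian dynamics on this data."""
    by_state = defaultdict(lambda: defaultdict(int))
    by_history = defaultdict(lambda: defaultdict(int))
    for history, action, obs in transitions:
        s = phi(history)
        by_state[(s, action)][obs] += 1
        by_history[(tuple(history), action, s)][obs] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {o: c / total for o, c in counts.items()}

    worst = 0.0
    for (_, action, s), counts in by_history.items():
        p = normalize(counts)
        q = normalize(by_state[(s, action)])
        gap = 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0))
                        for o in set(p) | set(q))
        worst = max(worst, gap)
    return worst
```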
Omega-Regular Reward Machines
Reinforcement learning (RL) is a powerful approach for training agents to
perform tasks, but designing an appropriate reward mechanism is critical to its
success. However, in many cases, the complexity of the learning objectives goes
beyond the capabilities of the Markovian assumption, necessitating a more
sophisticated reward mechanism. Reward machines and omega-regular languages are
two formalisms used to express non-Markovian rewards for quantitative and
qualitative objectives, respectively. This paper introduces omega-regular
reward machines, which integrate reward machines with omega-regular languages
to enable an expressive and effective reward mechanism for RL. We present a
model-free RL algorithm to compute epsilon-optimal strategies against
omega-regular reward machines and evaluate the effectiveness of the proposed
algorithm through experiments. Comment: To appear in ECAI 2023
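As a rough illustration of the object being introduced, the sketch below shows a reward machine that reads one label per step, emits a scalar reward (the quantitative part), and flags accepting machine states as a stand-in for an omega-regular (e.g. Buechi-style) acceptance condition; the learner then treats (environment state, machine state) as its Markovian state. The class, labels, and acceptance handling are assumptions for illustration, not the paper's construction.

```python
# Illustrative sketch: a reward machine with a quantitative reward output and a
# qualitative acceptance flag standing in for an omega-regular condition.
class RewardMachine:
    def __init__(self, initial, transitions, accepting):
        self.initial = initial
        self.transitions = transitions   # (machine_state, label) -> (next, reward)
        self.accepting = accepting       # states to be visited infinitely often
        self.state = initial

    def reset(self):
        self.state = self.initial
        return self.state

    def step(self, label):
        # Unlisted pairs default to a self-loop with zero reward.
        self.state, reward = self.transitions.get(
            (self.state, label), (self.state, 0.0))
        return self.state, reward, self.state in self.accepting

# Hypothetical machine: +1 whenever "goal" is observed; visiting u1 infinitely
# often would satisfy the qualitative objective.
rm = RewardMachine(
    initial="u0",
    transitions={("u0", "goal"): ("u1", 1.0), ("u1", "goal"): ("u1", 1.0)},
    accepting={"u1"},
)
```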