
    LTLf/LDLf Non-Markovian Rewards

    In Markov Decision Processes (MDPs), the reward obtained in a state is Markovian, i.e., it depends only on the last state and action. This restriction makes it difficult to reward more interesting long-term behaviors, such as always closing a door after it has been opened, or providing coffee only following a request. Extending MDPs to handle non-Markovian reward functions was the subject of two previous lines of work. Both use LTL variants to specify the reward function and then compile the new model back into a Markovian model. Building on recent progress in temporal logics over finite traces, we adopt LDLf for specifying non-Markovian rewards and provide an elegant automata construction for building a Markovian model, which extends that of previous work and offers strong minimality and compositionality guarantees.
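
    A minimal sketch of the automaton-product idea behind this kind of compilation, assuming the LDLf reward formula has already been translated into a DFA; the names Dfa, ExtendedMdp and build_extended_mdp are illustrative, not the paper's construction or API.

        from dataclasses import dataclass, field

        @dataclass
        class Dfa:
            states: frozenset
            init: object
            delta: dict            # (dfa_state, label) -> dfa_state
            accepting: frozenset   # the reward is granted on accepting runs

        @dataclass
        class ExtendedMdp:
            states: set = field(default_factory=set)
            transitions: dict = field(default_factory=dict)  # ((s, q), a) -> (s', q')
            rewards: dict = field(default_factory=dict)      # ((s, q), a) -> float

        def build_extended_mdp(mdp_states, mdp_trans, labeling, dfa, reward_value):
            """Pair each MDP state with a DFA state: the DFA tracks the temporal
            formula, so the reward depends only on the current extended state.
            mdp_trans is taken as deterministic purely to keep the sketch short."""
            ext = ExtendedMdp()
            for s in mdp_states:
                for q in dfa.states:
                    ext.states.add((s, q))
            for (s, a), s2 in mdp_trans.items():
                for q in dfa.states:
                    q2 = dfa.delta[(q, labeling[s2])]
                    ext.transitions[((s, q), a)] = (s2, q2)
                    # Markovian reward: paid exactly when the DFA reaches acceptance.
                    ext.rewards[((s, q), a)] = reward_value if q2 in dfa.accepting else 0.0
            return ext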

    Temporal Logic Monitoring Rewards via Transducers

    In Markov Decision Processes (MDPs), rewards are assigned according to a function of the last state and action. This is often limiting when the considered domain is not naturally Markovian but becomes so only after careful engineering of an extended state space. The extended states record information from the past that is sufficient to assign rewards by looking just at the last state and action. Non-Markovian Reward Decision Processes (NMRDPs) extend MDPs by allowing for non-Markovian rewards, which depend on the history of states and actions. Non-Markovian rewards can be specified in temporal logics on finite traces such as LTLf/LDLf, with the great advantage of higher abstraction and succinctness; they can then be automatically compiled into an MDP with an extended state space. We contribute both to the techniques for handling temporal rewards and to the solutions for engineering them. We first present an approach to compiling temporal rewards which merges the formula automata into a single transducer, sometimes saving up to an exponential number of states. We then define monitoring rewards, which add a further level of abstraction to temporal rewards by adopting the four-valued conditions of runtime monitoring; we argue that our compilation technique allows for efficient handling of monitoring rewards. Finally, we discuss applications to reinforcement learning.
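
    A small sketch of the single-transducer idea, assuming each reward formula has already been compiled into a DFA; merge_into_transducer and its argument layout are assumptions made for illustration, and the state savings reported in the abstract would come from minimising the merged machine, which this sketch omits.

        def merge_into_transducer(dfas, reward_values):
            """dfas: list of (init, delta, accepting) triples, one per reward formula;
            reward_values: reward paid whenever the corresponding formula is satisfied.
            Returns (initial_state, step), where step maps (state, label) to
            (next_state, reward_emitted_on_this_transition)."""
            initial_state = tuple(init for (init, _, _) in dfas)

            def step(state, label):
                next_state, output = [], 0.0
                for q, (_, delta, accepting), r in zip(state, dfas, reward_values):
                    q2 = delta[(q, label)]
                    next_state.append(q2)
                    if q2 in accepting:      # this formula's reward fires on this step
                        output += r
                return tuple(next_state), output

            return initial_state, step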

    Pure-Past Linear Temporal and Dynamic Logic on Finite Traces

    LTLf and LDLf are well-known logics on finite traces. We review PLTLf and PLDLf, their pure-past versions. These are interpreted backward, from the end of the trace towards the beginning. Because of this, we can exploit a foundational result on reverse languages to get an exponential improvement, with respect to LTLf/LDLf, in computing the corresponding DFA. This exponential improvement is reflected in several forms of sequential decision making involving temporal specifications, such as planning and decision problems in non-deterministic and non-Markovian domains. Interestingly, PLTLf (resp. PLDLf) has the same expressive power as LTLf (resp. LDLf), but transforming a PLTLf (resp. PLDLf) formula into its equivalent in LTLf (resp. LDLf) is quite expensive. Hence, to take advantage of the exponential improvement, properties of interest must be directly expressed in PLTLf/PLDLf.
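
    The property that enables this, namely that a pure-past formula can be evaluated by carrying only the previous position's valuation of its subformulas, can be illustrated with a small interpreter; the tuple encoding and eval_ppltl below are assumptions for illustration, not an existing library.

        def eval_ppltl(formula, trace):
            """Formulas are nested tuples: ('atom', name), ('not', f), ('and', f, g),
            ('Y', f) for 'yesterday', ('S', f, g) for 'f since g'.
            trace is a list of sets of true atoms; returns truth at the last position."""
            def collect(f, out):
                out.append(f)
                if f[0] != 'atom':
                    for child in f[1:]:
                        collect(child, out)
            subs = []
            collect(formula, subs)

            prev = {id(f): False for f in subs}    # valuation at the previous position
            for step in trace:
                cur = {}
                for f in reversed(subs):           # descendants before ancestors
                    op = f[0]
                    if op == 'atom':
                        cur[id(f)] = f[1] in step
                    elif op == 'not':
                        cur[id(f)] = not cur[id(f[1])]
                    elif op == 'and':
                        cur[id(f)] = cur[id(f[1])] and cur[id(f[2])]
                    elif op == 'Y':                # yesterday
                        cur[id(f)] = prev[id(f[1])]
                    elif op == 'S':                # since
                        cur[id(f)] = cur[id(f[2])] or (cur[id(f[1])] and prev[id(f)])
                prev = cur
            return prev[id(formula)]

        # Example: "the door was opened and has not been closed since".
        opened_not_closed = ('S', ('not', ('atom', 'close')), ('atom', 'open'))
        print(eval_ppltl(opened_not_closed, [{'open'}, set(), {'close'}]))  # False

    Since the only information carried between positions is one Boolean per subformula, the corresponding DFA has at most single-exponentially many states in the formula size, which is the improvement over the worst-case double-exponential LTLf-to-DFA construction.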

    Clock specifications for temporal tasks in planning and learning

    Recently, Linear Temporal Logics on finite traces, such as LTLf (or LDLf), have been advocated as high-level formalisms to express dynamic properties, such as goals in planning domains or rewards in Reinforcement Learning (RL). This paper addresses the challenge of separating high-level temporal specifications from the low-level details of the underlying environment (domain or MDP) by allowing the specifications to be expressed at a different time granularity than the environment. We study the notion of a clock which progresses the high-level LTLf specification, whose ticks are triggered by dynamic (low-level) properties defined on the underlying environment. The obtained separation enables terse high-level specifications while allowing for very expressive forms of clocks, expressed as general LTLf properties over low-level features, such as counting or occurrence/alternation of special events. We devise an automata-based construction to compile away the clock into a deterministic automaton that is polynomial in the size of the automata characterizing the high-level and clock specifications. We show the correctness of the approach and discuss its application in several contexts, including FOND planning, RL with LTLf Restraining Bolts, and Reward Machines.
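
    A minimal sketch of the kind of product such a compilation can rest on, with assumed names (compose_clock, tick_states) and the simplification that both automata read the same labels; the construction in the paper is more general.

        def compose_clock(high_delta, clock_delta, tick_states, init_high, init_clock):
            """Product over pairs (high_state, clock_state): the clock automaton reads
            every low-level label, and the high-level specification automaton is
            progressed only when the clock enters a tick state."""
            def step(state, label):
                q_high, q_clock = state
                q_clock2 = clock_delta[(q_clock, label)]
                if q_clock2 in tick_states:        # a tick: advance the high-level spec
                    q_high = high_delta[(q_high, label)]
                return (q_high, q_clock2)
            return (init_high, init_clock), step

    The product has at most |Q_high| x |Q_clock| states, which is where a polynomial bound of the kind stated in the abstract comes from.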

    Enabling Markovian Representations under Imperfect Information

    Markovian systems are widely used in reinforcement learning (RL) when the successful completion of a task depends exclusively on the last interaction between an autonomous agent and its environment. Unfortunately, real-world instructions are typically complex and often better described as non-Markovian. In this paper we present an extension method that allows solving partially-observable non-Markovian reward decision processes (PONMRDPs) by solving equivalent Markovian models. This potentially enables state-of-the-art Markovian techniques, including RL, to find optimal behaviours for problems best described as PONMRDPs. We provide formal optimality guarantees for our extension method, together with a counterexample illustrating that naive extensions of existing techniques for fully-observable environments cannot provide such guarantees.
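
    To make the state-extension idea concrete, here is a deliberately naive sketch in a gym-style interface (all names are assumptions): a reward automaton is run on observable labels and its state is appended to each observation. The paper's point is precisely that such naive extensions, carried over from fully-observable settings, do not by themselves guarantee optimality under partial observability.

        class AutomatonObservationWrapper:
            """Track the history through a reward automaton driven by observations,
            and expose the automaton state as part of the observation."""

            def __init__(self, env, delta, init_state, labeler):
                self.env, self.delta = env, delta
                self.init_state, self.labeler = init_state, labeler

            def reset(self):
                self.q = self.init_state
                obs = self.env.reset()
                return (obs, self.q)

            def step(self, action):
                obs, reward, done, info = self.env.step(action)
                self.q = self.delta[(self.q, self.labeler(obs))]  # automaton reads observable labels only
                return (obs, self.q), reward, done, info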

    Model Checking Strategies from Synthesis Over Finite Traces

    The innovations in reactive synthesis from Linear Temporal Logic over finite traces (LTLf) will be amplified by the ability to verify the correctness of the strategies generated by LTLf synthesis tools. This motivates our work on LTLf model checking. LTLf model checking, however, is not straightforward. The strategies generated by LTLf synthesis may be represented using terminating transducers or non-terminating transducers, whose executions are of finite-but-unbounded length or infinite length, respectively. For synthesis, there is no evidence that one type of transducer is better than the other, since both exhibit the same complexity and admit similar algorithms. In this work, we show that for model checking, the two types of transducers are fundamentally different. Our central result is that LTLf model checking of non-terminating transducers is exponentially harder than that of terminating transducers. We show that the problems are EXPSPACE-complete and PSPACE-complete, respectively. Hence, considering the feasibility of verification, LTLf synthesis tools should synthesize terminating transducers. This is, to the best of our knowledge, the first evidence to prefer one type of transducer over the other in LTLf synthesis.

    DeepSynth: Automata Synthesis for Automatic Task Segmentation in Deep Reinforcement Learning

    This paper proposes DeepSynth, a method for effective training of deep Reinforcement Learning (RL) agents when the reward is sparse and non-Markovian, but at the same time progress towards the reward requires achieving an unknown sequence of high-level objectives. Our method employs a novel algorithm for the synthesis of compact automata to uncover this sequential structure automatically. We synthesise a human-interpretable automaton from trace data collected by exploring the environment. The state space of the environment is then enriched with the synthesised automaton, so that the generation of a control policy by deep RL is guided by the discovered structure encoded in the automaton. The proposed approach is able to cope with both high-dimensional low-level features and unknown sparse non-Markovian rewards. We have evaluated DeepSynth's performance in a set of experiments that includes the Atari game Montezuma's Revenge. Compared to existing approaches, we obtain a reduction of two orders of magnitude in the number of iterations required for policy synthesis, as well as a significant improvement in scalability.
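
    A high-level sketch of the loop described above, with placeholder components (detect_event, synthesise_automaton, train_policy, env.actions) standing in for parts this illustration does not implement; it is not DeepSynth's actual interface.

        import random

        def automaton_guided_training(env, detect_event, synthesise_automaton,
                                      train_policy, iterations=10):
            """Alternate between collecting high-level event traces, synthesising a
            compact automaton consistent with them, and training deep RL on states
            enriched with the automaton state."""
            traces, automaton = [], None
            for _ in range(iterations):
                trace, obs, done = [], env.reset(), False
                while not done:
                    action = random.choice(env.actions)   # placeholder exploration policy
                    obs, _, done, _ = env.step(action)
                    event = detect_event(obs)             # high-level event seen, if any
                    if event is not None:
                        trace.append(event)
                traces.append(trace)
                automaton = synthesise_automaton(traces)  # compact, human-interpretable
                train_policy(env, automaton)              # guide deep RL with its structure
            return automaton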