    Temporal Markov Decision Problems : Formalization and Resolution

    This thesis addresses the question of planning under uncertainty in a changing, time-dependent environment. The original motivation for this work was the problem of building an autonomous agent able to coordinate with its uncertain environment, where this environment is composed of other agents communicating their intentions or of non-controllable processes for which some discrete-event model is available. We investigate several approaches for modeling continuous time-dependency in the framework of Markov Decision Processes (MDPs), leading us to a definition of Temporal Markov Decision Problems. Our approach then focuses on two separate paradigms. First, we investigate time-dependent problems as implicit-event processes and describe them through the formalism of Time-dependent MDPs (TMDPs). We extend the existing results concerning optimality equations and present a new Value Iteration algorithm based on piecewise polynomial function representations in order to solve a more general class of TMDPs. This paves the way to a more general discussion of parametric actions in continuous-time MDPs with hybrid state and action spaces. Second, we investigate the option of separately modeling the concurrent contributions of exogenous events. This explicit-event modeling approach leads to the use of Generalized Semi-Markov Decision Processes (GSMDPs). We establish a link between the general framework of Discrete Event System Specification (DEVS) and the formalism of GSMDPs, allowing us to build sound discrete-event compatible simulators. We then introduce a simulation-based Policy Iteration approach for explicit-event Temporal Markov Decision Problems. This algorithmic contribution brings together results from simulation theory, forward search in MDPs, and statistical learning theory.
The implicit-event approach was tested on a specific version of the Mars rover planning problem and on a drone patrol mission planning problem, while the explicit-event approach was evaluated on a subway network control problem.

    Model-based reinforcement learning with small sample size

    State-of-the-art reinforcement learning (RL) algorithms generally require a large sample of interaction data to learn sufficiently well, which makes it difficult to apply them to problems where data is expensive. This thesis studies exploration, transformation bias, and policy selection in model-based RL for finite MDPs, all of which have a strong impact on sample efficiency. Exploration has previously been studied in the setting where the cumulative reward obtained during learning needs to be maximised. When that cumulative reward is irrelevant and sample efficiency is the primary concern, existing strategies become inefficient and existing analyses become unsuitable. This thesis formulates the planning-for-exploration problem and shows that the efficiency of exploration strategies can be better analysed by comparing their behaviours and exploration costs with the optimal exploration scheme of the planning problem. The weaknesses of existing strategies and the advantages of conducting explicit planning for exploration are presented through an exploration behaviour analysis in tower MDPs. Transformation bias of value estimates in model-based RL has previously been considered insignificant and has received little attention. This thesis presents a systematic empirical study showing that when the sample size is small, the transformation bias is not only significant but can, in some cases, have a disastrous effect on the accuracy of value estimates and on overall learning performance. The novel Bootstrap-based Transformation Bias Correction method is proposed to reduce the transformation bias without requiring any additional data. It can work well even when the sample size per state-action pair is very small, which is not possible with the existing method.
Policy selection is rarely studied, and in most model-based algorithms it is conducted naively by directly comparing two estimated values, which increases the risk of selecting inferior policies due to the asymmetry of the value estimate distributions. To better study the effectiveness of policy selection, two novel family-wise metrics are proposed and analysed in this thesis. The novel Bootstrap-based Policy Voting method is proposed for policy selection and can significantly reduce the risk of policy selection failures. Finally, two novel tournament-based policy refinement methods are proposed, which can improve general RL performance without needing more data.
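To illustrate what transformation bias means in the abstract above, the following is a minimal sketch of generic bootstrap bias correction applied to a plug-in value estimate. It is not the thesis's Bootstrap-based Transformation Bias Correction method, whose details the abstract does not give. The one-state model, the discount factor `GAMMA`, and the function names are assumptions made purely for illustration.

```python
import random

GAMMA = 0.95  # assumed discount factor, chosen for illustration only


def value(p_stay):
    """Value of a state that earns reward 1 per step, stays with
    probability p_stay, and otherwise terminates: 1 / (1 - GAMMA * p_stay).
    This is a nonlinear (convex) transformation of p_stay."""
    return 1.0 / (1.0 - GAMMA * p_stay)


def bootstrap_corrected_value(samples, n_boot=2000, seed=0):
    """Return (plug-in value, bias-corrected value).

    samples holds observed transitions: 1 means the state stayed,
    0 means it terminated. No new data is collected; the observed
    sample is only resampled with replacement.
    """
    rng = random.Random(seed)
    n = len(samples)
    plug_in = value(sum(samples) / n)

    boot_values = []
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in range(n)]
        boot_values.append(value(sum(resample) / n))

    # Standard bootstrap bias estimate: mean of resampled values minus plug-in.
    bias_estimate = sum(boot_values) / n_boot - plug_in
    return plug_in, plug_in - bias_estimate


if __name__ == "__main__":
    observed = [1] * 8 + [0] * 2  # hypothetical small sample of 10 transitions
    plug_in, corrected = bootstrap_corrected_value(observed)
    print(f"plug-in: {plug_in:.3f}, bias-corrected: {corrected:.3f}")
```

Because the value is a convex function of the estimated transition probability, the plug-in estimate tends to overshoot even when the probability estimate itself is unbiased, and the effect grows as the sample shrinks. The correction subtracts the overshoot estimated from resampling.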

    On the connection of probabilistic model checking, planning, and learning for system verification

    This thesis presents approaches that use techniques from the model checking, planning, and learning communities to make systems more reliable and perspicuous. First, two heuristic search and dynamic programming algorithms are adapted to check extremal reachability probabilities, expected accumulated rewards, and their bounded versions on general Markov decision processes (MDPs). This considerably enlarges the problem space originally solvable by these algorithms. Correctness and optimality proofs for the adapted algorithms are given, and a comprehensive case study on established benchmarks shows that the implementation, called Modysh, is competitive with state-of-the-art model checkers and even outperforms them on very large state spaces. Second, Deep Statistical Model Checking (DSMC) is introduced, usable for quality assessment and learning pipeline analysis of systems incorporating trained decision-making agents, such as neural networks (NNs). The idea of DSMC is to use statistical model checking to assess NNs that resolve nondeterminism in systems modeled as MDPs. The versatility of DSMC is exemplified in a number of case studies on Racetrack, an MDP benchmark designed for this purpose, which flexibly models the autonomous driving challenge. A comprehensive scalability study demonstrates that DSMC is a lightweight technique for tackling the complexity of NN analysis in combination with the state space explosion problem.
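The idea of statistically checking an agent that resolves nondeterminism in an MDP can be sketched as a Monte Carlo estimate of a step-bounded reachability probability. This is a minimal illustration under stated assumptions, not the DSMC implementation or the Racetrack benchmark. The toy MDP, the hand-written `policy` standing in for a trained network, and all function names are hypothetical. The number of runs comes from the standard Hoeffding bound.

```python
import math
import random

# Hypothetical toy MDP: positions 0..4 on a line.
GOAL, CRASH = 4, 0
SLIP_PROBABILITY = 0.2  # assumed chance that a move goes the opposite way


def step(state, action, rng):
    """Apply action (+1 or -1); with SLIP_PROBABILITY the move is reversed."""
    move = -action if rng.random() < SLIP_PROBABILITY else action
    return state + move


def policy(state):
    """Stand-in for a trained agent: always tries to move toward the goal."""
    return 1


def run_reaches_goal(start, max_steps, rng):
    """Simulate one run under the policy; True if GOAL is hit within max_steps."""
    state = start
    for _ in range(max_steps):
        if state == GOAL:
            return True
        if state == CRASH:
            return False
        state = step(state, policy(state), rng)
    return state == GOAL


def hoeffding_runs(epsilon, delta):
    """Runs needed so the estimate is within epsilon of the true
    probability with confidence at least 1 - delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_goal_probability(start, epsilon=0.05, delta=0.05,
                              max_steps=50, seed=0):
    """Return (estimated probability of reaching GOAL, number of runs)."""
    rng = random.Random(seed)
    n = hoeffding_runs(epsilon, delta)
    hits = sum(run_reaches_goal(start, max_steps, rng) for _ in range(n))
    return hits / n, n


if __name__ == "__main__":
    estimate, runs = estimate_goal_probability(start=2)
    print(f"estimated goal probability {estimate:.3f} from {runs} runs")
```

Once the policy is fixed, the MDP behaves like a Markov chain, so sampling runs is enough and the state space never has to be enumerated. The price is a statistical guarantee (error `epsilon` with confidence `1 - delta`) rather than an exact result.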