11 research outputs found

    Deep Q-learning: a robust control approach

    This work aims at constructing a bridge between robust control theory and reinforcement learning. Although reinforcement learning has shown admirable results in complex control tasks, the agent's learning behaviour is opaque. Meanwhile, system theory offers several tools for analyzing and controlling dynamical systems. This paper places deep Q-learning into a control-oriented perspective to study its learning dynamics with well-established techniques from robust control. An uncertain linear time-invariant model is formulated by means of the neural tangent kernel to describe learning. This novel approach makes it possible to give conditions for the stability (convergence) of learning and enables the analysis of the agent's behaviour in the frequency domain. The control-oriented approach also allows formulating robust controllers that inject dynamical rewards as control inputs into the loss function to achieve better convergence properties. Three output-feedback controllers are synthesized: a gain-scheduled H2, a dynamic H∞, and a fixed-structure H∞ controller. Compared to traditional deep Q-learning techniques, which involve several heuristics, setting up the learning agent with a control-oriented tuning methodology is more transparent and rests on well-established literature. The proposed approach uses neither a target network nor a randomized replay memory. The role of the target network is taken over by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that the H∞-controlled learning can converge faster and achieve higher scores (depending on the environment) than the benchmark Double deep Q-learning.
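    Since the abstract hinges on describing learning dynamics through the neural tangent kernel, the following minimal Python sketch (not the paper's code) computes the empirical NTK Gram matrix of a small Q-network on a fixed batch; under plain gradient descent with learning rate eta, the batch Q-values evolve approximately as Q_{t+1} ≈ Q_t - eta·K·(Q_t - y_t), which is the kind of linear model such a control-oriented analysis can build on. The network size, state dimension, and action count are illustrative assumptions.

import torch

torch.manual_seed(0)
state_dim, n_actions, hidden = 4, 2, 64   # assumed sizes (CartPole-like)

q_net = torch.nn.Sequential(
    torch.nn.Linear(state_dim, hidden), torch.nn.ReLU(),
    torch.nn.Linear(hidden, n_actions),
)

def empirical_ntk(net, states, actions):
    """Gram matrix K[i, j] = <dQ(s_i, a_i)/dtheta, dQ(s_j, a_j)/dtheta>."""
    grads = []
    for s, a in zip(states, actions):
        net.zero_grad()
        q = net(s.unsqueeze(0))[0, a]
        q.backward()
        g = torch.cat([p.grad.flatten() for p in net.parameters()])
        grads.append(g.clone())
    J = torch.stack(grads)                # (batch, n_params) Jacobian of the Q-values
    return J @ J.T                        # (batch, batch) empirical NTK

states = torch.randn(8, state_dim)        # a small batch of sampled states
actions = torch.randint(0, n_actions, (8,))
K = empirical_ntk(q_net, states, actions)
# The spectrum of eta*K indicates how fast (or whether) the batch Q-values converge.
print(torch.linalg.eigvalsh(K))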

    Public transport trajectory planning with probabilistic guarantees

    The paper proposes an eco-cruise control strategy for urban public transport buses. The aim of the velocity control is to ensure timetable adherence while considering upstream queue lengths at traffic lights in a probabilistic way. The contribution of the paper is twofold. First, the shockwave profile model (SPM) is extended to capture the stochastic nature of traffic queue lengths. The model is adequate for describing frequent traffic state interruptions at signalized intersections. Based on the distribution function of the stochastic traffic volume demand, the randomness in queue lengths, wave fronts, and vehicle numbers is derived. Then, an outlook is provided on its applicability as a full-scale urban traffic network model. Second, a shrinking-horizon model predictive controller (MPC) is proposed for ensuring timetable reliability. The intention is to calculate optimal velocity commands based on the current position and desired arrival time of the bus while considering upcoming delays due to red signals and eventual queues. The proposed stochastic traffic model is incorporated into a rolling-horizon optimization via chance constraints. In the optimization, probabilistic guarantees are formulated to minimize delay due to standstill in queues at signalized intersections. Optimization results are analyzed from two particular aspects: (i) feasibility and (ii) closed-loop performance. The novel stochastic profile model is tested in a high-fidelity traffic simulator. Comparative simulation results show the viability and importance of stochastic bounds in urban trajectory design. The proposed algorithm yields smoother bus trajectories along an urban corridor, suggesting energy savings compared to benchmark control strategies.
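    To make the chance-constrained trajectory planning concrete, the following minimal Python sketch (illustrative, not the paper's SPM-based model) replaces the probabilistic queue-length requirement with its deterministic equivalent via the p-quantile of an assumed Poisson queue and solves one shrinking-horizon velocity optimization with cvxpy; the signal timing, demand statistics, and all numbers are assumptions.

import cvxpy as cp
from scipy.stats import poisson

dt, N = 1.0, 60                          # 60 one-second steps until the desired arrival
v_max, x_signal, x_stop = 15.0, 400.0, 550.0
red = range(20, 45)                      # steps during which the signal ahead is red (assumed)

# Stochastic queue: Poisson vehicle arrivals during red, ~7 m of road space per queued vehicle.
lam, spacing, p_conf = 6.0, 7.0, 0.95
queue_tail = x_signal - spacing * poisson.ppf(p_conf, lam)   # p-quantile of the back-of-queue position

v = cp.Variable(N)                       # speed profile (decision variable)
x = cp.Variable(N + 1)                   # bus position along the corridor

constraints = [x[0] == 0, v >= 0, v <= v_max]
constraints += [x[k + 1] == x[k] + dt * v[k] for k in range(N)]
constraints += [x[k] <= queue_tail for k in red]             # reformulated chance constraint
constraints += [x[N] == x_stop]          # timetable adherence: reach the stop at the horizon end

objective = cp.Minimize(cp.sum_squares(cp.diff(v)))          # smooth, energy-conscious speed profile
cp.Problem(objective, constraints).solve()
print("planned speeds [m/s]:", v.value.round(2))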

    Constrained Policy Gradient Method for Safe and Fast Reinforcement Learning: a Neural Tangent Kernel Based Approach

    This paper presents a constrained policy gradient algorithm. We introduce constraints for safe learning with the following steps. First, learning is slowed down (lazy learning) so that the episodic policy change can be computed with the help of the policy gradient theorem and the neural tangent kernel. This also enables the evaluation of the policy at arbitrary states. In the same spirit, learning can be guided, ensuring safety, by augmenting episode batches with states where the desired action probabilities are prescribed. Finally, exogenous discounted sums of future rewards (returns) can be computed at these specific state-action pairs such that the policy network satisfies the constraints. Computing the returns is based on solving a system of linear equations (equality constraints) or a constrained quadratic program (inequality constraints). Simulation results suggest that adding constraints (external information) to the learning can reasonably improve learning speed and safety if the constraints are appropriately selected. The efficiency of the constrained learning was demonstrated with a shallow and wide ReLU network in the Cartpole and Lunar Lander OpenAI Gym environments. The main novelty of the paper is a practical use of the neural tangent kernel in reinforcement learning.
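    A minimal numerical sketch of the last step (not the paper's exact formulation): assume a matrix A, obtained from the policy gradient theorem and the neural tangent kernel, maps the exogenous returns g at the prescribed state-action pairs to the induced change in the corresponding action probabilities. The equality-constrained case then reduces to a linear solve and the inequality-constrained case to a small quadratic program; A and b below are random placeholders.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m = 4                                         # number of prescribed state-action pairs
A = rng.standard_normal((m, m))               # placeholder NTK-based sensitivity matrix
b = rng.standard_normal(m)                    # desired change in the action probabilities

# Equality constraints A g = b: the exogenous returns follow from a linear solve.
g_eq = np.linalg.solve(A, b)

# Inequality constraints A g >= b: smallest returns satisfying them, via a quadratic program.
res = minimize(fun=lambda g: g @ g,
               x0=np.zeros(m),
               constraints=[{"type": "ineq", "fun": lambda g: A @ g - b}])
g_ineq = res.x
print("equality-constrained returns:  ", g_eq)
print("inequality-constrained returns:", g_ineq)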

    Deep Q-learning: a robust control approach

    In this paper, we place deep Q-learning into a control-oriented perspective and study its learning dynamics with well-established techniques from robust control. We formulate an uncertain linear time-invariant model by means of the neural tangent kernel to describe learning. We show the instability of learning and analyze the agent's behavior in the frequency domain. Then, we ensure convergence via robust controllers acting as dynamical rewards in the loss function. We synthesize three controllers: a state-feedback gain-scheduled H2 controller, a dynamic H∞ controller, and a constant-gain H∞ controller. Setting up the learning agent with a control-oriented tuning methodology is more transparent and rests on well-established literature compared to the heuristics in reinforcement learning. In addition, our approach does not use a target network or a randomized replay memory. The role of the target network is taken over by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that the H∞-controlled learning performs slightly better than Double deep Q-learning.
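    The following minimal sketch illustrates the idea that a control input in the temporal-difference target can take over the role of a target network while learning from sequential samples. A simple constant-gain (proportional) term on the TD error stands in for the synthesized H2/H∞ controllers; the gain, exploration rate, and other hyperparameters are placeholder assumptions, and the Gymnasium API is used in place of the original OpenAI Gym.

import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

q_net = torch.nn.Sequential(
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, n_actions),
)
opt = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma, k_u = 0.99, 0.5                           # discount factor; placeholder controller gain

obs, _ = env.reset(seed=0)
for step in range(500):                          # sequential samples, no replay buffer
    s = torch.as_tensor(obs, dtype=torch.float32)
    a = int(q_net(s).argmax()) if torch.rand(1) > 0.1 else env.action_space.sample()
    obs_next, r, terminated, truncated, _ = env.step(a)
    s_next = torch.as_tensor(obs_next, dtype=torch.float32)

    q_sa = q_net(s)[a]
    with torch.no_grad():
        bootstrap = 0.0 if terminated else gamma * q_net(s_next).max()
        td_error = r + bootstrap - q_sa
        u = k_u * td_error                       # control input acting as a dynamical reward
        target = r + bootstrap + u               # target shaped by the controller, no target network

    loss = (q_sa - target) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

    obs = obs_next if not (terminated or truncated) else env.reset()[0]
env.close()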

    Data-driven distance metrics for kriging - Short-term urban traffic state prediction

    Estimating traffic flow states at unmeasured urban locations provides a cost-efficient solution for many ITS applications. In this work, the geostatistical framework of kriging is extended so that it can both estimate and predict traffic volume and speed at various unobserved locations in real time. In the paper, different distance metrics for kriging are evaluated, and a new, data-driven one is formulated, capturing the similarity of measurement sites. With multidimensional scaling, the distances are then transformed into a hyperspace where the kriging algorithm can be used. As a next step, temporal dependency is injected into the estimator by extending the hyperspace with an extra dimension, enabling short-horizon traffic flow prediction. Additionally, a temporal correction is proposed to compensate for minor changes in traffic flow patterns. Numerical results suggest that the spatio-temporal predictor makes more accurate predictions than other distance-metric-based kriging algorithms. Compared to deep learning, the results are on par, while the algorithm is more resilient to traffic pattern changes.
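    A minimal sketch of that pipeline on synthetic data (not the paper's implementation): a data-driven dissimilarity between detector sites is derived from the correlation of their historical volume series, embedded into a Euclidean hyperspace with multidimensional scaling, and a Gaussian-process regressor (kriging) is fitted in that space to estimate the traffic volume at a held-out site. All data and model settings are illustrative; scikit-learn is assumed.

import numpy as np
from sklearn.manifold import MDS
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
n_sites, n_samples = 12, 200
base = rng.standard_normal(n_samples)
# Synthetic historical volumes: correlated across sites, plus site-specific noise.
history = np.stack([base * (0.5 + 0.05 * i) + 0.3 * rng.standard_normal(n_samples)
                    for i in range(n_sites)])

# Data-driven distance: weakly correlated sites are treated as far apart.
corr = np.corrcoef(history)
dissimilarity = 1.0 - corr
np.fill_diagonal(dissimilarity, 0.0)

# Embed the sites into a 2-D hyperspace whose Euclidean distances match the dissimilarity.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissimilarity)

# Kriging (Gaussian-process regression) in the embedded space:
# fit on all sites but one, then estimate the current volume at the held-out site.
current_volume = history[:, -1]
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(coords[:-1], current_volume[:-1])
estimate, std = gp.predict(coords[-1:], return_std=True)
print(f"held-out site: true={current_volume[-1]:.2f}, kriged={estimate[0]:.2f} (std {std[0]:.2f})")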

    Controlled Descent Training

    In this work, a novel, model-based artificial neural network (ANN) training method is developed, supported by optimal control theory. The method augments the training labels in order to robustly guarantee convergence of the training loss and to improve the training convergence rate. Dynamic label augmentation is proposed within the framework of gradient descent training, where the convergence of the training loss is controlled. First, we capture the training behavior with the help of empirical Neural Tangent Kernels (NTK) and borrow tools from systems and control theory to analyze both the local and global training dynamics (e.g., stability, reachability). Second, we propose to dynamically alter the gradient descent training mechanism via fictitious labels acting as control inputs and an optimal state-feedback policy. In this way, we enforce locally H2-optimal and convergent training behavior. The novel algorithm, Controlled Descent Training (CDT), guarantees local convergence. CDT unleashes new potential in the analysis, interpretation, and design of ANN architectures. The applicability of the method is demonstrated on standard regression and classification problems.
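    A minimal sketch of the label-augmentation mechanism on a toy regression task (using a placeholder proportional feedback, not the locally H2-optimal policy designed in the paper): on a fixed batch, the training residual e = f(X) - y is treated as the state, and fictitious labels u = -k·e enter the loss as control inputs, reshaping the approximate gradient descent dynamics e_{t+1} ≈ (I - eta·K)·e_t governed by the empirical NTK K. The network, data, learning rate, and gain k are assumptions.

import torch

torch.manual_seed(0)
X = torch.linspace(-1, 1, 32).unsqueeze(1)
y = torch.sin(3 * X)                                   # toy regression target

net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.01)
k_fb = 0.5                                             # placeholder feedback gain (designed optimally in the paper)

for step in range(400):
    with torch.no_grad():
        residual = net(X) - y                          # training "state" seen by the controller
        u = -k_fb * residual                           # fictitious labels acting as the control input
    loss = torch.mean((net(X) - (y + u)) ** 2)         # controlled training loss with augmented labels
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        print(step, float(torch.mean(residual ** 2)))  # loss w.r.t. the true labels keeps decreasing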

    Optimal headway and schedule control of public transport buses

    This paper presents a model-based multiobjective control strategy to reduce bus bunching and hence improve public transport reliability. Our goal is twofold. First, we define a proper model consisting of multiple static and dynamic components. A bus-following model captures the longitudinal dynamics, taking into account the interaction with the surrounding traffic. Furthermore, bus stop operations are modeled to estimate dwell time. Second, a shrinking-horizon model predictive controller (MPC) is proposed for solving the bus bunching problem. The model is able to predict the short-term time-space behavior of public transport buses, enabling a constrained, finite-horizon, optimal control solution that ensures homogeneity of service both in time and space. The goal of the selected rolling-horizon control scheme is to choose a proper velocity profile for the public transport bus such that it keeps both the timetable schedule and a desired headway from the bus in front of it (the leading bus). The control strategy predicts the arrival time at a bus stop using a passenger arrival and dwell time model. In this vein, the receding-horizon model predictive controller calculates an optimal velocity profile based on the bus's current position and desired arrival time. Three different weighting strategies are proposed to test (i) timetable-only, (ii) headway-only, or (iii) balanced timetable-headway tracking. The controller is tested in a high-fidelity traffic simulator with realistic scenarios. The behavior of the system is analyzed under extreme disturbances. Finally, the existence of a Pareto front between the two objectives is also demonstrated.
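    The balanced timetable-headway objective (option (iii) above) can be sketched as a small convex program. The following illustrative cvxpy snippet omits the bus-following and dwell-time components of the full model and assumes a given prediction of the leading bus's trajectory; all numerical values are chosen for illustration only.

import numpy as np
import cvxpy as cp

dt, N = 5.0, 24                          # 24 five-second steps left until the scheduled arrival
v_max, x_stop = 15.0, 900.0              # speed limit and distance to the next stop
h_des = 300.0                            # desired spacing behind the leading bus [m]
lead = 600.0 + 10.0 * dt * np.arange(N + 1)   # assumed prediction of the leading bus's position

w_tt, w_hw, w_smooth = 1.0, 1.0, 0.1     # (iii) balanced timetable - headway weighting

v = cp.Variable(N)                       # speed profile of the controlled bus
x = cp.Variable(N + 1)                   # its position along the route
constraints = [x[0] == 0, v >= 0, v <= v_max]
constraints += [x[k + 1] == x[k] + dt * v[k] for k in range(N)]

cost = (w_tt * cp.square(x[N] - x_stop)              # timetable: be at the stop when the horizon ends
        + w_hw * cp.sum_squares(x - (lead - h_des))  # headway: track the leading bus at the desired spacing
        + w_smooth * cp.sum_squares(cp.diff(v)))     # smoothness (comfort / energy)
cp.Problem(cp.Minimize(cost), constraints).solve()
print("optimal speed profile [m/s]:", v.value.round(1))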