234 research outputs found

    Linear Programming for Large-Scale Markov Decision Problems

    We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose two techniques, one based on stochastic convex optimization, and one based on constraint sampling. In both cases, we give bounds that show that the performance of our algorithms approaches the best achievable by any policy in the comparison class. Most importantly, these results depend on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queuing application. Comment: 27 pages, 3 figures

    Proximal Point Imitation Learning

    This work develops new algorithms with rigorous efficiency guarantees for infinite-horizon imitation learning (IL) with linear function approximation without restrictive coherence assumptions. We begin with the minimax formulation of the problem and then outline how to leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing, for online and offline IL, respectively. Thanks to PPM, we avoid nested policy evaluation and cost updates for online IL appearing in the prior literature. In particular, we do away with the conventional alternating updates by the optimization of a single convex and smooth objective over both cost and Q-functions. When solved inexactly, we relate the optimization errors to the suboptimality of the recovered policy. As an added bonus, by re-interpreting PPM as dual smoothing with the expert policy as a center point, we also obtain an offline IL algorithm enjoying theoretical guarantees in terms of required expert trajectories. Finally, we achieve convincing empirical performance for both linear and neural network function approximation.

    Privacy and security in cyber-physical systems

    Data privacy has attracted increasing attention in the past decade due to the emerging technologies that require our data to provide utility. Service providers (SPs) encourage users to share their personal data in return for a better user experience. However, users' raw data usually contains implicit sensitive information that can be inferred by a third party. This raises great concern about users' privacy. In this dissertation, we develop novel techniques to achieve a better privacy-utility trade-off (PUT) in various applications. We first consider smart meter (SM) privacy and employ physical resources to minimize the information leakage to the SP through SM readings. We measure privacy using information-theoretic metrics and find private data release policies (PDRPs) by formulating the problem as a Markov decision process (MDP). We also propose noise injection techniques for time-series data privacy. We characterize optimal PDRPs measuring privacy via mutual information (MI) and utility loss via added distortion. Reformulating the problem as an MDP, we solve it using deep reinforcement learning (DRL) for real location trace data. We also consider a scenario for hiding an underlying "sensitive" variable and revealing a "useful" variable for utility by periodically selecting from among sensors to share the measurements with an SP. We formulate this as an optimal stopping problem and solve it using DRL. We then consider privacy-aware communication over a wiretap channel. We maximize the information delivered to the legitimate receiver, while minimizing the information leakage from the sensitive attribute to the eavesdropper. We propose using a variational autoencoder (VAE) and validate our approach with a colored and annotated MNIST dataset. Finally, we consider defenses against active adversaries in the context of security-critical applications. We propose an adversarial example (AE) generation method exploiting the data distribution. We perform adversarial training using the proposed AEs and evaluate the performance against real-world adversarial attacks.

    Efficient Data-Driven Robust Policies for Reinforcement Learning

    Applying the reinforcement learning methodology to domains that involve risky decisions like medicine or robotics requires high confidence in the performance of a policy before its deployment. Markov Decision Processes (MDPs) have served as a well-established model in reinforcement learning (RL). An MDP model assumes that the exact transitional probabilities and rewards are available. However, in most cases, these parameters are unknown and are typically estimated from data, which are inherently prone to errors. Consequently, due to such statistical errors, the resulting computed policy's actual performance is often different from the designer's expectation. In this context, practitioners can either be negligent and ignore parameter uncertainty during decision-making or be pessimistic by planning to be protected against the worst-case scenario. This dissertation focuses on a moderate mindset that strikes a balance between the two contradicting points of view. This objective is also known as the percentile criterion and can be modeled as risk-aversion to epistemic uncertainty. We propose several RL algorithms that efficiently compute reliable policies with limited data that notably improve the policies' performance and alleviate the computational complexity compared to standard risk-averse RL algorithms. Furthermore, we present a fast and robust feature selection method for linear value function approximation, a standard approach to solving reinforcement learning problems with large state spaces. Our experiments show that our technique is faster and more stable than alternative methods.

    Convex Q Learning in a Stochastic Environment: Extended Version

    The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne's celebrated linear programming characterization of optimal control. The main contributions firstly concern properties of the relaxation, described as a deterministic convex program: we identify conditions for a bounded solution, and a significant relationship between the solution to the new convex program and the solution to standard Q-learning. The second set of contributions concerns algorithm design and analysis: (i) A direct model-free method for approximating the convex program for Q-learning shares properties with its ideal. In particular, a bounded solution is ensured subject to a simple property of the basis functions; (ii) The proposed algorithms are convergent and new techniques are introduced to obtain the rate of convergence in a mean-square sense; (iii) The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering "relative" dynamic programming equations; (iv) The theory is illustrated with an application to a classical inventory control problem. Comment: Extended version of "Convex Q-learning in a stochastic environment", IEEE Conference on Decision and Control, 2023 (to appear)

    Risk-sensitive Inverse Reinforcement Learning via Semi- and Non-Parametric Methods

    The literature on Inverse Reinforcement Learning (IRL) typically assumes that humans take actions in order to minimize the expected value of a cost function, i.e., that humans are risk neutral. Yet, in practice, humans are often far from being risk neutral. To fill this gap, the objective of this paper is to devise a framework for risk-sensitive IRL in order to explicitly account for a human's risk sensitivity. To this end, we propose a flexible class of models based on coherent risk measures, which allow us to capture an entire spectrum of risk preferences from risk-neutral to worst-case. We propose efficient non-parametric algorithms based on linear programming and semi-parametric algorithms based on maximum likelihood for inferring a human's underlying risk measure and cost function for a rich class of static and dynamic decision-making settings. The resulting approach is demonstrated on a simulated driving game with ten human participants. Our method is able to infer and mimic a wide range of qualitatively different driving styles from highly risk-averse to risk-neutral in a data-efficient manner. Moreover, comparisons of the Risk-Sensitive (RS) IRL approach with a risk-neutral model show that the RS-IRL framework more accurately captures observed participant behavior both qualitatively and quantitatively, especially in scenarios where catastrophic outcomes such as collisions can occur. Comment: Submitted to International Journal of Robotics Research; Revision 1: (i) Clarified minor technical points; (ii) Revised proof for Theorem 3 to hold under weaker assumptions; (iii) Added additional figures and expanded discussions to improve readability.