Search CORE

234 research outputs found

Linear Programming for Large-Scale Markov Decision Problems

Author: Abbasi-Yadkori Yasin
Bartlett Peter L.
Malek Alan
Publication venue
Publication date: 01/01/2014
Field of study

We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose two techniques, one based on stochastic convex optimization, and one based on constraint sampling. In both cases, we give bounds that show that the performance of our algorithms approaches the best achievable by any policy in the comparison class. Most importantly, these results depend on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queuing application.Comment: 27 pages, 3 figure

arXiv.org e-Print Archive

CiteSeerX

Queensland University of Technology ePrints Archive

Proximal Point Imitation Learning

Author: Cevher Volkan
Kamoutsi Angeliki
Krawczuk Igor
Neu Gergely
Viano Luca
Publication venue
Publication date: 30/05/2023
Field of study

This work develops new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning (IL) with linear function approximation without restrictive coherence assumptions. We begin with the minimax formulation of the problem and then outline how to leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing, for online and offline IL, respectively. Thanks to PPM, we avoid nested policy evaluation and cost updates for online IL appearing in the prior literature. In particular, we do away with the conventional alternating updates by the optimization of a single convex and smooth objective over both cost and Q-functions. When solved inexactly, we relate the optimization errors to the suboptimality of the recovered policy. As an added bonus, by re-interpreting PPM as dual smoothing with the expert policy as a center point, we also obtain an offline IL algorithm enjoying theoretical guarantees in terms of required expert trajectories. Finally, we achieve convincing empirical performance for both linear and neural network function approximation

arXiv.org e-Print Archive

Privacy and security in cyber-physical systems

Author: Erdemir Ecenaz
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/03/2022
Field of study

Data privacy has attracted increasing attention in the past decade due to the emerging technologies that require our data to provide utility. Service providers (SPs) encourage users to share their personal data in return for a better user experience. However, users' raw data usually contains implicit sensitive information that can be inferred by a third party. This raises great concern about users' privacy. In this dissertation, we develop novel techniques to achieve a better privacy-utility trade-off (PUT) in various applications. We first consider smart meter (SM) privacy and employ physical resources to minimize the information leakage to the SP through SM readings. We measure privacy using information-theoretic metrics and find private data release policies (PDRPs) by formulating the problem as a Markov decision process (MDP). We also propose noise injection techniques for time-series data privacy. We characterize optimal PDRPs measuring privacy via mutual information (MI) and utility loss via added distortion. Reformulating the problem as an MDP, we solve it using deep reinforcement learning (DRL) for real location trace data. We also consider a scenario for hiding an underlying ``sensitive'' variable and revealing a ``useful'' variable for utility by periodically selecting from among sensors to share the measurements with an SP. We formulate this as an optimal stopping problem and solve using DRL. We then consider privacy-aware communication over a wiretap channel. We maximize the information delivered to the legitimate receiver, while minimizing the information leakage from the sensitive attribute to the eavesdropper. We propose using a variational-autoencoder (VAE) and validate our approach with colored and annotated MNIST dataset. Finally, we consider defenses against active adversaries in the context of security-critical applications. We propose an adversarial example (AE) generation method exploiting the data distribution. We perform adversarial training using the proposed AEs and evaluate the performance against real-world adversarial attacks.Open Acces

Spiral - Imperial College Digital Repository

Efficient Data-Driven Robust Policies for Reinforcement Learning

Author: Behzadian Bahram
Publication venue: University of New Hampshire Scholars\u27 Repository
Publication date: 01/05/2022
Field of study

Applying the reinforcement learning methodology to domains that involve risky decisions like medicine or robotics requires high confidence in the performance of a policy before its deployment. Markov Decision Processes (MDPs) have served as a well-established model in reinforcement learning (RL). An MDP model assumes that the exact transitional probabilities and rewards are available. However, in most cases, these parameters are unknown and are typically estimated from data, which are inherently prone to errors. Consequently, due to such statistical errors, the resulting computed policy\u27s actual performance is often different from the designer\u27s expectation. In this context, practitioners can either be negligent and ignore parameter uncertainty during decision-making or be pessimistic by planning to be protected against the worst-case scenario. This dissertation focuses on a moderate mindset that strikes a balance between the two contradicting points of view. This objective is also known as the percentile criterion and can be modeled as risk-aversion to epistemic uncertainty. We propose several RL algorithms that efficiently compute reliable policies with limited data that notably improve the policies\u27 performance and alleviate the computational complexity compared to standard risk-averse RL algorithms. Furthermore, we present a fast and robust feature selection method for linear value function approximation, a standard approach to solving reinforcement learning problems with large state spaces. Our experiments show that our technique is faster and more stable than alternative methods

UNH Scholars' Repository

Recommended from our members

Approximate dynamic programming for large scale systems

Author: Desai Vijay V.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2012
Field of study

Sequential decision making under uncertainty is at the heart of a wide variety of practical problems. These problems can be cast as dynamic programs and the optimal value function can be computed by solving Bellman's equation. However, this approach is limited in its applicability. As the number of state variables increases, the state space size grows exponentially, a phenomenon known as the curse of dimensionality, rendering the standard dynamic programming approach impractical. An effective way of addressing curse of dimensionality is through parameterized value function approximation. Such an approximation is determined by relatively small number of parameters and serves as an estimate of the optimal value function. But in order for this approach to be effective, we need Approximate Dynamic Programming (ADP) algorithms that can deliver `good' approximation to the optimal value function and such an approximation can then be used to derive policies for effective decision-making. From a practical standpoint, in order to assess the effectiveness of such an approximation, there is also a need for methods that give a sense for the suboptimality of a policy. This thesis is an attempt to address both these issues. First, we introduce a new ADP algorithm based on linear programming, to compute value function approximations. LP approaches to approximate DP have typically relied on a natural `projection' of a well studied linear program for exact dynamic programming. Such programs restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program -- the `smoothed approximate linear program' -- is distinct from such approaches and relaxes the restriction to lower bounding approximations in an appropriate fashion while remaining computationally tractable. The resulting program enjoys strong approximation guarantees and is shown to perform well in numerical experiments with the game of Tetris and queueing network control problem. Next, we consider optimal stopping problems with applications to pricing of high-dimensional American options. We introduce the pathwise optimization (PO) method: a new convex optimization procedure to produce upper and lower bounds on the optimal value (the `price') of high-dimensional optimal stopping problems. The PO method builds on a dual characterization of optimal stopping problems as optimization problems over the space of martingales, which we dub the martingale duality approach. We demonstrate via numerical experiments that the PO method produces upper bounds and lower bounds (via suboptimal exercise policies) of a quality comparable with state-of-the-art approaches. Further, we develop an approximation theory relevant to martingale duality approaches in general and the PO method in particular. Finally, we consider a broad class of MDPs and introduce a new tractable method for computing bounds by consider information relaxation and introducing penalty. The method delivers tight bounds by identifying the best penalty function among a parameterized class of penalty functions. We implement our method on a high-dimensional financial application, namely, optimal execution and demonstrate the practical value of the method vis-a-vis competing methods available in the literature. In addition, we provide theory to show that bounds generated by our method are provably tighter than some of the other available approaches

Columbia University Academic Commons

Boosted Fitted Q-Iteration

Author: C. D'Eramo
M. Pirotta
M. Restelli
TOSATTO SAMUELE
Publication venue: PMLR
Publication date: 01/01/2017
Field of study

Archivio istituzionale della ricerca - Politecnico di Milano

Convex Q Learning in a Stochastic Environment: Extended Version

Author: Lu Fan
Meyn Sean
Publication venue
Publication date: 10/09/2023
Field of study

The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne's celebrated linear programming characterization of optimal control. The main contributions firstly concern properties of the relaxation, described as a deterministic convex program: we identify conditions for a bounded solution, and a significant relationship between the solution to the new convex program, and the solution to standard Q-learning. The second set of contributions concern algorithm design and analysis: (i) A direct model-free method for approximating the convex program for Q-learning shares properties with its ideal. In particular, a bounded solution is ensured subject to a simple property of the basis functions; (ii) The proposed algorithms are convergent and new techniques are introduced to obtain the rate of convergence in a mean-square sense; (iii) The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering ``relative'' dynamic programming equations; (iv) The theory is illustrated with an application to a classical inventory control problem.Comment: Extended version of "Convex Q-learning in a stochastic environment", IEEE Conference on Decision and Control, 2023 (to appear

arXiv.org e-Print Archive

Risk-sensitive Inverse Reinforcement Learning via Semi- and Non-Parametric Methods

Author: Lacotte Jonathan
Majumdar Anirudha
Pavone Marco
Singh Sumeet
Publication venue
Publication date: 01/01/2018
Field of study

The literature on Inverse Reinforcement Learning (IRL) typically assumes that humans take actions in order to minimize the expected value of a cost function, i.e., that humans are risk neutral. Yet, in practice, humans are often far from being risk neutral. To fill this gap, the objective of this paper is to devise a framework for risk-sensitive IRL in order to explicitly account for a human's risk sensitivity. To this end, we propose a flexible class of models based on coherent risk measures, which allow us to capture an entire spectrum of risk preferences from risk-neutral to worst-case. We propose efficient non-parametric algorithms based on linear programming and semi-parametric algorithms based on maximum likelihood for inferring a human's underlying risk measure and cost function for a rich class of static and dynamic decision-making settings. The resulting approach is demonstrated on a simulated driving game with ten human participants. Our method is able to infer and mimic a wide range of qualitatively different driving styles from highly risk-averse to risk-neutral in a data-efficient manner. Moreover, comparisons of the Risk-Sensitive (RS) IRL approach with a risk-neutral model show that the RS-IRL framework more accurately captures observed participant behavior both qualitatively and quantitatively, especially in scenarios where catastrophic outcomes such as collisions can occur.Comment: Submitted to International Journal of Robotics Research; Revision 1: (i) Clarified minor technical points; (ii) Revised proof for Theorem 3 to hold under weaker assumptions; (iii) Added additional figures and expanded discussions to improve readabilit

arXiv.org e-Print Archive

Princeton University Open Access Repository