Linear Programming for Large-Scale Markov Decision Problems
We consider the problem of controlling a Markov decision process (MDP) with a
large state space, so as to minimize average cost. Since it is intractable to
compete with the optimal policy for large scale problems, we pursue the more
modest goal of competing with a low-dimensional family of policies. We use the
dual linear programming formulation of the MDP average cost problem, in which
the variable is a stationary distribution over state-action pairs, and we
consider a neighborhood of a low-dimensional subset of the set of stationary
distributions (defined in terms of state-action features) as the comparison
class. We propose two techniques, one based on stochastic convex optimization,
and one based on constraint sampling. In both cases, we give bounds that show
that the performance of our algorithms approaches the best achievable by any
policy in the comparison class. Most importantly, these results depend on the
size of the comparison class, but not on the size of the state space.
Preliminary experiments show the effectiveness of the proposed algorithms in a
queuing application. Comment: 27 pages, 3 figures
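To make the dual variable in the abstract above concrete, here is a minimal sketch. It is not the authors' algorithm. For a fixed policy on a hypothetical two-state, two-action MDP, it computes the stationary distribution over state-action pairs by power iteration, checks the flow constraint that every feasible point of the dual LP satisfies, and evaluates the average cost as a linear function of that distribution. All transition probabilities, costs, and function names are illustrative assumptions.

```python
# Hypothetical 2-state, 2-action MDP; every number here is made up for illustration.
P = {  # P[(x, a)][y] = probability of moving to state y from state x under action a
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.1, 0.9],
}
cost = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}
policy = {0: [0.5, 0.5], 1: [1.0, 0.0]}  # policy[x][a] = probability of action a in state x


def stationary_state_action(P, policy, n_states=2, n_actions=2, iters=2000):
    """Power iteration on the chain induced by `policy`; returns mu[(x, a)]."""
    nu = [1.0 / n_states] * n_states  # start from the uniform state distribution
    for _ in range(iters):
        nxt = [0.0] * n_states
        for x in range(n_states):
            for a in range(n_actions):
                weight = nu[x] * policy[x][a]
                for y in range(n_states):
                    nxt[y] += weight * P[(x, a)][y]
        nu = nxt
    return {(x, a): nu[x] * policy[x][a]
            for x in range(n_states) for a in range(n_actions)}


def flow_residual(mu, P, n_states=2, n_actions=2):
    """Largest violation of: sum_a mu(y, a) = sum_{x, a} mu(x, a) * P(y | x, a)."""
    worst = 0.0
    for y in range(n_states):
        outflow = sum(mu[(y, a)] for a in range(n_actions))
        inflow = sum(mu[(x, a)] * P[(x, a)][y]
                     for x in range(n_states) for a in range(n_actions))
        worst = max(worst, abs(outflow - inflow))
    return worst


def average_cost(mu, cost):
    """Average cost is linear in the state-action distribution."""
    return sum(mu[key] * cost[key] for key in mu)


mu = stationary_state_action(P, policy)
print("flow residual:", flow_residual(mu, P))
print("average cost:", average_cost(mu, cost))
```

The dual LP would minimize `average_cost` over all distributions whose `flow_residual` is zero. The sketch only evaluates one feasible point.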
Proximal Point Imitation Learning
This work develops new algorithms with rigorous efficiency guarantees for
infinite horizon imitation learning (IL) with linear function approximation
without restrictive coherence assumptions. We begin with the minimax
formulation of the problem and then outline how to leverage classical tools
from optimization, in particular, the proximal-point method (PPM) and dual
smoothing, for online and offline IL, respectively. Thanks to PPM, we avoid
nested policy evaluation and cost updates for online IL appearing in the prior
literature. In particular, we do away with the conventional alternating updates
by the optimization of a single convex and smooth objective over both cost and
Q-functions. When solved inexactly, we relate the optimization errors to the
suboptimality of the recovered policy. As an added bonus, by re-interpreting
PPM as dual smoothing with the expert policy as a center point, we also obtain
an offline IL algorithm enjoying theoretical guarantees in terms of required
expert trajectories. Finally, we achieve convincing empirical performance for
both linear and neural network function approximation.
Privacy and security in cyber-physical systems
Data privacy has attracted increasing attention in the past decade due to the emerging technologies that require our data to provide utility. Service providers (SPs) encourage users to share their personal data in return for a better user experience. However, users' raw data usually contains implicit sensitive information that can be inferred by a third party. This raises great concern about users' privacy.
In this dissertation, we develop novel techniques to achieve a better privacy-utility trade-off (PUT) in various applications. We first consider smart meter (SM) privacy and employ physical resources to minimize the information leakage to the SP through SM readings. We measure privacy using information-theoretic metrics and find private data release policies (PDRPs) by formulating the problem as a Markov decision process (MDP).
We also propose noise injection techniques for time-series data privacy. We characterize optimal PDRPs measuring privacy via mutual information (MI) and utility loss via added distortion. Reformulating the problem as an MDP, we solve it using deep reinforcement learning (DRL) for real location trace data.
We also consider a scenario for hiding an underlying "sensitive" variable and revealing a "useful" variable for utility by periodically selecting from among sensors to share the measurements with an SP.
We formulate this as an optimal stopping problem and solve using DRL. We then consider privacy-aware communication over a wiretap channel. We maximize the information delivered to the legitimate receiver, while minimizing the information leakage from the sensitive attribute to the eavesdropper.
We propose using a variational autoencoder (VAE) and validate our approach with a colored and annotated MNIST dataset.
Finally, we consider defenses against active adversaries in the context of security-critical applications. We propose an adversarial example (AE) generation method exploiting the data distribution. We perform adversarial training using the proposed AEs and evaluate the performance against real-world adversarial attacks.
Efficient Data-Driven Robust Policies for Reinforcement Learning
Applying the reinforcement learning methodology to domains that involve risky decisions like medicine or robotics requires high confidence in the performance of a policy before its deployment. Markov Decision Processes (MDPs) have served as a well-established model in reinforcement learning (RL). An MDP model assumes that the exact transitional probabilities and rewards are available. However, in most cases, these parameters are unknown and are typically estimated from data, which are inherently prone to errors. Consequently, due to such statistical errors, the resulting computed policy's actual performance is often different from the designer's expectation. In this context, practitioners can either be negligent and ignore parameter uncertainty during decision-making or be pessimistic by planning to be protected against the worst-case scenario. This dissertation focuses on a moderate mindset that strikes a balance between the two contradicting points of view. This objective is also known as the percentile criterion and can be modeled as risk-aversion to epistemic uncertainty. We propose several RL algorithms that efficiently compute reliable policies with limited data that notably improve the policies' performance and alleviate the computational complexity compared to standard risk-averse RL algorithms. Furthermore, we present a fast and robust feature selection method for linear value function approximation, a standard approach to solving reinforcement learning problems with large state spaces. Our experiments show that our technique is faster and more stable than alternative methods.
Approximate dynamic programming for large scale systems
Sequential decision making under uncertainty is at the heart of a wide variety of practical problems. These problems can be cast as dynamic programs and the optimal value function can be computed by solving Bellman's equation. However, this approach is limited in its applicability. As the number of state variables increases, the state space size grows exponentially, a phenomenon known as the curse of dimensionality, rendering the standard dynamic programming approach impractical. An effective way of addressing the curse of dimensionality is through parameterized value function approximation. Such an approximation is determined by a relatively small number of parameters and serves as an estimate of the optimal value function. But in order for this approach to be effective, we need Approximate Dynamic Programming (ADP) algorithms that can deliver a "good" approximation to the optimal value function, which can then be used to derive policies for effective decision-making. From a practical standpoint, in order to assess the effectiveness of such an approximation, there is also a need for methods that give a sense of the suboptimality of a policy. This thesis is an attempt to address both these issues. First, we introduce a new ADP algorithm based on linear programming to compute value function approximations. LP approaches to approximate DP have typically relied on a natural "projection" of a well-studied linear program for exact dynamic programming. Such programs restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program, the "smoothed approximate linear program", is distinct from such approaches and relaxes the restriction to lower bounding approximations in an appropriate fashion while remaining computationally tractable. The resulting program enjoys strong approximation guarantees and is shown to perform well in numerical experiments with the game of Tetris and a queueing network control problem.
Next, we consider optimal stopping problems with applications to pricing of high-dimensional American options. We introduce the pathwise optimization (PO) method: a new convex optimization procedure to produce upper and lower bounds on the optimal value (the "price") of high-dimensional optimal stopping problems. The PO method builds on a dual characterization of optimal stopping problems as optimization problems over the space of martingales, which we dub the martingale duality approach. We demonstrate via numerical experiments that the PO method produces upper bounds and lower bounds (via suboptimal exercise policies) of a quality comparable with state-of-the-art approaches. Further, we develop an approximation theory relevant to martingale duality approaches in general and the PO method in particular. Finally, we consider a broad class of MDPs and introduce a new tractable method for computing bounds by considering information relaxation and introducing a penalty. The method delivers tight bounds by identifying the best penalty function among a parameterized class of penalty functions. We implement our method on a high-dimensional financial application, namely optimal execution, and demonstrate the practical value of the method vis-a-vis competing methods available in the literature. In addition, we provide theory to show that bounds generated by our method are provably tighter than some of the other available approaches.
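The abstract above starts from the exact approach of solving Bellman's equation, which is only feasible when the state space is small. As a rough illustration of that baseline, and not of the smoothed approximate linear program or the PO method, the following sketch runs value iteration on a hypothetical two-state, two-action MDP. The discounted-cost criterion, the discount factor `gamma`, and all numbers and function names are assumptions chosen for the example.

```python
# Hypothetical MDP: P[(x, a)][y] is a transition probability, cost[(x, a)] a stage cost.
P = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.1, 0.9],
}
cost = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}


def bellman_backup(J, P, cost, n_states, n_actions, gamma):
    """One application of the Bellman operator for discounted cost minimization."""
    return [
        min(
            cost[(x, a)] + gamma * sum(P[(x, a)][y] * J[y] for y in range(n_states))
            for a in range(n_actions)
        )
        for x in range(n_states)
    ]


def value_iteration(P, cost, n_states, n_actions, gamma=0.9, tol=1e-10):
    """Iterate the Bellman operator until successive iterates differ by less than tol."""
    J = [0.0] * n_states
    while True:
        J_new = bellman_backup(J, P, cost, n_states, n_actions, gamma)
        if max(abs(u - v) for u, v in zip(J_new, J)) < tol:
            return J_new
        J = J_new


J_star = value_iteration(P, cost, n_states=2, n_actions=2)
print("approximate optimal cost-to-go:", J_star)
```

Each backup loops over every state, so the work grows with the size of the state space. That growth is the curse of dimensionality that motivates parameterized approximations.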
Convex Q Learning in a Stochastic Environment: Extended Version
The paper introduces the first formulation of convex Q-learning for Markov
decision processes with function approximation. The algorithms and theory rest
on a relaxation of a dual of Manne's celebrated linear programming
characterization of optimal control. The main contributions firstly concern
properties of the relaxation, described as a deterministic convex program: we
identify conditions for a bounded solution, and a significant relationship
between the solution to the new convex program, and the solution to standard
Q-learning. The second set of contributions concerns algorithm design and
analysis: (i) A direct model-free method for approximating the convex program
for Q-learning shares properties with its ideal. In particular, a bounded
solution is ensured subject to a simple property of the basis functions; (ii)
The proposed algorithms are convergent and new techniques are introduced to
obtain the rate of convergence in a mean-square sense; (iii) The approach can
be generalized to a range of performance criteria, and it is found that
variance can be reduced by considering "relative" dynamic programming
equations; (iv) The theory is illustrated with an application to a classical
inventory control problem. Comment: Extended version of "Convex Q-learning in a stochastic environment",
IEEE Conference on Decision and Control, 2023 (to appear)
Risk-sensitive Inverse Reinforcement Learning via Semi- and Non-Parametric Methods
The literature on Inverse Reinforcement Learning (IRL) typically assumes that
humans take actions in order to minimize the expected value of a cost function,
i.e., that humans are risk neutral. Yet, in practice, humans are often far from
being risk neutral. To fill this gap, the objective of this paper is to devise
a framework for risk-sensitive IRL in order to explicitly account for a human's
risk sensitivity. To this end, we propose a flexible class of models based on
coherent risk measures, which allow us to capture an entire spectrum of risk
preferences from risk-neutral to worst-case. We propose efficient
non-parametric algorithms based on linear programming and semi-parametric
algorithms based on maximum likelihood for inferring a human's underlying risk
measure and cost function for a rich class of static and dynamic
decision-making settings. The resulting approach is demonstrated on a simulated
driving game with ten human participants. Our method is able to infer and mimic
a wide range of qualitatively different driving styles from highly risk-averse
to risk-neutral in a data-efficient manner. Moreover, comparisons of the
Risk-Sensitive (RS) IRL approach with a risk-neutral model show that the RS-IRL
framework more accurately captures observed participant behavior both
qualitatively and quantitatively, especially in scenarios where catastrophic
outcomes such as collisions can occur. Comment: Submitted to International Journal of Robotics Research; Revision 1:
(i) Clarified minor technical points; (ii) Revised proof for Theorem 3 to
hold under weaker assumptions; (iii) Added additional figures and expanded
discussions to improve readability
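The abstract above describes coherent risk measures as spanning a spectrum from risk-neutral to worst-case. One well-known coherent risk measure with that property is conditional value-at-risk (CVaR). The abstract does not say which measures the paper uses, so CVaR is chosen here purely for illustration. The sketch below is a simple empirical estimate that averages the worst `alpha` fraction of sampled costs. The function name, the rounding rule, and the sample values are assumptions.

```python
import math


def empirical_cvar(costs, alpha):
    """Average of the worst ceil(alpha * n) sampled costs (higher cost = worse).

    alpha = 1 recovers the sample mean (risk-neutral); as alpha shrinks toward 0
    the estimate approaches the sample maximum (worst case). Rounding the tail
    size up to a whole number of samples is a simplification.
    """
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    k = max(1, math.ceil(alpha * len(costs)))
    worst = sorted(costs, reverse=True)[:k]
    return sum(worst) / k


samples = [1.0, 2.0, 3.0, 4.0]  # hypothetical cost samples
for alpha in (1.0, 0.5, 0.25):
    print(alpha, empirical_cvar(samples, alpha))
```

With these four samples the estimate moves from the mean (2.5) at `alpha = 1.0`, through 3.5 at `alpha = 0.5`, to the maximum (4.0) at `alpha = 0.25`.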