Differentially Private Policy Evaluation
We present the first differentially private algorithms for reinforcement
learning, which apply to the task of evaluating a fixed policy. We establish
two approaches for achieving differential privacy, provide a theoretical
analysis of the privacy and utility of the two algorithms, and show promising
results on simple empirical examples.
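The abstract does not fix a mechanism, so the following is a minimal sketch of one natural instantiation: Monte Carlo policy evaluation released through the Gaussian mechanism (output perturbation). The sensitivity bound, the per-state treatment, and the omission of composition across states are simplifying assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def dp_monte_carlo_value(returns_per_state, epsilon, delta, return_bound):
    """Release noisy per-state value estimates computed from observed returns."""
    rng = np.random.default_rng()
    values = {}
    for state, returns in returns_per_state.items():
        n = len(returns)
        # Replacing one trajectory's return (bounded in [-B, B]) moves the
        # empirical mean by at most 2B/n, the sensitivity assumed here.
        sensitivity = 2.0 * return_bound / n
        # Standard Gaussian-mechanism calibration for (epsilon, delta)-DP.
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
        values[state] = float(np.mean(returns) + rng.normal(0.0, sigma))
    return values

# Toy usage: returns gathered by running a fixed policy from two states.
rng = np.random.default_rng(0)
data = {"s0": rng.uniform(-1, 1, 50), "s1": rng.uniform(-1, 1, 80)}
print(dp_monte_carlo_value(data, epsilon=1.0, delta=1e-5, return_bound=1.0))
```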
Actor Critic with Differentially Private Critic
Reinforcement learning algorithms are known to be sample inefficient, and
often performance on one task can be substantially improved by leveraging
information (e.g., via pre-training) on other related tasks. In this work, we
propose a technique to achieve such knowledge transfer in cases where agent
trajectories contain sensitive or private information, such as in the
healthcare domain. Our approach leverages a differentially private policy
evaluation algorithm to initialize an actor-critic model and improve the
effectiveness of learning in downstream tasks. We empirically show this
technique increases sample efficiency in resource-constrained control problems
while preserving the privacy of trajectories collected in an upstream task.
Comment: 6 pages. Presented at the Privacy in Machine Learning Workshop, NeurIPS 2019.
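As a rough illustration of the transfer recipe, the sketch below initializes a critic from weights assumed to come from the upstream differentially private policy-evaluation step, then runs ordinary advantage actor-critic updates downstream. The architectures, the dp_critic_state_dict placeholder, and the update rule are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Hypothetical handoff point: weights produced by a differentially private
# policy-evaluation procedure on the sensitive upstream task would be loaded
# here, e.g. critic.load_state_dict(dp_critic_state_dict).

opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def actor_critic_step(obs, action, reward, next_obs, gamma=0.99):
    """One advantage actor-critic update on (non-sensitive) downstream data."""
    value, next_value = critic(obs), critic(next_obs).detach()
    advantage = reward + gamma * next_value - value
    log_prob = torch.log_softmax(actor(obs), dim=-1).gather(-1, action)
    # Policy term uses the detached advantage; value term fits the critic.
    loss = -(log_prob * advantage.detach()).mean() + advantage.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

batch = 8
obs, next_obs = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
action, reward = torch.randint(0, act_dim, (batch, 1)), torch.randn(batch, 1)
print(actor_critic_step(obs, action, reward, next_obs))
```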
Privacy-preserving Q-Learning with Functional Noise in Continuous State Spaces
We consider differentially private algorithms for reinforcement learning in
continuous spaces, such that neighboring reward functions are
indistinguishable. This protects the reward information from being exploited by
methods such as inverse reinforcement learning. Existing studies that guarantee
differential privacy do not extend to infinite state spaces, because the noise
level required to ensure privacy grows to infinity with the number of states.
Our aim is to protect the value function approximator regardless of the number
of states queried. This is achieved by iteratively adding functional noise to
the value function during training. We show rigorous privacy guarantees by a
series of analyses on the kernel of the noise space, the probabilistic bound of
such noise samples, and the composition over the iterations. We gain insight
into the utility analysis by proving the algorithm's approximate optimality
when the state space is discrete. Experiments corroborate our theoretical
findings and show improvement over existing approaches.
Comment: Advances in Neural Information Processing Systems (NeurIPS) 2019.
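The key idea, sampling one noise function per iteration rather than independent noise per query, can be sketched as follows. A Gaussian-process draw is approximated with random Fourier features; the kernel, lengthscale, and noise scale are illustrative assumptions rather than the paper's calibrated noise level.

```python
import numpy as np

def sample_noise_function(dim, sigma, lengthscale=1.0, n_features=256, rng=None):
    """Return f: state -> noise, an approximate draw from GP(0, sigma^2 * RBF).

    Uses random Fourier features: w ~ N(0, I/lengthscale^2), uniform phases b,
    and Gaussian amplitudes a, so f(s) approximates a smooth random function.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(0.0, 1.0 / lengthscale, size=(n_features, dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    a = rng.normal(0.0, 1.0, size=n_features)
    scale = sigma * np.sqrt(2.0 / n_features)
    return lambda s: scale * float(a @ np.cos(w @ np.asarray(s) + b))

# One function is drawn per training iteration and added to the value
# estimate everywhere, so the noise does not grow with the states queried.
noise = sample_noise_function(dim=2, sigma=0.5, rng=np.random.default_rng(1))
print(noise([0.3, -1.2]), noise([0.3, -1.2]))  # same state -> same noise
```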
Preventing Imitation Learning with Adversarial Policy Ensembles
Imitation learning can reproduce policies by observing experts, which raises a
policy-privacy concern: policies, whether demonstrated by humans or running on
deployed robots, can be cloned without their owners' consent. How can we
protect against external observers cloning our proprietary policies? To answer
this question we introduce a new reinforcement learning framework, where we
train an ensemble of near-optimal policies, whose demonstrations are guaranteed
to be useless for an external observer. We formulate this idea by a constrained
optimization problem, where the objective is to improve proprietary policies,
and at the same time deteriorate the virtual policy of an eventual external
observer. We design a tractable algorithm to solve this new optimization
problem by modifying the standard policy gradient algorithm. Our formulation
can be interpreted through the lenses of confidentiality and adversarial
behaviour, which offers a broader perspective on this work. We demonstrate the existence
of "non-clonable" ensembles, providing a solution to the above optimization
problem, computed by our modified policy gradient algorithm. To our knowledge,
this is the first work on protecting policies in Reinforcement Learning.
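The constrained objective can be caricatured as a penalized policy gradient: ascend the ensemble's own return while descending the return a virtual observer would obtain after cloning the demonstrations. The score-function surrogate below is a stand-in; how the observer's log-probabilities depend on the ensemble (the heart of the paper's derivation) is abstracted into input tensors here.

```python
import torch

def penalized_surrogate(ensemble_logps, ensemble_returns,
                        observer_logps, observer_returns, alpha=1.0):
    """Surrogate whose gradient improves the ensemble and hurts the observer.

    *_logps: log-probabilities of the actions actually taken;
    *_returns: the matching empirical returns (REINFORCE-style estimator).
    """
    own = (ensemble_logps * ensemble_returns).mean()
    cloned = (observer_logps * observer_returns).mean()
    # Ascending `own - alpha * cloned` improves the proprietary policies while
    # deteriorating the virtual policy of an eventual external observer.
    return own - alpha * cloned

logps = torch.randn(32, requires_grad=True)
surrogate = penalized_surrogate(logps, torch.randn(32),
                                torch.randn(32), torch.randn(32))
surrogate.backward()  # gradients flow only through the ensemble terms here
print(logps.grad.shape)
```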
Privacy-Preserving Resilience of Cyber-Physical Systems to Adversaries
A cyber-physical system (CPS) is expected to be resilient to more than one
type of adversary. In this paper, we consider a CPS that has to satisfy a
linear temporal logic (LTL) objective in the presence of two kinds of
adversaries. The first adversary has the ability to tamper with inputs to the
CPS to influence satisfaction of the LTL objective. The interaction of the CPS
with this adversary is modeled as a stochastic game. We synthesize a controller
for the CPS to maximize the probability of satisfying the LTL objective under
any policy of this adversary. The second adversary is an eavesdropper who can
observe labeled trajectories of the CPS generated under the controller
synthesized in the previous step, and could use this information to launch
other kinds of attacks. A labeled
trajectory is a sequence of labels, where a label is associated with a state and
is linked to the satisfaction of the LTL objective at that state. We use
differential privacy to quantify the indistinguishability between states that
are related to each other when the eavesdropper sees a labeled trajectory. Two
trajectories of equal length will be differentially private if they are
differentially private at each state along the respective trajectories. To
quantify differential privacy, we use a skewed Kantorovich metric to compute
distances between the probability distributions over states that result from
policy actions taken at related states. Moreover, we do
this in a manner that does not affect the satisfaction probability of the LTL
objective. We validate our approach on a simulation of a UAV that has to
satisfy an LTL objective in an adversarial environment.
Comment: Accepted to the IEEE Conference on Decision and Control (CDC), 2020.
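The per-state privacy requirement quoted above can be spelled out with a toy check. The sketch below replaces the paper's skewed Kantorovich computation with a direct pointwise ratio test on per-step label distributions, a simplifying assumption for illustration.

```python
import math

def pointwise_indistinguishable(dist_a, dist_b, epsilon):
    """dist_a, dist_b: dicts mapping label -> probability at one step."""
    bound = math.exp(epsilon)
    for lbl in set(dist_a) | set(dist_b):
        p, q = dist_a.get(lbl, 0.0), dist_b.get(lbl, 0.0)
        # Pure-DP style multiplicative closeness in both directions.
        if p > bound * q or q > bound * p:
            return False
    return True

def trajectories_private(traj_a, traj_b, epsilon):
    """Equal-length trajectories are private iff every step is private."""
    return len(traj_a) == len(traj_b) and all(
        pointwise_indistinguishable(a, b, epsilon)
        for a, b in zip(traj_a, traj_b))

ta = [{"safe": 0.7, "unsafe": 0.3}, {"safe": 0.6, "unsafe": 0.4}]
tb = [{"safe": 0.65, "unsafe": 0.35}, {"safe": 0.55, "unsafe": 0.45}]
print(trajectories_private(ta, tb, epsilon=0.5))
```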
Private Reinforcement Learning with PAC and Regret Guarantees
Motivated by high-stakes decision-making domains like personalized medicine
where user information is inherently sensitive, we design privacy preserving
exploration policies for episodic reinforcement learning (RL). We first provide
a meaningful privacy formulation using the notion of joint differential privacy
(JDP)--a strong variant of differential privacy for settings where each user
receives their own set of outputs (e.g., policy recommendations). We then
develop a private optimism-based learning algorithm that simultaneously
achieves strong PAC and regret bounds, and enjoys a JDP guarantee. Our
algorithm only pays for a moderate privacy cost on exploration: in comparison
to the non-private bounds, the privacy parameter only appears in lower-order
terms. Finally, we present lower bounds on sample complexity and regret for
reinforcement learning subject to JDP.
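As a crude illustration of the kind of private optimism described, the sketch below feeds noise-perturbed counts and reward sums into UCB-style optimistic values. Plain per-release Laplace noise stands in for the paper's JDP accounting, and the bonus form is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(x, epsilon):
    """Laplace-perturbed release of a statistic (toy stand-in for JDP accounting)."""
    return x + rng.laplace(0.0, 1.0 / epsilon, size=np.shape(x))

def optimistic_q(reward_sums, counts, epsilon, horizon):
    n = np.maximum(noisy(counts, epsilon), 1.0)        # noisy visit counts
    r_hat = np.clip(noisy(reward_sums, epsilon) / n, 0.0, 1.0)
    bonus = horizon * np.sqrt(np.log(1e3) / n)         # optimism from noisy n
    return np.minimum(r_hat + bonus, float(horizon))

counts = np.array([[5.0, 2.0], [9.0, 1.0]])            # per (state, action)
reward_sums = np.array([[3.0, 1.0], [7.0, 0.5]])
print(optimistic_q(reward_sums, counts, epsilon=1.0, horizon=5))
```

Because the noise scale is fixed per release while counts grow with data, its effect on the bonuses shrinks over time, mirroring how the privacy parameter enters only lower-order terms in the paper's bounds.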
Local Differential Privacy for Regret Minimization in Reinforcement Learning
Reinforcement learning algorithms are widely used in domains where it is
desirable to provide a personalized service. In these domains it is common that
user data contains sensitive information that needs to be protected from third
parties. Motivated by this, we study privacy in the context of finite-horizon
Markov Decision Processes (MDPs) by requiring information to be obfuscated on
the user side. We formulate this notion of privacy for RL by leveraging the
local differential privacy (LDP) framework. We establish a lower bound for
regret minimization in finite-horizon MDPs with LDP guarantees which shows that
guaranteeing privacy has a multiplicative effect on the regret. This result
shows that while LDP is an appealing notion of privacy, it makes the learning
problem significantly more complex. Finally, we present an optimistic algorithm
that simultaneously satisfies ε-LDP requirements, and achieves √K/ε regret in
any finite-horizon MDP after K episodes, matching the lower bound dependency on
the number of episodes K.
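The defining feature of the local model is that obfuscation happens before data leaves the user. Below is a minimal sketch, assuming a Laplace mechanism and a simple half/half budget split over the two released statistics (both assumptions for illustration).

```python
import numpy as np

def ldp_release(visit_indicator, rewards, epsilon, reward_bound=1.0, rng=None):
    """Privatize one user's episode statistics before they leave the device.

    visit_indicator: 0/1 array over (state, action) pairs visited this episode.
    rewards: same-shape array of collected rewards, each assumed to lie in
    [0, reward_bound]. Half the budget goes to each statistic (assumed split).
    """
    rng = rng or np.random.default_rng()
    eps_half = epsilon / 2.0
    noisy_visits = visit_indicator + rng.laplace(
        0.0, 1.0 / eps_half, visit_indicator.shape)
    noisy_rewards = rewards + rng.laplace(
        0.0, reward_bound / eps_half, rewards.shape)
    return noisy_visits, noisy_rewards

# The learner only ever sees the privatized statistics returned here.
episode_visits = np.array([1.0, 0.0, 1.0, 0.0])   # toy: 4 (state, action) pairs
episode_rewards = np.array([0.8, 0.0, 0.3, 0.0])
print(ldp_release(episode_visits, episode_rewards, epsilon=2.0))
```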