66 research outputs found
Bayesian Reinforcement Learning via Deep, Sparse Sampling
We address the problem of Bayesian reinforcement learning using efficient
model-based online planning. We propose an optimism-free Bayes-adaptive
algorithm that induces deeper and sparser exploration, with a theoretical bound on
its performance relative to the Bayes-optimal policy and a lower
computational complexity. The main novelty is the use of a candidate policy
generator to generate long-term options in the planning tree (over beliefs),
which allows us to create much sparser and deeper trees. Experimental results
on different environments show that in comparison to the state-of-the-art, our
algorithm is both computationally more efficient, and obtains significantly
higher reward in discrete environments.
Comment: Published in AISTATS 2020
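The planning recipe lends itself to a compact illustration. Below is a minimal, hypothetical sketch of deep-but-sparse planning over beliefs: a handful of posterior-sampled models score a few long-horizon candidate policies (options), instead of expanding every primitive action with optimistic bonuses. The `BetaChainBelief` toy and the two options are illustrative stand-ins, not the paper's environments or its candidate policy generator.

```python
import random

# Toy posterior over a 6-state chain: action 1 moves right with unknown
# probability p, whose posterior is Beta(alpha, beta). This belief class
# and the options below are illustrative stand-ins, not the paper's.
class BetaChainBelief:
    def __init__(self, alpha=2.0, beta=2.0):
        self.alpha, self.beta = alpha, beta

    def sample_model(self):
        p = random.betavariate(self.alpha, self.beta)
        def step(state, action):
            if action == 1 and random.random() < p:
                state = min(state + 1, 5)
            return state, (1.0 if state == 5 else 0.0)
        return step

def rollout(step, state, policy, horizon, gamma):
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        state, reward = step(state, policy(state))
        total += disc * reward
        disc *= gamma
    return total

def deep_sparse_plan(belief, state, options, n_samples=4, horizon=40,
                     gamma=0.95):
    # Sparse in posterior samples, deep in rollout horizon, and
    # optimism-free: options are scored by plain Monte Carlo averages.
    def score(policy):
        return sum(rollout(belief.sample_model(), state, policy,
                           horizon, gamma)
                   for _ in range(n_samples)) / n_samples
    return max(options, key=lambda name: score(options[name]))

options = {"stay": lambda s: 0, "advance": lambda s: 1}
print(deep_sparse_plan(BetaChainBelief(), 0, options))  # -> "advance"
```

Scoring long-term options rather than one-step actions is what keeps the tree both sparse (few samples per node) and deep (long rollouts), at the cost of depending on the quality of the candidate policies.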
Marich: A Query-efficient Distributionally Equivalent Model Extraction Attack using Public Data
We study black-box model stealing attacks where the attacker can query a
machine learning model only through publicly available APIs. Specifically, our
aim is to design a black-box model extraction attack that uses a minimal number
of queries to create an informative and distributionally equivalent replica of
the target model. First, we define distributionally equivalent and
max-information model extraction attacks. Then, we reduce both attacks to
a variational optimisation problem. The attacker solves this problem to select
the most informative queries that simultaneously maximise the entropy and
reduce the mismatch between the target and the stolen models. This leads us to
an active sampling-based query selection algorithm, Marich. We evaluate Marich
on different text and image data sets, and different models, including BERT and
ResNet18. Marich is able to extract models that achieve a large fraction of the
true model's accuracy while using only a small number of samples from the
publicly available query datasets, which are different from the private
training datasets. Models extracted by Marich yield prediction distributions
that are closer to the target's distribution than those of the existing active
sampling-based algorithms. The extracted models also achieve high accuracy
under membership inference attacks. Experimental results validate that Marich
is query-efficient, and also capable of performing task-accurate,
high-fidelity, and informative model extraction.
Comment: Presented in the Privacy-Preserving AI (PPAI) workshop at AAAI 2023 as a spotlight talk
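As a rough illustration of the query-selection idea, the sketch below scores a public pool by the surrogate's predictive entropy plus its cross-entropy mismatch with the target's outputs, and queries the top-scoring batch. The linear `target_api`, the surrogate, and the learning rate are hypothetical stand-ins for the real black-box API and replica model, not Marich's actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy target (the black box) and surrogate: linear classifiers on 2-D data.
# Only the target's outputs are visible to the attacker, as with a real API.
W_target = rng.normal(size=(2, 3))
W_surrogate = np.zeros((2, 3))

def target_api(X):
    return softmax(X @ W_target)

def select_queries(pool, batch_size=8):
    """Rank public points by surrogate entropy + mismatch with the target:
    query where the replica is uncertain and disagrees with the black box."""
    p_sur = softmax(pool @ W_surrogate)
    p_tgt = target_api(pool)
    entropy = -(p_sur * np.log(p_sur + 1e-12)).sum(axis=1)
    mismatch = -(p_tgt * np.log(p_sur + 1e-12)).sum(axis=1)
    return np.argsort(entropy + mismatch)[-batch_size:]

pool = rng.normal(size=(500, 2))      # public pool, not the private data
for _ in range(20):                   # query-efficient extraction loop
    idx = select_queries(pool)
    X, y = pool[idx], target_api(pool[idx])
    # One soft-label gradient step on the surrogate (cross-entropy).
    W_surrogate += 0.5 * X.T @ (y - softmax(X @ W_surrogate))
```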
How Much Does Each Datapoint Leak Your Privacy? Quantifying the Per-datum Membership Leakage
We study per-datum Membership Inference Attacks (MIAs), where an attacker
aims to infer whether a fixed target datum has been included in the input
dataset of an algorithm and thus violates privacy. First, we define the
membership leakage of a datum as the advantage of the optimal adversary
aiming to identify it. Then, we quantify the per-datum membership leakage
for the empirical mean, and show that it depends on the Mahalanobis distance
between the target datum and the data-generating distribution. We further
assess the effect of two privacy defences, i.e. adding Gaussian noise and
sub-sampling. We quantify exactly how both of them decrease the per-datum
membership leakage. Our analysis builds on a novel proof technique that
combines an Edgeworth expansion of the likelihood ratio test and a
Lindeberg-Feller central limit theorem. Our analysis connects the existing
likelihood ratio and scalar product attacks, and also justifies different
canary selection strategies used in the privacy auditing literature. Finally,
our experiments demonstrate the impacts of the leakage score, the sub-sampling
ratio, and the noise scale on the per-datum membership leakage, as indicated by
the theory.
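To make the headline quantity concrete, here is a hedged numerical sketch: for the empirical mean, a datum's leakage grows with its Mahalanobis distance to the data-generating distribution, and both defences shrink it. The `leakage_proxy` below is a simplified signal-to-noise heuristic under Gaussian assumptions, not the paper's exact Edgeworth-based expression.

```python
import numpy as np

def mahalanobis(z, mean, cov):
    """Mahalanobis distance of datum z to the data distribution."""
    diff = z - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def leakage_proxy(z, mean, cov, n, noise_std=0.0, q=1.0):
    """Simplified leakage heuristic for the empirical mean of n points:
    one datum shifts the mean by (z - mean)/n; whiten that shift by the
    released statistic's covariance. Gaussian noise inflates that
    covariance, and sub-sampling includes z only with probability q."""
    dim = len(mean)
    stat_cov = cov / n + (noise_std ** 2) * np.eye(dim)
    shift = (z - mean) / n
    signal = float(np.sqrt(shift @ np.linalg.solve(stat_cov, shift)))
    return q * signal

rng = np.random.default_rng(1)
mean, cov, n = np.zeros(3), np.eye(3), 1000
typical = rng.normal(size=3)        # near the bulk of the distribution
outlier = 6.0 * np.ones(3)          # atypical datum leaks more
for z in (typical, outlier):
    print(mahalanobis(z, mean, cov),
          leakage_proxy(z, mean, cov, n),
          leakage_proxy(z, mean, cov, n, noise_std=0.1, q=0.5))
```

Without defences the proxy reduces to the Mahalanobis distance divided by sqrt(n), which matches the qualitative message above: outlying data leak more, and noise or sub-sampling attenuates the leakage.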
Stochastic Online Instrumental Variable Regression: Regrets for Endogeneity and Bandit Feedback
Endogeneity, i.e. the dependence of noise and covariates, is a common
phenomenon in real data due to omitted variables, strategic behaviours,
measurement errors, etc. In contrast, the existing analyses of stochastic online
linear regression with unbounded noise and linear bandits depend heavily on
exogeneity, i.e. the independence of noise and covariates. Motivated by this
gap, we study the over- and just-identified Instrumental Variable (IV)
regression, specifically Two-Stage Least Squares, for stochastic online
learning, and propose to use an online variant of Two-Stage Least Squares,
namely O2SLS. We show that, after $T$ interactions, O2SLS achieves
identification and oracle regret guarantees that scale with $d_x$ and $d_z$,
the dimensions of the covariates and IVs, and with $\gamma$, the bias due to
endogeneity. For $\gamma = 0$, i.e. under exogeneity, O2SLS exhibits oracle
regret of the same order as that of stochastic online ridge regression. Then,
we leverage O2SLS as an oracle to design OFUL-IV, a stochastic linear bandit
algorithm to tackle endogeneity. OFUL-IV yields a regret bound that matches the
regret lower bound under exogeneity. For different datasets with endogeneity,
we experimentally demonstrate the efficiency of O2SLS and OFUL-IV.
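A minimal sketch of the online two-stage recipe follows, under assumptions: each round reveals an instrument z_t, an endogenous covariate x_t, and an outcome y_t; stage 1 regresses x on z with a running ridge Gram matrix, and stage 2 regresses y on the instrumented prediction x_hat. The toy confounded data generator is illustrative, not the paper's estimator or experiments.

```python
import numpy as np

class OnlineTwoSLS:
    """Online two-stage least squares with running ridge Gram matrices."""
    def __init__(self, d_x, d_z, lam=1.0):
        self.Szz = lam * np.eye(d_z)      # stage-1 Gram: sum of z z^T
        self.Szx = np.zeros((d_z, d_x))   # sum of z x^T
        self.Sxx = lam * np.eye(d_x)      # stage-2 Gram: sum of x_hat x_hat^T
        self.Sxy = np.zeros(d_x)          # sum of x_hat * y

    def update(self, z, x, y):
        self.Szz += np.outer(z, z)
        self.Szx += np.outer(z, x)
        Theta = np.linalg.solve(self.Szz, self.Szx)  # stage-1 coefficients
        x_hat = Theta.T @ z                          # instrumented covariate
        self.Sxx += np.outer(x_hat, x_hat)
        self.Sxy += x_hat * y
        return np.linalg.solve(self.Sxx, self.Sxy)   # stage-2 estimate

# Toy endogenous stream: a hidden confounder u drives both x and the noise,
# so ordinary online ridge on (x, y) would be biased; the instruments z
# are independent of u.
rng = np.random.default_rng(2)
beta, est = np.array([1.0, -2.0]), None
model = OnlineTwoSLS(d_x=2, d_z=2)
for t in range(5000):
    z = rng.normal(size=2)
    u = rng.normal()
    x = z + 0.8 * u                        # endogenous covariates
    y = x @ beta + u + 0.1 * rng.normal()  # noise correlated with x
    est = model.update(z, x, y)
print(est)   # approaches beta despite the confounding
```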
Interactive and Concentrated Differential Privacy for Bandits
Bandits play a crucial role in interactive learning schemes and modern
recommender systems. However, these systems often rely on sensitive user data,
making privacy a critical concern. This paper investigates privacy in bandits
with a trusted centralized decision-maker through the lens of interactive
Differential Privacy (DP). While bandits under pure $\epsilon$-global DP have
been well-studied, we contribute to the understanding of bandits under
zero-Concentrated DP (zCDP). We provide minimax and problem-dependent lower
bounds on regret for finite-armed and linear bandits, which quantify the cost
of $\rho$-global zCDP in these settings. These lower bounds reveal two hardness
regimes based on the privacy budget $\rho$ and suggest that $\rho$-global zCDP
incurs less regret than pure $\epsilon$-global DP. We propose two
$\rho$-global zCDP bandit algorithms, AdaC-UCB and AdaC-GOPE, for finite-armed and linear
bandits respectively. Both algorithms use a common recipe of Gaussian mechanism
and adaptive episodes. We analyze the regret of these algorithms to show that
AdaC-UCB achieves the problem-dependent regret lower bound up to multiplicative
constants, while AdaC-GOPE achieves the minimax regret lower bound up to
poly-logarithmic factors. Finally, we provide experimental validation of our
theoretical results under different settings.
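The common recipe named above, the Gaussian mechanism plus adaptive episodes, admits a short sketch for the finite-armed case. Below, each arm's empirical mean is released with calibrated Gaussian noise only when the arm's count doubles, so each reward influences only logarithmically many releases. The noise calibration assumes rewards in [0, 1] (the toy Gaussian rewards are purely illustrative), and the bonus constant is not the paper's tuned one.

```python
import numpy as np

def adac_ucb_sketch(means, horizon=20000, rho=0.1,
                    rng=np.random.default_rng(3)):
    k = len(means)
    counts = np.zeros(k, int)
    sums = np.zeros(k)
    next_release = np.ones(k, int)     # per-arm doubling schedule
    private_mean = np.zeros(k)
    # Gaussian mechanism for rho-zCDP: a mean of n rewards in [0, 1]
    # has sensitivity 1/n, so its noise std is (1/n) / sqrt(2 * rho).
    sigma = 1.0 / np.sqrt(2 * rho)
    for t in range(horizon):
        bonus = np.sqrt(4 * np.log(max(t, 2)) / np.maximum(counts, 1))
        arm = t if t < k else int(np.argmax(private_mean + bonus))
        reward = rng.normal(means[arm], 1.0)   # toy reward model
        counts[arm] += 1
        sums[arm] += reward
        if counts[arm] >= next_release[arm]:   # episode boundary
            noise = rng.normal(0, sigma / counts[arm])
            private_mean[arm] = sums[arm] / counts[arm] + noise
            next_release[arm] *= 2
        # rewards between releases never leave the private buffer
    return counts

print(adac_ucb_sketch(np.array([0.1, 0.5, 0.9])))  # best arm dominates
```

Releasing statistics only at doubling episode boundaries is what keeps the privacy cost low: the total noise injected grows with the number of releases, not with the horizon.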
SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics
Although Reinforcement Learning (RL) is effective for sequential
decision-making problems under uncertainty, it still fails to thrive in
real-world systems where risk or safety is a binding constraint. In this paper,
we formulate the RL problem with safety constraints as a non-zero-sum game.
When deployed with maximum entropy RL, this formulation leads to a safe
adversarially guided soft actor-critic framework, called SAAC. In SAAC, the
adversary aims to break the safety constraint while the RL agent aims to
maximize the constrained value function given the adversary's policy. The
safety constraint on the agent's value function manifests only as a repulsion
term between the agent's and the adversary's policies. Unlike previous
approaches, SAAC can address different safety criteria such as safe
exploration, mean-variance risk sensitivity, and CVaR-like coherent risk
sensitivity. We illustrate the design of the adversary for these constraints.
Then, in each of these variations, we show the agent differentiates itself from
the adversary's unsafe actions in addition to learning to solve the task.
Finally, for challenging continuous control tasks, we demonstrate that SAAC
achieves faster convergence, better efficiency, and fewer failures to satisfy
the safety constraints than risk-averse distributional RL and risk-neutral soft
actor-critic algorithms.
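The repulsion mechanism admits a compact illustration. In the sketch below, agent and adversary policies are diagonal Gaussians, so the KL repulsion term is closed-form, and the agent's objective adds it to a (stubbed) task value and an entropy bonus. The coefficients and the scalar `q_value` are placeholders, not SAAC's learned critics.

```python
import numpy as np

def kl_diag_gauss(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ), summed over dimensions."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return float(np.sum(np.log(std_q / std_p)
                        + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5))

def saac_agent_objective(mu, std, mu_adv, std_adv, q_value,
                         alpha=0.2, beta=1.0):
    """Maximise task value + entropy + repulsion from the unsafe policy:
    the safety constraint enters only through the KL repulsion term."""
    entropy = float(np.sum(np.log(std) + 0.5 * np.log(2 * np.pi * np.e)))
    repulsion = kl_diag_gauss(mu, std, mu_adv, std_adv)
    return q_value + alpha * entropy + beta * repulsion

# The adversary's policy marks the unsafe mode; actions far from it score
# higher, all else being equal.
mu_adv, std_adv = np.array([1.0, -1.0]), np.array([0.3, 0.3])
for mu in (np.array([1.0, -1.0]), np.array([-0.5, 0.5])):
    print(saac_agent_objective(mu, np.array([0.3, 0.3]),
                               mu_adv, std_adv, q_value=1.0))
```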