234 research outputs found
When Are Linear Stochastic Bandits Attackable?
We study adversarial attacks on linear stochastic bandits: by manipulating
the rewards, an adversary aims to control the behaviour of the bandit
algorithm. Perhaps surprisingly, we first show that some attack goals can never
be achieved. This is in sharp contrast to context-free stochastic bandits, and
is intrinsically due to the correlation among arms in linear stochastic
bandits. Motivated by this finding, this paper studies the attackability of a
-armed linear bandit environment. We first provide a complete necessity and
sufficiency characterization of attackability based on the geometry of the
arms' context vectors. We then propose a two-stage attack method against LinUCB
and Robust Phase Elimination. The method first determines whether the given
environment is attackable; if so, it poisons the rewards to force the
algorithm to pull a target arm a linear number of times using only a
sublinear cost. Numerical experiments further validate the effectiveness and
cost-efficiency of the proposed attack method.
Comment: 27 pages, 3 figures, ICML 2022
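To make the geometric flavor of this characterization concrete, here is a minimal feasibility check in the same spirit, written as a sketch rather than the paper's exact condition: it asks whether some parameter $\theta'$ can make a chosen target arm strictly optimal while leaving that arm's own mean reward unchanged. The margin eps, the equality constraint, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def attack_feasible(arms, theta, target, eps=1e-3):
    """Check whether some theta' makes `target` strictly optimal (by eps)
    while keeping the target arm's mean reward fixed. Illustrative only."""
    k, d = arms.shape
    x_star = arms[target]
    # (x_i - x_star) . theta' <= -eps for every non-target arm i
    A_ub = np.array([arms[i] - x_star for i in range(k) if i != target])
    b_ub = -eps * np.ones(k - 1)
    # Keep the target arm's mean reward unchanged (an assumed constraint)
    A_eq = x_star.reshape(1, -1)
    b_eq = np.array([x_star @ theta])
    res = linprog(np.zeros(d), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * d)
    return res.success

arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
theta = np.array([1.0, 0.2])
print(attack_feasible(arms, theta, target=1))  # True
print(attack_feasible(arms, theta, target=2))  # False
```

The second call echoes the abstract's observation that some attack goals can never be achieved: an arm whose context vector is the average of two other arms cannot be made strictly optimal under any parameter, precisely because of the correlation among arms.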
Adversarial Attacks on Linear Contextual Bandits
Contextual bandit algorithms are applied in a wide range of domains, from
advertising to recommender systems, from clinical trials to education. In many
of these domains, malicious agents may have incentives to attack the bandit
algorithm to induce it to perform a desired behavior. For instance, an
unscrupulous ad publisher may try to increase their own revenue at the expense
of the advertisers; a seller may want to increase the exposure of their
products, or thwart a competitor's advertising campaign. In this paper, we
study several attack scenarios and show that a malicious agent can force a
linear contextual bandit algorithm to pull any desired arm $T - o(T)$ times
over a horizon of $T$ steps, while applying adversarial modifications to either
rewards or contexts that only grow logarithmically as $O(\log T)$. We also
investigate the case when a malicious agent is interested in affecting the
behavior of the bandit algorithm in a single context (e.g., a specific user).
We first provide sufficient conditions for the feasibility of the attack and we
then propose an efficient algorithm to perform the attack. We validate our
theoretical results on experiments performed on both synthetic and real-world
datasets.
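As a rough illustration of the reward-modification scenario, the sketch below runs a LinUCB-style learner whose non-target rewards are depressed below the target arm's mean. It assumes a fixed arm set, Gaussian noise, and an attacker who knows the true arm means, and it uses a cruder perturbation rule than the paper's attack (which is tuned to guarantee logarithmic cost); all constants and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, target = 4, 6, 5000, 0
theta = rng.normal(size=d)
arms = rng.normal(size=(k, d))

lam, alpha = 1.0, 1.0
A, b = lam * np.eye(d), np.zeros(d)   # LinUCB ridge statistics

pulls = np.zeros(k, dtype=int)
attack_cost = 0.0
for t in range(T):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    bonus = alpha * np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
    i = int(np.argmax(arms @ theta_hat + bonus))
    pulls[i] += 1
    r = arms[i] @ theta + rng.normal(scale=0.1)
    if i != target:
        # Depress the observed reward below the target arm's true mean
        r_poisoned = min(r, arms[target] @ theta - 0.5)
        attack_cost += r - r_poisoned
        r = r_poisoned
    A += np.outer(arms[i], arms[i])
    b += r * arms[i]

print(f"target pulled {pulls[target]}/{T} times; attack cost {attack_cost:.1f}")
```

Under this rule every non-target arm looks suboptimal by a fixed margin, so non-target pulls, and hence the attack cost, stop growing linearly in $T$.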
Interactive and Concentrated Differential Privacy for Bandits
Bandits play a crucial role in interactive learning schemes and modern
recommender systems. However, these systems often rely on sensitive user data,
making privacy a critical concern. This paper investigates privacy in bandits
with a trusted centralized decision-maker through the lens of interactive
Differential Privacy (DP). While bandits under pure $\epsilon$-global DP have
been well-studied, we contribute to the understanding of bandits under zero
Concentrated DP (zCDP). We provide minimax and problem-dependent lower bounds
on regret for finite-armed and linear bandits, which quantify the cost of
$\rho$-global zCDP in these settings. These lower bounds reveal two hardness
regimes based on the privacy budget and suggest that $\rho$-global zCDP
incurs less regret than pure $\epsilon$-global DP. We propose two $\rho$-global
zCDP bandit algorithms, AdaC-UCB and AdaC-GOPE, for finite-armed and linear
bandits respectively. Both algorithms use a common recipe of Gaussian mechanism
and adaptive episodes. We analyze the regret of these algorithms to show that
AdaC-UCB achieves the problem-dependent regret lower bound up to multiplicative
constants, while AdaC-GOPE achieves the minimax regret lower bound up to
poly-logarithmic factors. Finally, we provide experimental validation of our
theoretical results under different settings.
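The "Gaussian mechanism plus adaptive episodes" recipe can be sketched as follows. The zCDP calibration is standard (adding Gaussian noise with $\sigma = \Delta / \sqrt{2\rho}$ to a sensitivity-$\Delta$ query satisfies $\rho$-zCDP); the doubling episode schedule, optimistic initialization, and bonus are my own simplifications rather than the actual AdaC-UCB pseudocode.

```python
import numpy as np

rng = np.random.default_rng(1)

def private_mean(rewards, rho):
    """rho-zCDP release of the mean of rewards in [0, 1]: the Gaussian
    mechanism with sigma = sensitivity / sqrt(2 * rho) satisfies rho-zCDP."""
    n = len(rewards)
    sigma = (1.0 / n) / np.sqrt(2 * rho)
    return float(np.mean(rewards) + rng.normal(scale=sigma))

# Toy episodic private UCB: an arm's estimate is refreshed only when its
# pull count doubles, so each arm is released O(log T) times. Privacy
# composition across these releases is ignored here for brevity.
means, rho, T = [0.9, 0.5, 0.4], 0.1, 20000
k = len(means)
data = [[] for _ in range(k)]
est, next_release = [1.0] * k, [1] * k          # optimistic initialization
for t in range(T):
    ucb = [est[a] + np.sqrt(2 * np.log(t + 1) / max(1, len(data[a])))
           for a in range(k)]
    i = int(np.argmax(ucb))
    data[i].append(float(rng.random() < means[i]))  # Bernoulli reward
    if len(data[i]) >= next_release[i]:             # episode boundary
        est[i] = private_mean(data[i], rho)
        next_release[i] *= 2
print("pulls per arm:", [len(d) for d in data])
```

Updating estimates only at episode boundaries is what keeps the privacy cost small: each sample participates in only logarithmically many releases.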
Corruption-Robust Offline Reinforcement Learning with General Function Approximation
We investigate the problem of corruption robustness in offline reinforcement
learning (RL) with general function approximation, where an adversary can
corrupt each sample in the offline dataset, and the corruption level
$\zeta \geq 0$ quantifies the cumulative corruption amount over $n$ episodes
and $H$ steps. Our goal is to find a policy that is robust to such corruption and
minimizes the suboptimality gap with respect to the optimal policy for the
uncorrupted Markov decision processes (MDPs). Drawing inspiration from the
uncertainty-weighting technique from the robust online RL setting
\citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight
iteration procedure to efficiently compute uncertainty weights on batched
samples and propose a
corruption-robust algorithm for offline RL. Notably, under the assumption of
single policy coverage and the knowledge of $\zeta$, our proposed algorithm
achieves a suboptimality bound that is worsened by an additive factor of
$\mathcal{O}(\zeta \cdot (\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z}_n))^{1/2} \cdot (C(\hat{\mathcal{F}}, \mu))^{-1/2} \cdot n^{-1})$
due to the corruption. Here $\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z}_n)$
is the coverage coefficient that depends on the regularization parameter
$\lambda$, the confidence set $\hat{\mathcal{F}}$, and the dataset
$\mathcal{Z}_n$, and $C(\hat{\mathcal{F}}, \mu)$ is a coefficient that depends
on $\hat{\mathcal{F}}$ and the underlying data distribution $\mu$. When
specialized to linear MDPs, the corruption-dependent error term reduces to
$\mathcal{O}(\zeta \cdot d \cdot n^{-1})$ with $d$ being the dimension of the
feature map, which matches the existing lower bound for corrupted linear MDPs.
This suggests that our analysis is tight in terms of the corruption-dependent
term.
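For intuition, here is what an uncertainty-weight iteration can look like in the linear special case. This is a sketch under my own assumptions (the paper's procedure operates on general function classes, and the threshold alpha, stopping rule, and weighted ridge step below are illustrative), but it shows the fixed-point flavor: weights shrink for samples whose features are uncertain under the current weighted covariance.

```python
import numpy as np

def uncertainty_weights(X, lam=1.0, alpha=1.0, n_iter=50):
    """X: (n, d) features. Iterate w_i = min(1, alpha / ||x_i||_{Sigma_w^-1})
    to a fixed point, where Sigma_w is the weighted covariance."""
    n, d = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        cov = lam * np.eye(d) + (w[:, None] * X).T @ X
        u = np.sqrt(np.einsum('ij,jk,ik->i', X, np.linalg.inv(cov), X))
        w_new = np.minimum(1.0, alpha / u)   # down-weight uncertain samples
        if np.max(np.abs(w_new - w)) < 1e-8:
            break
        w = w_new
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
w = uncertainty_weights(X)
# Weighted ridge regression using the converged weights
theta_hat = np.linalg.solve(1.0 * np.eye(5) + (w[:, None] * X).T @ X,
                            X.T @ (w * y))
```

Down-weighting uncertain samples limits how much a few corrupted points in poorly covered directions can pull the estimate, which is the mechanism behind the corruption-dependent term above.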
Reward Poisoning in Reinforcement Learning: Attacks Against Unknown Learners in Unknown Environments
We study black-box reward poisoning attacks against reinforcement learning (RL), in which an adversary aims to manipulate the rewards to mislead a sequence of RL agents with unknown algorithms to learn a nefarious policy in an environment unknown to the adversary a priori. That is, our attack makes minimal assumptions about the adversary's prior knowledge: it has no initial knowledge of the environment or the learner, and it does not observe the learner's internal mechanism except for its performed actions. We design a novel black-box attack, U2, that provably achieves performance near-matching that of the state-of-the-art white-box attack, demonstrating the feasibility of reward poisoning even in the most challenging black-box setting.
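A toy version of the reward-poisoning interface in this black-box setting might look like the following; the tabular statistics and margin rule are assumptions for illustration and do not reproduce the actual U2 attack, which must additionally handle exploration and unknown learners with provable guarantees.

```python
from collections import defaultdict

class RewardPoisoner:
    """Toy black-box poisoner: sees only (state, action, reward) tuples,
    keeps empirical reward estimates, and lowers the reward of any action
    that deviates from the desired target policy."""

    def __init__(self, target_policy, margin=0.2):
        self.target = target_policy        # dict: state -> desired action
        self.margin = margin
        self.sum_r = defaultdict(float)    # empirical reward statistics
        self.count = defaultdict(int)

    def poison(self, state, action, reward):
        """Return the reward the learner will actually observe."""
        self.sum_r[(state, action)] += reward
        self.count[(state, action)] += 1
        if action == self.target[state]:
            return reward                  # leave target actions untouched
        sa = (state, self.target[state])
        est = self.sum_r[sa] / self.count[sa] if self.count[sa] else 0.0
        # Push deviating actions below the target action's estimated reward
        return min(reward, est - self.margin)

poisoner = RewardPoisoner(target_policy={0: 1, 1: 0})
print(poisoner.poison(0, 1, 0.3))   # target action: passed through
print(poisoner.poison(0, 0, 0.9))   # deviation: depressed below estimate
```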