
    When Are Linear Stochastic Bandits Attackable?

    We study adversarial attacks on linear stochastic bandits: by manipulating the rewards, an adversary aims to control the behaviour of the bandit algorithm. Perhaps surprisingly, we first show that some attack goals can never be achieved. This is in sharp contrast to context-free stochastic bandits, and is intrinsically due to the correlation among arms in linear stochastic bandits. Motivated by this finding, this paper studies the attackability of a $k$-armed linear bandit environment. We first provide a complete necessary and sufficient characterization of attackability based on the geometry of the arms' context vectors. We then propose a two-stage attack method against LinUCB and Robust Phase Elimination. The method first determines whether the given environment is attackable; if so, it poisons the rewards to force the algorithm to pull a target arm a linear number of times at only a sublinear cost. Numerical experiments further validate the effectiveness and cost-efficiency of the proposed attack method.
    Comment: 27 pages, 3 figures, ICML 2022
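
    For context, the victim learner targeted here, LinUCB, maintains a ridge-regression estimate of the reward parameter and pulls the arm with the highest optimistic index; the attack manipulates the rewards that this learner observes. Below is a minimal sketch of standard LinUCB (a simplified reference implementation, not the paper's code; `lam` and `alpha` are generic ridge and exploration parameters).

```python
# Minimal LinUCB sketch (standard algorithm, simplified). The attacker in this
# setting controls the rewards returned by `reward_fn`.
import numpy as np

def linucb(arms, reward_fn, T, lam=1.0, alpha=1.0):
    """arms: (K, d) array of context vectors; reward_fn(i) returns the observed
    (possibly poisoned) reward for pulling arm i."""
    K, d = arms.shape
    A = lam * np.eye(d)              # regularized design matrix
    b = np.zeros(d)                  # reward-weighted sum of pulled contexts
    pulls = np.zeros(K, dtype=int)
    for _ in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b        # ridge estimate of the unknown parameter
        # Optimistic index: estimated reward plus exploration bonus.
        ucb = arms @ theta_hat + alpha * np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        i = int(np.argmax(ucb))
        r = reward_fn(i)
        A += np.outer(arms[i], arms[i])
        b += r * arms[i]
        pulls[i] += 1
    return pulls
```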

    Adversarial Attacks on Linear Contextual Bandits

    Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a seller may want to increase the exposure of their products, or thwart a competitor's advertising campaign. In this paper, we study several attack scenarios and show that a malicious agent can force a linear contextual bandit algorithm to pull any desired arm $T - o(T)$ times over a horizon of $T$ steps, while applying adversarial modifications to either rewards or contexts that only grow logarithmically as $O(\log T)$. We also investigate the case when a malicious agent is interested in affecting the behavior of the bandit algorithm in a single context (e.g., a specific user). We first provide sufficient conditions for the feasibility of the attack and then propose an efficient algorithm to perform it. We validate our theoretical results with experiments on both synthetic and real-world datasets.
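
    The intuition behind the $T - o(T)$ pulls at $O(\log T)$ cost is that a no-regret learner pulls arms it believes suboptimal only logarithmically often, so an attacker who perturbs rewards only on non-target pulls pays a logarithmic total cost. The sketch below illustrates that generic reward-poisoning idea; the margin `delta` and the simple "push below the target" rule are illustrative choices, not the paper's exact attack.

```python
# Illustrative reward-poisoning wrapper (generic idea, not the paper's algorithm):
# leave the target arm untouched and push every other arm's observed reward below
# the target's running mean, so the target looks optimal to the learner.
class RewardPoisoner:
    def __init__(self, target_arm, delta=0.1):
        self.target = target_arm
        self.delta = delta            # hypothetical attack margin
        self.cost = 0.0               # cumulative magnitude of reward modifications
        self.target_mean = 0.0        # running estimate of the target arm's mean reward
        self.target_pulls = 0

    def perturb(self, arm, true_reward):
        if arm == self.target:
            self.target_pulls += 1
            self.target_mean += (true_reward - self.target_mean) / self.target_pulls
            return true_reward                        # never modify the target arm
        fake = min(true_reward, self.target_mean - self.delta)
        self.cost += abs(true_reward - fake)          # cost accrues only on non-target pulls
        return fake
```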

    Interactive and Concentrated Differential Privacy for Bandits

    Bandits play a crucial role in interactive learning schemes and modern recommender systems. However, these systems often rely on sensitive user data, making privacy a critical concern. This paper investigates privacy in bandits with a trusted centralized decision-maker through the lens of interactive Differential Privacy (DP). While bandits under pure $\epsilon$-global DP have been well-studied, we contribute to the understanding of bandits under zero-Concentrated DP (zCDP). We provide minimax and problem-dependent lower bounds on regret for finite-armed and linear bandits, which quantify the cost of $\rho$-global zCDP in these settings. These lower bounds reveal two hardness regimes based on the privacy budget $\rho$ and suggest that $\rho$-global zCDP incurs less regret than pure $\epsilon$-global DP. We propose two $\rho$-global zCDP bandit algorithms, AdaC-UCB and AdaC-GOPE, for finite-armed and linear bandits respectively. Both algorithms use a common recipe of the Gaussian mechanism and adaptive episodes. We analyze the regret of these algorithms to show that AdaC-UCB achieves the problem-dependent regret lower bound up to multiplicative constants, while AdaC-GOPE achieves the minimax regret lower bound up to poly-logarithmic factors. Finally, we provide experimental validation of our theoretical results under different settings.
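
    The common recipe mentioned in the abstract combines the Gaussian mechanism with adaptive episodes: statistics collected during an episode are released once, with calibrated Gaussian noise. The sketch below shows only the standard zCDP calibration of such a release ($\sigma = \Delta / \sqrt{2\rho}$ yields $\rho$-zCDP for a query with $\ell_2$-sensitivity $\Delta$); the per-arm release shown is a simplification, not AdaC-UCB or AdaC-GOPE themselves.

```python
# Gaussian mechanism calibrated for rho-zCDP: sigma = sensitivity / sqrt(2 * rho).
# Sketch of privatizing one episode's per-arm reward sums; this simplified release
# is not the AdaC-UCB / AdaC-GOPE algorithms themselves.
import numpy as np

def private_episode_sums(reward_sums, sensitivity, rho, rng=None):
    """reward_sums: per-arm reward sums from one episode (array-like).
    sensitivity: L2 sensitivity of the released vector (e.g., the reward range
    when one user contributes a single bounded reward). Returns a rho-zCDP release."""
    rng = rng or np.random.default_rng()
    sigma = sensitivity / np.sqrt(2.0 * rho)
    return np.asarray(reward_sums, dtype=float) + rng.normal(0.0, sigma, size=np.shape(reward_sums))
```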

    Corruption-Robust Offline Reinforcement Learning with General Function Approximation

    We investigate the problem of corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level $\zeta \geq 0$ quantifies the cumulative corruption amount over $n$ episodes and $H$ steps. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs). Drawing inspiration from the uncertainty-weighting technique from the robust online RL setting \citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight iteration procedure to efficiently compute on batched samples and propose a corruption-robust algorithm for offline RL. Notably, under the assumption of single policy coverage and the knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal{O}(\zeta \cdot (\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H))^{1/2} (C(\hat{\mathcal F},\mu))^{-1/2} n^{-1})$ due to the corruption. Here $\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H)$ is the coverage coefficient that depends on the regularization parameter $\lambda$, the confidence set $\hat{\mathcal F}$, and the dataset $\mathcal Z_n^H$, and $C(\hat{\mathcal F},\mu)$ is a coefficient that depends on $\hat{\mathcal F}$ and the underlying data distribution $\mu$. When specialized to linear MDPs, the corruption-dependent error term reduces to $\mathcal{O}(\zeta d n^{-1})$ with $d$ being the dimension of the feature map, which matches the existing lower bound for corrupted linear MDPs. This suggests that our analysis is tight in terms of the corruption-dependent term.
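
    In the linear special case, uncertainty weighting amounts to a weighted ridge regression in which samples with a large uncertainty bonus (directions poorly covered by the data, where corruption is most damaging) are downweighted; because the covariance depends on the weights, the weights are recomputed iteratively. The sketch below is an illustrative simplification of that idea with a hypothetical threshold `alpha`, not the paper's exact weight-iteration procedure or its general-function-approximation form.

```python
# Illustrative uncertainty-weighted ridge regression (linear case only):
# downweight samples whose uncertainty bonus is large, then re-solve.
import numpy as np

def uncertainty_weighted_ridge(X, y, lam=1.0, alpha=1.0, n_iters=10):
    """X: (n, d) feature matrix, y: (n,) targets; returns (theta_hat, weights)."""
    n, d = X.shape
    w = np.ones(n)
    for _ in range(n_iters):
        Sigma = lam * np.eye(d) + (X * w[:, None]).T @ X                       # weighted covariance
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X))  # per-sample uncertainty
        w = np.minimum(1.0, alpha / np.maximum(bonus, 1e-12))                  # shrink uncertain samples
    Sigma = lam * np.eye(d) + (X * w[:, None]).T @ X
    theta_hat = np.linalg.solve(Sigma, (X * w[:, None]).T @ y)                 # weighted ridge solution
    return theta_hat, w
```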

    Reward Poisoning in Reinforcement Learning: Attacks Against Unknown Learners in Unknown Environments

    We study black-box reward poisoning attacks against reinforcement learning (RL), in which an adversary aims to manipulate the rewards to mislead a sequence of RL agents with unknown algorithms into learning a nefarious policy in an environment unknown to the adversary a priori. That is, our attack makes minimal assumptions about the prior knowledge of the adversary: it has no initial knowledge of the environment or the learner, nor does it observe the learner's internal mechanism except for its performed actions. We design a novel black-box attack, U2, that provably achieves performance nearly matching the state-of-the-art white-box attack, demonstrating the feasibility of reward poisoning even in the most challenging black-box setting.
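
    In this black-box threat model the adversary sits between the environment and the learner: it sees only the state, the learner's action, and the true reward, and returns a perturbed reward within a per-step budget. The sketch below illustrates that interface with a generic "reward the target action, penalize the rest" rule; it is a hypothetical illustration of the setting, not the U2 attack.

```python
# Generic black-box reward-poisoning interface (illustration of the threat model,
# not the U2 attack): perturb only the observed reward, within a per-step budget,
# to nudge the learner toward a hypothetical target policy.
class BlackBoxRewardAttacker:
    def __init__(self, target_policy, budget_per_step=1.0):
        self.target_policy = target_policy     # maps state -> action the adversary wants taught
        self.budget = budget_per_step
        self.total_corruption = 0.0            # cumulative attack cost

    def perturb(self, state, action, reward):
        # Boost rewards for the target action, suppress all others.
        if action == self.target_policy(state):
            poisoned = reward + self.budget
        else:
            poisoned = reward - self.budget
        self.total_corruption += abs(poisoned - reward)
        return poisoned
```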