Secure-UCB: Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification
This paper studies bandit algorithms under data poisoning attacks in a
bounded reward setting. We consider a strong attacker model in which the
attacker can observe both the selected actions and their corresponding rewards,
and can contaminate the rewards with additive noise. We show that \emph{any}
bandit algorithm with regret $O(\log T)$ can be forced to suffer a regret $\Omega(T)$
with an expected amount of contamination $O(\log T)$. This amount
of contamination is also necessary, as we prove that there exists an
$O(\log T)$ regret bandit algorithm, specifically the classical UCB, that requires
$\Omega(\log T)$ amount of contamination to suffer regret $\Omega(T)$. To
combat such poisoning attacks, our second main contribution is to propose a novel
algorithm, Secure-UCB, which uses limited \emph{verification} to access a
limited number of uncontaminated rewards. We show that with an $O(\log T)$
expected number of verifications, Secure-UCB can restore the order-optimal
regret $O(\log T)$, \emph{irrespective of the amount of contamination} used by
the attacker. Finally, we prove that for any bandit algorithm, this number of
verifications $O(\log T)$ is necessary to recover the order-optimal regret. We
can then conclude that Secure-UCB is order-optimal in terms of both the
expected regret and the expected number of verifications, and can save
stochastic bandits from any data poisoning attack.
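The interplay between UCB index maximization and occasional verification can be sketched in a toy simulation. This is only an illustration of the idea, not the paper's Secure-UCB: the deviation-based verification trigger, the Gaussian reward model, and all names and parameters here are assumptions for the sketch.

```python
import math
import random

def ucb_with_verification(true_means, T, attack=None, verify_gap=2.0):
    """Toy sketch: UCB that re-draws a reward from a trusted source
    (a 'verification') when the observed reward deviates suspiciously
    far from the arm's empirical mean.  Illustrative only."""
    K = len(true_means)
    counts = [0] * K
    means = [0.0] * K
    verifications = 0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1          # pull each arm once to initialize
        else:
            # pick the arm maximizing the UCB index
            arm = max(range(K),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = min(1.0, max(0.0, random.gauss(true_means[arm], 0.1)))
        if attack is not None:
            reward = attack(arm, reward)   # attacker may add bounded noise
        # Heuristic trigger (an assumption, not the paper's rule):
        # verify when the reward deviates far beyond the confidence width.
        if counts[arm] > 0 and abs(reward - means[arm]) > verify_gap * math.sqrt(
                math.log(t) / counts[arm]):
            reward = min(1.0, max(0.0, random.gauss(true_means[arm], 0.1)))
            verifications += 1
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, verifications
```

With no attacker the verification budget stays small, since honest rewards rarely exceed the confidence-width trigger.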
Non-stationary Online Learning with Memory and Non-stochastic Control
We study the problem of Online Convex Optimization (OCO) with memory, which
allows loss functions to depend on past decisions and thus captures temporal
effects of learning problems. In this paper, we introduce dynamic policy regret
as the performance measure to design algorithms robust to non-stationary
environments, which benchmarks the algorithm's decisions against a sequence of changing
comparators. We propose a novel algorithm for OCO with memory that provably
enjoys an optimal dynamic policy regret. The key technical challenge is how to
control the switching cost, i.e., the cumulative movement of the player's decisions,
which is neatly addressed by a novel decomposition of dynamic policy regret and
an appropriate meta-expert structure. Furthermore, we apply the results to the
problem of online non-stochastic control, i.e., controlling a linear dynamical
system with adversarial disturbance and convex loss functions. We derive a
novel gradient-based controller with dynamic policy regret guarantees, which is
the first controller competitive against a sequence of changing policies.
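The switching-cost bookkeeping described above can be illustrated with a bare projected-gradient sketch. This omits the paper's meta-expert structure entirely; the function names, the box constraint, and the `loss_grad(t, window)` interface are all assumptions made for the illustration.

```python
import numpy as np

def ogd_with_memory(loss_grad, T, dim, eta=0.1, memory=2):
    """Toy sketch of online gradient descent for OCO with memory:
    the loss at round t may depend on the last `memory` decisions,
    and we track the cumulative movement (switching cost) of the
    player's decisions.  Illustrative only."""
    x = np.zeros(dim)
    window = [x.copy()] * memory          # recent decisions the loss may depend on
    switching_cost = 0.0
    for t in range(T):
        g = loss_grad(t, window)          # gradient w.r.t. the latest decision
        x_new = np.clip(x - eta * g, -1.0, 1.0)   # projected step onto a box
        switching_cost += float(np.linalg.norm(x_new - x))  # decision movement
        x = x_new
        window = window[1:] + [x.copy()]
    return x, switching_cost
```

A drifting quadratic loss (comparator moving from $0.5$ to $-0.5$ mid-stream) shows the tension the abstract highlights: tracking a changing comparator necessarily incurs switching cost.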
Competing Against Adaptive Agents by Minimizing Counterfactual Notions of Regret
Online learning, or sequential decision making, is formally defined as a repeated game between an adversary and a player. At every round of the game the player chooses an action from a fixed action set and the adversary reveals a reward/loss for the action played. The goal of the player is to maximize the cumulative reward of her actions. The rewards/losses could be sampled from an unknown distribution, or other less restrictive assumptions can be made. The standard measure of performance is the cumulative regret, that is, the difference between the cumulative reward of the player and the best achievable reward by a fixed action, or more generally a fixed policy, on the observed reward sequence.

For adversaries that are oblivious to the player's strategy, regret is a meaningful measure. However, the adversary is usually adaptive: in healthcare a patient will respond to given treatments, and for self-driving cars other traffic will react to the behavior of the autonomous agent. In such settings the notion of regret is hard to interpret, as the best action in hindsight might not be the best action overall, given the behavior of the adversary.

To resolve this problem a new notion called policy regret is introduced. Policy regret is fundamentally different from other forms of regret as it is counterfactual in nature, i.e., the player competes against all other policies whose reward is calculated by taking into account how the adversary would have behaved had the player chosen that policy. This thesis studies policy regret in a partial (bandit) feedback environment, beyond the worst-case setting, by leveraging additional structure such as stochasticity/stability of the adversary or additional feedback.
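The counterfactual nature of policy regret can be made concrete with a small sketch: each alternative policy is replayed against a fresh copy of the stateful, adaptive adversary, so its reward reflects how the adversary would have reacted to that policy. The `adversary_factory` / `reward(action)` interface below is an assumption made for the illustration, not from the thesis.

```python
def policy_regret(policies, adversary_factory, play, T):
    """Toy sketch of policy regret against an adaptive adversary.
    Unlike external regret, which reuses the observed reward sequence,
    each alternative policy is evaluated on a *fresh* adversary, so
    its rewards are counterfactual."""
    def run(policy):
        adv = adversary_factory()   # fresh adversary state per counterfactual rollout
        total = 0.0
        for t in range(T):
            total += adv.reward(policy(t))
        return total

    player_reward = run(play)
    best_alternative = max(run(p) for p in policies)
    return best_alternative - player_reward

class ConsistencyAdversary:
    """Adaptive adversary (an illustrative choice): rewards repeating
    the previous action, so the best policy depends on the player's
    own past behavior."""
    def __init__(self):
        self.last = None
    def reward(self, a):
        r = 1.0 if a == self.last else 0.0
        self.last = a
        return r
```

Against this adversary, a player who alternates actions earns nothing, while the constant policy, replayed counterfactually, earns a reward nearly every round; policy regret captures exactly this gap, which external regret on the observed sequence would miss.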