590 research outputs found
Parameter-free locally differentially private stochastic subgradient descent
https://arxiv.org/pdf/1911.09564.pdf
Published version
Online Learning for Changing Environments using Coin Betting
A key challenge in online learning is that classical algorithms can be slow
to adapt to changing environments. Recent studies have proposed "meta"
algorithms that convert any online learning algorithm to one that is adaptive
to changing environments, where the adaptivity is analyzed in a quantity called
the strongly-adaptive regret. This paper describes a new meta algorithm that
has a strongly-adaptive regret bound that is a factor of $\sqrt{\log(T)}$
better than other algorithms with the same time complexity, where $T$ is the
time horizon. We also extend our algorithm to achieve a first-order (i.e.,
dependent on the observed losses) strongly-adaptive regret bound for the first
time, to our knowledge. At its heart is a new parameter-free algorithm for the
learning with expert advice (LEA) problem in which experts sometimes do not
output advice for consecutive time steps (i.e., \emph{sleeping} experts). This
algorithm is derived by a reduction from optimal algorithms for the so-called
coin betting problem. Empirical results show that our algorithm outperforms
state-of-the-art methods in both learning with expert advice and metric
learning scenarios.
Comment: submitted to a journal. arXiv admin note: substantial text overlap
with arXiv:1610.0457
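The coin-betting machinery the abstract refers to can be illustrated with the Krichevsky-Trofimov (KT) bettor, the standard parameter-free betting scheme underlying such reductions. This is a minimal sketch of the bettor alone, not the paper's sleeping-experts construction; the function name `kt_bet` is illustrative.

```python
# Minimal Krichevsky-Trofimov (KT) coin bettor. Outcomes c_t lie in [-1, 1];
# the bettor wagers a fraction beta_t = (sum of past outcomes) / t of its
# current wealth, so no learning rate needs to be tuned.

def kt_bet(outcomes, initial_wealth=1.0):
    """Run the KT bettor on a sequence of coin outcomes; return final wealth."""
    wealth = initial_wealth
    running_sum = 0.0
    for t, c in enumerate(outcomes, start=1):
        beta = running_sum / t      # betting fraction, always in (-1, 1)
        wager = beta * wealth       # signed amount wagered this round
        wealth += c * wager         # gain or lose proportionally to c
        running_sum += c
    return wealth

# On a biased coin the bettor's wealth grows exponentially; coin-betting
# reductions convert this wealth growth into low regret for learning with
# expert advice.
final = kt_bet([1.0] * 20)
```

Because the betting fraction satisfies |beta_t| < 1, the wealth can never go negative, which is what makes the exponential-growth-implies-low-regret argument go through.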
Data Poisoning Attacks in Contextual Bandits
We study offline data poisoning attacks in contextual bandits, a class of
reinforcement learning problems with important applications in online
recommendation and adaptive medical treatment, among others. We provide a
general attack framework based on convex optimization and show that by slightly
manipulating rewards in the data, an attacker can force the bandit algorithm to
pull a target arm for a target contextual vector. The target arm and target
contextual vector are both chosen by the attacker. That is, the attacker can
hijack the behavior of a contextual bandit. We also investigate the feasibility
and the side effects of such attacks, and identify future directions for
defense. Experiments on both synthetic and real-world data demonstrate the
efficiency of the attack algorithm.
Comment: GameSec 201
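The attack idea can be sketched in a simplified linear-bandit setting: with ridge-regression reward estimates, each arm's predicted reward at a context is linear in that arm's historical rewards, so forcing the target arm to look best reduces to projecting the rewards onto a halfspace. This is a sketch under those assumptions, not the paper's full convex-optimization framework; `poison_rewards` and its greedy per-arm projection are illustrative.

```python
import numpy as np

# Sketch of reward poisoning against a linear contextual bandit. Rewards are
# estimated per arm by ridge regression, so arm a's predicted reward at a
# context x_star is LINEAR in its historical rewards y[a]. Making the target
# arm beat arm a by a margin eps is then one linear inequality in the
# perturbation, and the minimum-norm fix is a closed-form halfspace projection.

def poison_rewards(X, y, x_star, target_arm, lam=1.0, eps=0.1):
    """Return minimally perturbed reward vectors (greedy, one constraint per
    non-target arm) forcing `target_arm` to have the highest ridge prediction
    at context `x_star`."""
    d = X[0].shape[1]
    # c[a] maps arm a's rewards to its prediction: pred_a = c[a] @ y[a]
    c = [Xa @ np.linalg.solve(Xa.T @ Xa + lam * np.eye(d), x_star) for Xa in X]
    y_new = [ya.astype(float).copy() for ya in y]
    for a in range(len(X)):
        if a == target_arm:
            continue
        # Violation of the constraint: pred_target >= pred_a + eps
        v = c[a] @ y_new[a] + eps - c[target_arm] @ y_new[target_arm]
        if v > 0:
            # Min-norm correction split across the two arms' reward vectors;
            # raising pred_target only helps the already-handled constraints.
            denom = c[target_arm] @ c[target_arm] + c[a] @ c[a]
            y_new[target_arm] += v * c[target_arm] / denom
            y_new[a] -= v * c[a] / denom
    return y_new
```

Each projection raises the target arm's prediction and lowers the competitor's, so constraints fixed earlier in the loop stay satisfied; a joint QP (as in a convex-optimization formulation) would give the globally minimal perturbation.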
Parameter-Free Online Convex Optimization with Sub-Exponential Noise
We consider the problem of unconstrained online convex optimization (OCO)
with sub-exponential noise, a strictly more general problem than the standard
OCO. In this setting, the learner receives a subgradient of the loss functions
corrupted by sub-exponential noise and strives to achieve optimal regret
guarantee, without knowledge of the competitor norm, i.e., in a parameter-free
way. Recently, Cutkosky and Boahen (COLT 2017) proved that, given unbounded
subgradients, it is impossible to guarantee a sublinear regret due to an
exponential penalty. This paper shows that it is possible to go around the
lower bound by allowing the observed subgradients to be unbounded via
stochastic noise. However, the presence of unbounded noise in unconstrained OCO
is challenging; existing algorithms do not provide near-optimal regret bounds
or fail to have a guarantee. So, we design a novel parameter-free OCO algorithm
for Banach space, which we call BANCO, via a reduction to betting on noisy
coins. We show that BANCO achieves the optimal regret rate in our problem.
Finally, we show the application of our results to obtain a parameter-free
locally private stochastic subgradient descent algorithm, and the connection to
the law of iterated logarithms.
Comment: v1: Accepted to COLT'19; v2: adjusted Theorem 3, w_t closed-form
solution, and typos
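The reduction from online convex optimization to betting on coins can be sketched in one dimension: feed the negated subgradient to a KT bettor as a coin outcome and play the bettor's wager as the iterate. This is the generic noiseless, bounded-subgradient special case, not BANCO itself (which additionally handles sub-exponential noise in Banach spaces); the function name `parameter_free_sgd` is illustrative.

```python
# Sketch of the coin-betting reduction behind parameter-free online learning:
# coin outcome c_t = -g_t, prediction w_t = current KT wager. No learning
# rate and no bound on the competitor norm are required.

def parameter_free_sgd(subgradient, rounds, initial_wealth=1.0):
    """1-D parameter-free online (sub)gradient method via KT coin betting."""
    wealth = initial_wealth
    coin_sum = 0.0
    iterates = []
    for t in range(1, rounds + 1):
        w = (coin_sum / t) * wealth   # KT bet doubles as the prediction
        iterates.append(w)
        g = subgradient(w)            # assumed in [-1, 1] in this sketch
        wealth -= g * w               # wealth update with coin c_t = -g_t
        coin_sum -= g
    return iterates

# Minimizing f(w) = |w - 10| with no step-size tuning: the average iterate
# approaches the minimizer at 10.
iters = parameter_free_sgd(lambda w: 1.0 if w > 10 else -1.0, 5000)
avg = sum(iters) / len(iters)
```

The wealth stays positive because the betting fraction has magnitude below one, and the standard regret bound for this reduction scales with the competitor norm times $\sqrt{T\log T}$ without that norm being known in advance.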
Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
We study $K$-armed bandit problems where the reward distributions of the arms
are all supported on the $[0,1]$ interval. It has been a challenge to design
regret-efficient randomized exploration algorithms in this setting. Maillard
sampling~\cite{maillard13apprentissage}, an attractive alternative to Thompson
sampling, has recently been shown to achieve competitive regret guarantees in
the sub-Gaussian reward setting~\cite{bian2022maillard} while maintaining
closed-form action probabilities, which is useful for offline policy
evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling
(KL-MS) algorithm, a natural extension of Maillard sampling that achieves a
KL-style gap-dependent regret bound. We show that KL-MS is asymptotically
optimal when the rewards are Bernoulli and has a worst-case regret bound of
the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the
expected reward of the optimal arm and $T$ is the time horizon length
- …
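The closed-form action probabilities mentioned in the abstract can be sketched as follows: each arm is played with probability decaying exponentially in its pull count times the binary KL divergence between its empirical mean and the best empirical mean. This is our reading of the Maillard-sampling line of work, not the paper's exact specification, which may differ in details; `kl_ms_probabilities` and `binary_kl` are illustrative names.

```python
import math

# Sketch of a KL-Maillard-style sampling rule with closed-form action
# probabilities (an assumption based on the abstract, not the paper's exact
# rule): weight each arm by exp(-N_a * kl(mu_a, mu_max)) and normalize.

def binary_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ms_probabilities(means, counts):
    """Closed-form action probabilities from empirical means and pull counts."""
    best = max(means)
    weights = [math.exp(-n * binary_kl(m, best)) for m, n in zip(means, counts)]
    total = sum(weights)
    return [w / total for w in weights]

probs = kl_ms_probabilities([0.8, 0.5, 0.3], [50, 50, 50])
```

Having the sampling probabilities in closed form is what makes such algorithms convenient for offline policy evaluation: the logged probabilities can be used directly as importance weights.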