Agnostic System Identification for Model-Based Reinforcement Learning
A fundamental problem in control is to learn a model of a system from
observations that is useful for controller synthesis. To provide good
performance guarantees, existing methods must assume that the real system is in
the class of models considered during learning. We present an iterative method
with strong guarantees even in the agnostic case where the system is not in the
class. In particular, we show that any no-regret online learning algorithm can
be used to obtain a near-optimal policy, provided some model achieves low
training error and access to a good exploration distribution. Our approach
applies to both discrete and continuous domains. We demonstrate its efficacy
and scalability on a challenging helicopter domain from the literature.
Comment: 8 pages, published in ICML 2012.
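The iterative scheme the abstract describes (roll out the current policy, fit the model with an online learner, re-plan against the fitted model) can be illustrated on a toy scalar linear system. This is a hypothetical sketch, not the paper's algorithm: the function name `iterative_sysid`, the linear model class, and the certainty-equivalent policy with added noise (standing in for the "good exploration distribution") are all assumptions for illustration.

```python
import numpy as np

def iterative_sysid(true_step, horizon=20, iters=100, lr=0.05, explore=0.5):
    """Toy sketch: roll out the current policy (plus exploration noise),
    update a linear dynamics model with an online gradient learner on
    each observed transition, and re-derive the policy from the model."""
    rng = np.random.default_rng(0)
    a_hat, b_hat = 0.0, 1.0            # model guess: x' ~ a_hat*x + b_hat*u
    for _ in range(iters):
        x = 1.0
        for _ in range(horizon):
            # certainty-equivalent policy: drive the predicted next state
            # toward 0, plus exploration noise
            u = float(np.clip(-a_hat * x / b_hat, -2.0, 2.0))
            u += explore * rng.standard_normal()
            x_next = true_step(x, u)
            # online least-squares update on the observed transition
            err = (a_hat * x + b_hat * u) - x_next
            a_hat -= lr * err * x
            b_hat -= lr * err * u
            x = x_next
    return a_hat, b_hat
```

On a true system x' = 0.8x + 0.5u, the repeated rollout-and-refit loop recovers both coefficients even though the learner only ever sees data generated under its own (changing) policy.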
The Utility of Abstaining in Binary Classification
We explore the problem of binary classification in machine learning, with a
twist - the classifier is allowed to abstain on any datum, professing ignorance
about the true class label without committing to any prediction. This is
directly motivated by applications like medical diagnosis and fraud risk
assessment, in which incorrect predictions have potentially calamitous
consequences. We focus on a recent spate of theoretically driven work in this
area that characterizes how allowing abstentions can lead to fewer errors in
very general settings. Two areas are highlighted: the surprising possibility of
zero-error learning, and the fundamental tradeoff between predicting
sufficiently often and avoiding incorrect predictions. We review efficient
algorithms with provable guarantees for each of these areas. We also discuss
connections to other scenarios, notably active learning, as they suggest
promising directions of further inquiry in this emerging field.
Comment: Short survey.
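The tradeoff the survey highlights, predicting often enough while avoiding incorrect predictions, can be made concrete with a minimal confidence-threshold (Chow-style) abstention rule. This is a generic illustration, not an algorithm from the survey; the function name and the threshold value are assumptions.

```python
import numpy as np

def predict_with_abstention(scores, threshold=0.75):
    """Binary prediction with abstention: predict a class only when its
    posterior probability clears the threshold; otherwise return None,
    professing ignorance rather than committing to a guess."""
    p1 = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))  # P(y=1|x)
    out = []
    for p in p1:
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            out.append(1 if p > 0.5 else 0)
        else:
            out.append(None)   # abstain on low-confidence inputs
    return out
```

Raising the threshold trades coverage (how often a prediction is made) for error: in the limit, abstaining on every uncertain input is what makes zero-error learning conceivable in the settings the survey reviews.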
Bandits with Temporal Stochastic Constraints
We study the effect of impairment on stochastic multi-armed bandits and
develop new ways to mitigate it. The impairment effect is the phenomenon where
an agent accrues reward for an action only if it has played that action at
least a few times in the recent past. It is practically motivated by repetition and recency
effects in domains such as advertising (here consumer behavior may require
repeat actions by advertisers) and vocational training (here actions are
complex skills that can only be mastered with repetition to get a payoff).
Impairment can be naturally modelled as a temporal constraint on the strategy
space, and we provide two novel algorithms that achieve sublinear regret, each
working with different assumptions on the impairment effect. We introduce a new
notion called bucketing in our algorithm design, and show how it can
effectively address impairment as well as a broader class of temporal
constraints. Our regret bounds explicitly capture the cost of impairment and
show that it scales (sub-)linearly with the degree of impairment. Our work
complements recent work on modeling delays and corruptions, and we provide
experimental evidence supporting our claims.
Comment: An extended abstract appeared in the 4th Multi-disciplinary
Conference on Reinforcement Learning and Decision Making (RLDM 2019).
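The bucketing idea mentioned above can be sketched with a toy bucketed UCB loop: each chosen arm is pulled in a short bucket of consecutive rounds so the recent-play requirement is met and reward actually accrues. This is a hypothetical simplification, not the paper's algorithm; the impairment model (reward accrues only from the `min_recent`-th pull of a bucket onward) and all parameter values are assumptions.

```python
import math
import random

def bucketed_ucb(means, horizon=5000, bucket=5, min_recent=3, seed=0):
    """Toy bucketed UCB under an impairment constraint: pulls of an arm
    accrue reward only once the arm has been played min_recent times
    within the current bucket of consecutive rounds."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k   # statistics over accruing pulls
    t, total = 0, 0.0
    while t < horizon:
        if 0 in counts:                 # cold start: try each arm once
            arm = counts.index(0)
        else:                           # standard UCB index per arm
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t + 1) / counts[a]))
        recent = 0
        for _ in range(bucket):         # commit to the arm for a bucket
            if t >= horizon:
                break
            recent += 1
            r = 1.0 if rng.random() < means[arm] else 0.0
            if recent >= min_recent:    # impairment: early pulls pay nothing
                sums[arm] += r
                counts[arm] += 1
                total += r
            t += 1
    return total
```

Because each bucket wastes its first `min_recent - 1` rounds, the accrued reward is capped well below the unconstrained optimum, which mirrors how the paper's regret bounds explicitly charge for the degree of impairment.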
Learning by Repetition: Stochastic Multi-armed Bandits under Priming Effect
We study the effect of persistence of engagement on learning in a stochastic
multi-armed bandit setting. In advertising and recommendation systems, the
repetition effect includes a wear-in period, where the user's propensity to
reward the platform via a click or purchase depends on how frequently they see
the recommendation in the recent past. It also includes a counteracting
wear-out period, where the user's propensity to respond positively is dampened
if the recommendation was shown too many times recently. The priming effect can
be naturally modelled as a temporal constraint on the strategy space, since the
reward for the current action depends on historical actions taken by the
platform. We provide novel algorithms that achieve regret sublinear in time
and in the relevant wear-in/wear-out parameters. The effect of priming on the
regret upper bound is additive, and we recover guarantees that match
popular algorithms such as UCB1 and Thompson sampling when there is no
priming effect. Our work complements recent work on modeling time varying
rewards, delays and corruptions in bandits, and extends the usage of rich
behavior models in sequential decision making settings.
Comment: Appears in the 36th Conference on Uncertainty in Artificial
Intelligence (UAI 2020).
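The wear-in/wear-out structure described above can be captured by a small reward model in which the payoff for an action depends on how often it appeared in a recent sliding window. This is a hypothetical model for illustration only; the function name, window size, and threshold values are assumptions, not the paper's parametrization.

```python
def primed_reward(base_reward, history, arm, window=10, wear_in=2, wear_out=6):
    """Toy priming model: the platform earns the base reward for pulling
    `arm` only when the number of its pulls inside the recent window is
    at least wear_in (engagement has built up) and below wear_out
    (the user is not yet fatigued)."""
    recent = sum(1 for a in history[-window:] if a == arm)
    if wear_in <= recent < wear_out:
        return base_reward   # engaged regime: wear-in met, not worn out
    return 0.0               # too few repeats (wear-in) or too many (wear-out)
```

The reward for the current pull depends on the platform's own past pulls, which is exactly why priming acts as a temporal constraint on the strategy space rather than a change to the arms' base means.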
Active Online Learning with Hidden Shifting Domains
Online machine learning systems need to adapt to domain shifts. Meanwhile,
acquiring a label at every timestep is expensive. We propose a surprisingly
simple algorithm that adaptively balances its regret and its number of label
queries in settings where the data streams are from a mixture of hidden
domains. For online linear regression with oblivious adversaries, we provide a
tight tradeoff that depends on the durations and dimensionalities of the hidden
domains. Our algorithm can adaptively deal with interleaving spans of inputs
from different domains. We also generalize our results to non-linear regression
for hypothesis classes with bounded eluder dimension and adaptive adversaries.
Experiments on synthetic and realistic datasets demonstrate that our algorithm
achieves lower regret than uniform and greedy query strategies with an equal
labeling budget.
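One standard way to balance regret against label queries, in the spirit of the abstract, is online ridge regression that requests a label only when the incoming input has high leverage under the current covariance, i.e., lies in a direction the learner has not yet seen. This is a generic selective-sampling sketch, not the paper's algorithm; the function name and the thresholds are assumptions.

```python
import numpy as np

def active_online_regression(X, y, tau=0.5, lam=1.0):
    """Online ridge regression with selective label queries: predict on
    every input, but pay for the label (and update the model) only when
    the input's leverage x^T A^{-1} x exceeds tau."""
    d = X.shape[1]
    A = lam * np.eye(d)            # regularized empirical covariance
    b = np.zeros(d)
    queries = 0
    preds = []
    for x_t, y_t in zip(X, y):
        A_inv = np.linalg.inv(A)
        w = A_inv @ b              # current ridge estimate
        preds.append(float(w @ x_t))
        if x_t @ A_inv @ x_t > tau:   # informative direction: query label
            queries += 1
            A += np.outer(x_t, x_t)
            b += y_t * x_t
    return np.array(preds), queries
```

On a stream that alternates between two coordinate directions, the learner queries once per direction and then predicts the remaining points without further labels, illustrating how queries concentrate on newly seen (sub)domains.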
24th Annual Conference on Learning Theory
Agnostic KWIK learning and efficient approximate reinforcement learning
A popular approach in reinforcement learning is to use a model-based algorithm, i.e., an algorithm that utilizes a model learner to learn an approximate model of the environment. It has been shown that such a model-based learner is efficient if the model learner is efficient in the so-called “knows what it knows” (KWIK) framework. A major limitation of the standard KWIK framework is that, by its very definition, it covers only the case when the (model) learner can represent the actual environment with no errors. In this paper, we study the agnostic KWIK learning model, where we relax this assumption by allowing nonzero approximation errors. We show that with the new definition an efficient model learner still leads to an efficient reinforcement learning algorithm. At the same time, though, we find that learning within the new framework can be substantially slower than in the standard framework, even for simple learning problems.
Keywords: KWIK learning, agnostic learning, reinforcement learning, PAC-MDP
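The KWIK protocol the abstract builds on can be illustrated with the simplest possible KWIK learner: for a deterministic function on a finite input set, memorize observed values and answer "I don't know" (here `None`) on unseen inputs. This is a textbook-style sketch of the standard (non-agnostic) protocol, for intuition only; the function name is an assumption.

```python
def kwik_memorizer():
    """Minimal KWIK learner for a deterministic function on a finite
    domain: it must either predict correctly or admit ignorance (None),
    after which it observes the true label. The number of None answers
    is its KWIK bound, here at most the domain size."""
    table = {}
    def predict(x):
        return table.get(x)    # None means "I don't know"
    def observe(x, y):
        table[x] = y           # after admitting ignorance, memorize
    return predict, observe
```

The standard framework requires every committed prediction to be exactly right, which is only possible when the environment is representable without error; the agnostic relaxation studied in the paper allows committed predictions to be off by a bounded approximation error, at the cost of potentially much slower learning.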