Incentivizing Exploration with Selective Data Disclosure
We study the design of rating systems that incentivize (more) efficient
social learning among self-interested agents. Agents arrive sequentially and
are presented with a set of possible actions, each of which yields a positive
reward with an unknown probability. A disclosure policy sends messages about
the rewards of previously chosen actions to arriving agents. These messages can
alter agents' incentives towards exploration, i.e., towards taking potentially
sub-optimal actions for the sake of learning more about their rewards. Prior
work has made considerable progress with disclosure policies that merely
recommend an action to each user, but it relies heavily on standard, yet very
strong, rationality assumptions.
We study a particular class of disclosure policies that use messages, called
unbiased subhistories, consisting of the actions and rewards from a subsequence
of past agents. Each subsequence is chosen ahead of time, according to a
predetermined partial order on the rounds. We posit a flexible model of
frequentist agent response, which we argue is plausible for this class of
"order-based" disclosure policies. We measure the success of a policy by its
regret, i.e., the total difference, over all rounds, between the expected reward of
the best action and the reward induced by the policy. A disclosure policy that
reveals full history in each round risks inducing herding behavior among the
agents, and typically has regret linear in the time horizon $T$. Our main
result is an order-based disclosure policy that obtains regret
$\tilde{O}(\sqrt{T})$. This regret is known to be optimal in the worst case
over reward distributions, even absent incentives. We also exhibit simpler
order-based policies with higher, but still sublinear, regret. These policies
can be interpreted as dividing a sublinear number of agents into constant-sized
focus groups, whose histories are then revealed to future agents.
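To make the focus-group interpretation concrete, the following toy simulation
(our illustration, not the paper's exact order-based construction) splits
early agents into constant-sized focus groups that cover each arm, then lets
all later agents respond greedily to the pooled focus-group histories; the arm
count, group sizes, and Bernoulli rewards are assumptions of the sketch.

    import random

    def simulate(K=3, T=2000, group_size=5, num_groups=42, seed=0):
        """Toy 'focus group' disclosure policy: a sketch, not the paper's
        exact order-based construction."""
        rng = random.Random(seed)
        mu = [rng.random() for _ in range(K)]      # unknown Bernoulli means
        counts, sums = [0] * K, [0] * K
        # Phase 1: focus groups explore; group g is assigned arm g mod K.
        for g in range(num_groups):
            arm = g % K
            for _ in range(group_size):
                counts[arm] += 1
                sums[arm] += int(rng.random() < mu[arm])
        # Phase 2: later agents see the pooled focus-group history and play
        # the arm with the best empirical mean (frequentist greedy response).
        realized = sum(sums)
        for _ in range(T - num_groups * group_size):
            best = max(range(K), key=lambda a: sums[a] / counts[a])
            realized += int(rng.random() < mu[best])
        return max(mu) * T - realized              # crude ex-post regret proxy

    print(simulate())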
Incentivizing Exploration with Heterogeneous Value of Money
Recently, Frazier et al. proposed a natural model for crowdsourced
exploration of different a priori unknown options: a principal is interested in
the long-term welfare of a population of agents who arrive one by one in a
multi-armed bandit setting. However, each agent is myopic, so in order to
incentivize him to explore options with better long-term prospects, the
principal must offer the agent money. Frazier et al. showed that a simple
class of policies, called time-expanded policies, is optimal in the worst
case, and characterized their budget-reward tradeoff.
The previous work assumed that all agents are equally and uniformly
susceptible to financial incentives. In reality, agents may have different
utility for money. We therefore extend the model of Frazier et al. to allow
agents that have heterogeneous and non-linear utilities for money. The
principal is informed of the agent's tradeoff via a signal that could be more
or less informative.
Our main result shows that a convex program can be used to derive a
signal-dependent time-expanded policy that achieves the best possible
Lagrangian reward in the worst case. The worst-case guarantee is matched by
so-called "Diamonds in the Rough" instances; the proof that the guarantees
match is based on showing that two different convex programs have the same
optimal solution for these specific instances. These results also extend to the
budgeted case as in Frazier et al. We also show that the optimal policy is
monotone with respect to information, i.e., the approximation ratio of the
optimal policy improves as the signals become more informative.
Comment: WINE 201
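To illustrate just the per-round incentive calculation behind such payments
(the paper's signal-dependent policy comes from a convex program that we do
not reproduce): with a non-linear money utility u, the principal must pay the
smallest c with u(c) at least the expected-reward gap between the myopic arm
and the arm to be explored. The bisection below inverts an increasing u; the
example utility and gap are assumptions.

    import math

    def min_payment(gap, u, hi=1.0, tol=1e-9):
        """Smallest payment c with u(c) >= gap, found by bisection. Only the
        per-round incentive step; not the paper's convex program."""
        if gap <= 0:
            return 0.0                 # agent already prefers the arm
        while u(hi) < gap:             # grow the bracket until it covers gap
            hi *= 2
        lo = 0.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if u(mid) >= gap:
                hi = mid
            else:
                lo = mid
        return hi

    # Concave utility u(c) = sqrt(c): a reward gap of 0.3 requires c = 0.09.
    print(min_payment(0.3, math.sqrt))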
Incentivizing Exploration with Linear Contexts and Combinatorial Actions
We advance the study of incentivized bandit exploration, in which arm choices
are viewed as recommendations and are required to be Bayesian incentive
compatible. Recent work has shown, under certain independence assumptions,
that after collecting enough initial samples, the popular Thompson sampling
algorithm becomes incentive compatible. We give an analog of this result for
linear bandits, where the independence of the prior is replaced by a natural
convexity condition. This opens up the possibility of efficient and
regret-optimal incentivized exploration in high-dimensional action spaces. In
the semibandit model, we also improve the sample complexity for the
pre-Thompson sampling phase of initial data collection.
Comment: International Conference on Machine Learning (ICML) 202
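A minimal sketch of this two-phase scheme (our paraphrase; neither the
authors' algorithm nor their convexity condition is reproduced): round-robin
initial data collection followed by Gaussian-posterior Thompson sampling in a
linear bandit. The arm features, noise scale, and warmup length are
assumptions.

    import numpy as np

    def linear_thompson(arms, theta_star, T=1000, warmup=50, sigma=0.1, seed=0):
        """Warmup round-robin sampling, then linear Thompson sampling from a
        Gaussian posterior. A sketch, not the paper's algorithm."""
        rng = np.random.default_rng(seed)
        K, d = arms.shape
        V, b = np.eye(d), np.zeros(d)  # regularized design matrix, response
        total = 0.0
        for t in range(T):
            if t < warmup:             # pre-Thompson initial data collection
                k = t % K
            else:                      # sample parameter, act greedily on it
                mean = np.linalg.solve(V, b)
                theta = rng.multivariate_normal(mean, sigma**2 * np.linalg.inv(V))
                k = int(np.argmax(arms @ theta))
            x = arms[k]
            r = x @ theta_star + sigma * rng.standard_normal()
            V += np.outer(x, x)
            b += r * x
            total += r
        return total

    arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
    print(linear_thompson(arms, theta_star=np.array([0.2, 0.8])))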
Learning User Preferences to Incentivize Exploration in the Sharing Economy
We study platforms in the sharing economy and discuss the need for
incentivizing users to explore options that otherwise would not be chosen. For
instance, rental platforms such as Airbnb typically rely on customer reviews to
provide users with relevant information about different options. Yet, often a
large fraction of options does not have any reviews available. Such options are
frequently neglected as viable choices, and in turn are unlikely to be
evaluated, creating a vicious cycle. Platforms can encourage users to deviate from
their preferred choice by offering monetary incentives for choosing a different
option instead. To efficiently learn the optimal incentives to offer, we
consider structural information in user preferences and introduce a novel
algorithm - Coordinated Online Learning (CoOL) - for learning with structural
information modeled as convex constraints. We provide formal guarantees on the
performance of our algorithm and test the viability of our approach in a user
study with data of apartments on Airbnb. Our findings suggest that our approach
is well-suited to learn appropriate incentives and increase exploration on the
investigated platform.
Comment: Longer version of AAAI'18 paper. arXiv admin note: text overlap with
arXiv:1702.0284
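The abstract does not spell out CoOL's updates, so as a generic illustration
of learning with structural information modeled as convex constraints, here
is projected online gradient descent with a monotonicity constraint on the
incentive vector; the constraint, the losses, and the step size are all
assumptions.

    import numpy as np

    def projected_ogd(grad_fns, project, x0, eta=0.1):
        """Projected online gradient descent: gradient step, then project
        back onto the convex constraint set. A generic sketch, not CoOL."""
        x = np.asarray(x0, dtype=float)
        for g in grad_fns:
            x = project(x - eta * g(x))
        return x

    def project_monotone(v):
        """Enforce non-decreasing entries (crude forward pass; the exact
        Euclidean projection would use pool-adjacent-violators)."""
        out = v.copy()
        for i in range(1, len(out)):
            out[i] = max(out[i], out[i - 1])
        return out

    # Squared losses pulling toward a hypothetical target incentive vector.
    target = np.array([0.3, 0.2, 0.5])
    grads = [lambda x: 2 * (x - target)] * 20
    print(projected_ogd(grads, project_monotone, x0=np.zeros(3)))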
Incentivized Exploration for Multi-Armed Bandits under Reward Drift
We study incentivized exploration for the multi-armed bandit (MAB) problem
where the players receive compensation for exploring arms other than the greedy
choice and may provide biased feedback on reward. We seek to understand the
impact of this drifted reward feedback by analyzing the performance of three
instantiations of the incentivized MAB algorithm: UCB, $\varepsilon$-Greedy,
and Thompson Sampling. Our results show that they all achieve $O(\log T)$
regret and compensation under the drifted reward, and are therefore
effective in incentivizing exploration. Numerical examples are provided to
complement the theoretical analysis.
Comment: 10 pages, 2 figures, AAAI 202
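As a toy rendering of this setting (the compensation rule and drift model
below are our simplifications, not the paper's): agents are myopic, the
principal pays the empirical-mean gap whenever UCB's choice differs from the
greedy one, and the paid agent's reported reward drifts upward by the payment.

    import math
    import random

    def incentivized_ucb(mu, T=5000, seed=0):
        """Toy incentivized UCB under reward drift: a sketch of the setting,
        not the paper's exact algorithm or analysis."""
        rng = random.Random(seed)
        K = len(mu)
        n, s = [0] * K, [0.0] * K      # pulls and (drifted) reward sums
        comp = 0.0
        for t in range(1, T + 1):
            if t <= K:
                k, pay = t - 1, 0.0    # play each arm once
            else:
                ucb = [s[a] / n[a] + math.sqrt(2 * math.log(t) / n[a])
                       for a in range(K)]
                k = max(range(K), key=lambda a: ucb[a])
                greedy = max(range(K), key=lambda a: s[a] / n[a])
                pay = max(0.0, s[greedy] / n[greedy] - s[k] / n[k])
            r = int(rng.random() < mu[k])
            s[k] += r + pay            # feedback biased (drifted) by payment
            n[k] += 1
            comp += pay
        return comp                    # total compensation paid

    print(incentivized_ucb([0.4, 0.5, 0.6]))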
Flow-based Intrinsic Curiosity Module
In this paper, we focus on a prediction-based novelty estimation strategy
built on the deep reinforcement learning (DRL) framework, and present a
flow-based
intrinsic curiosity module (FICM) to exploit the prediction errors from optical
flow estimation as exploration bonuses. We propose the concept of leveraging
motion features captured between consecutive observations to evaluate the
novelty of observations in an environment. FICM encourages a DRL agent to
explore observations with unfamiliar motion features, and requires only two
consecutive frames to obtain sufficient information when estimating the
novelty. We evaluate our method and compare it with a number of existing
methods on multiple benchmark environments, including Atari games, Super Mario
Bros., and ViZDoom. We demonstrate that FICM is especially effective in tasks
or environments featuring moving objects, which allow FICM to exploit the
motion features between consecutive observations. We further analyze the
encoding efficiency of FICM through ablations, and discuss its applicable
domains comprehensively.
Comment: The SOLE copyright holder is IJCAI (International Joint Conferences
on Artificial Intelligence), all rights reserved. The link is provided as
follows: https://www.ijcai.org/Proceedings/2020/28
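To illustrate the core computation under strong simplifications (grayscale
frames, nearest-neighbor warping, and a stand-in predict_flow in place of
FICM's learned flow networks), the bonus below is the photometric error that
remains after warping the previous frame with the predicted flow.

    import numpy as np

    def ficm_bonus(frame_prev, frame_next, predict_flow):
        """Exploration bonus in the spirit of FICM: warp the previous frame
        with the predicted flow and score the remaining photometric error.
        predict_flow stands in for the learned flow estimator."""
        flow = predict_flow(frame_prev, frame_next)     # (H, W, 2) offsets
        H, W = frame_prev.shape
        ys, xs = np.mgrid[0:H, 0:W]
        src_y = np.clip(ys + flow[..., 0], 0, H - 1).astype(int)
        src_x = np.clip(xs + flow[..., 1], 0, W - 1).astype(int)
        warped = frame_prev[src_y, src_x]               # nearest-neighbor warp
        return float(np.mean((warped - frame_next) ** 2))

    # Untrained stand-in predicting zero motion: any moving object then
    # yields a large bonus, which is what drives exploration toward motion.
    zero_flow = lambda a, b: np.zeros(a.shape + (2,))
    f0 = np.zeros((8, 8))
    f1 = np.zeros((8, 8))
    f1[2:4, 2:4] = 1.0
    print(ficm_bonus(f0, f1, zero_flow))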