Incentivizing Exploration with Selective Data Disclosure
We study the design of rating systems that incentivize (more) efficient
social learning among self-interested agents. Agents arrive sequentially and
are presented with a set of possible actions, each of which yields a positive
reward with an unknown probability. A disclosure policy sends messages about
the rewards of previously-chosen actions to arriving agents. These messages can
alter agents' incentives towards exploration, i.e., towards taking
potentially sub-optimal actions for the sake of learning more about their
rewards. Prior work has made substantial progress with disclosure policies
that merely recommend an action to each user, but relies heavily on
standard, yet very strong, rationality assumptions.
We study a particular class of disclosure policies that use messages, called
unbiased subhistories, consisting of the actions and rewards from a subsequence
of past agents. Each subsequence is chosen ahead of time, according to a
predetermined partial order on the rounds. We posit a flexible model of
frequentist agent response, which we argue is plausible for this class of
"order-based" disclosure policies. We measure the success of a policy by its
regret, i.e., the difference, over all rounds, between the expected reward of
the best action and the reward induced by the policy. A disclosure policy that
reveals full history in each round risks inducing herding behavior among the
agents, and typically has regret linear in the time horizon $T$. Our main
result is an order-based disclosure policy that obtains regret
$\tilde{O}(\sqrt{T})$. This regret is known to be optimal in the worst case
over reward distributions, even absent incentives. We also exhibit simpler
order-based policies with higher, but still sublinear, regret. These policies
can be interpreted as dividing a sublinear number of agents into constant-sized
focus groups, whose histories are then revealed to future agents.
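
As a rough illustration of this focus-group interpretation, here is a minimal
Python sketch (the function name and parameters are ours, not the paper's): a
handful of constant-sized focus groups generate an unbiased subhistory, and
every later agent best-responds to it with a simple frequentist rule. It
simulates the mechanic only; the paper's regret guarantee requires a
carefully chosen partial order and group sizes.

    import math, random

    def simulate_focus_groups(T, reward_probs, group_size=5, seed=0):
        # Toy simulation: the first K * group_size agents form constant-sized
        # focus groups, one per action; their (action, reward) pairs are the
        # unbiased subhistory revealed to every later agent, who then picks
        # the empirically best action (a simple frequentist response).
        rng = random.Random(seed)
        K = len(reward_probs)
        history = {a: [] for a in range(K)}
        total = 0.0
        for t in range(T):
            if t < K * group_size:
                a = t % K                      # focus-group round-robin
            else:
                a = max(range(K),
                        key=lambda i: sum(history[i]) / len(history[i]))
            reward = 1.0 if rng.random() < reward_probs[a] else 0.0
            history[a].append(reward)
            total += reward
        return max(reward_probs) * T - total   # realized regret

    print(simulate_focus_groups(10_000, [0.6, 0.5, 0.4]))

Note that with a single small batch of focus groups, later agents may all
herd on an empirically (but not actually) best action, which is exactly why
the number and sizing of groups matter for sublinear regret.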
Optimal Algorithm for Bayesian Incentive-Compatible Exploration
We consider a social planner faced with a stream of myopic selfish agents.
The goal of the social planner is to maximize social welfare; however, she
is limited to using only information asymmetry (regarding previous outcomes)
and cannot use any monetary incentives. The planner recommends actions to
agents, but her recommendations need to be Bayesian Incentive Compatible to be
followed by the agents. Our main result is an optimal algorithm for the
planner, in the case where the action realizations are deterministic and have
limited support, making significant progress on this open problem.
Our optimal protocol has two interesting features. First, it always completes
the exploration of a priori more beneficial actions before exploring a priori
less beneficial actions. Second, the randomization in the protocol is
correlated across agents and actions (and not independent at each decision
time).
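
The second feature can be illustrated with a small hedged sketch, assuming a
toy setting with one exploitation action and a fixed budget of exploration
steps (the function and argument names are hypothetical, and this is not the
paper's optimal protocol): drawing the exploration slots once, without
replacement, correlates the randomization across agents, so receiving a
recommendation by itself does not reveal whether the agent is exploring.

    import random

    def recommendation_schedule(n_agents, n_explore, exploit_action,
                                explore_action, seed=0):
        # Correlated randomization: draw all exploration slots up front,
        # without replacement, instead of flipping an independent coin for
        # each agent. Any single agent's recommendation is then
        # uninformative about whether it is an exploration step.
        rng = random.Random(seed)
        slots = set(rng.sample(range(n_agents), n_explore))
        return [explore_action if t in slots else exploit_action
                for t in range(n_agents)]

    print(recommendation_schedule(20, 2, 'A', 'B'))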
The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity
We consider incentivized exploration: a version of multi-armed bandits where
the choice of arms is controlled by self-interested agents, and the algorithm
can only issue recommendations. The algorithm controls the flow of information,
and the information asymmetry can incentivize the agents to explore. Prior work
achieves optimal regret rates up to multiplicative factors that become
arbitrarily large depending on the Bayesian priors, and scale exponentially in
the number of arms. A more basic problem of sampling each arm once runs into
similar factors.
We focus on the price of incentives: the loss in performance, broadly
construed, incurred for the sake of incentive-compatibility. We prove that
Thompson Sampling, a standard bandit algorithm, is incentive-compatible if
initialized with sufficiently many data points. The performance loss due to
incentives is therefore limited to the initial rounds when these data points
are collected. The problem is largely reduced to that of sample complexity: how
many rounds are needed? We address this question, providing matching upper and
lower bounds and instantiating them in various corollaries. Typically, the
optimal sample complexity is polynomial in the number of arms and exponential
in the "strength of beliefs"
Competing Bandits: The Perils of Exploration Under Competition
Most online platforms strive to learn from interactions with users, and many
engage in exploration: making potentially suboptimal choices for the sake of
acquiring new information. We study the interplay between exploration and
competition: how such platforms balance the exploration for learning and the
competition for users. Here users play three distinct roles: they are customers
that generate revenue, they are sources of data for learning, and they are
self-interested agents that choose among the competing platforms.
We consider a stylized duopoly model in which two firms face the same
multi-armed bandit problem. Users arrive one by one and choose between the two
firms, so that each firm makes progress on its bandit problem only if it is
chosen. Through a mix of theoretical results and numerical simulations, we
study whether and to what extent competition incentivizes the adoption of
better bandit algorithms, and whether it leads to welfare increases for users.
We find that stark competition induces firms to commit to a "greedy" bandit
algorithm that leads to low welfare. However, weakening competition by
providing firms with some "free" users incentivizes better exploration
strategies and increases welfare. We investigate two channels for weakening the
competition: relaxing the rationality of users and giving one firm a
first-mover advantage. Our findings are closely related to the "competition vs.
innovation" relationship, and elucidate the first-mover advantage in the
digital economy.
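
The duopoly model lends itself to a toy simulation. The following Python
sketch is an illustrative approximation, not the paper's exact model (the
reputation window, tie-breaking rule, and epsilon value are assumptions of
ours): users join the firm with the better trailing average reward, so a
firm that explores pays a short-run reputational cost.

    import random

    def duopoly(T, reward_probs, algos, memory=50, seed=0):
        # Toy duopoly: two firms face the same Bernoulli arms; each arriving
        # user joins the firm with the better trailing average reward (its
        # "reputation"), and only that firm's bandit algorithm advances.
        # algos[f] is 'greedy' or 'eps_greedy' (hypothetical labels).
        rng = random.Random(seed)
        K = len(reward_probs)
        stats = [[[0.0, 0] for _ in range(K)] for _ in range(2)]  # [sum, pulls]
        recent = [[], []]
        for _ in range(T):
            rep = [sum(r[-memory:]) / max(len(r[-memory:]), 1)
                   for r in recent]
            firm = (rng.choice([0, 1]) if rep[0] == rep[1]
                    else int(rep[1] > rep[0]))
            arms = stats[firm]
            untried = [a for a in range(K) if arms[a][1] == 0]
            if untried:
                choice = rng.choice(untried)
            elif algos[firm] == 'eps_greedy' and rng.random() < 0.05:
                choice = rng.randrange(K)          # deliberate exploration
            else:
                choice = max(range(K), key=lambda a: arms[a][0] / arms[a][1])
            reward = 1.0 if rng.random() < reward_probs[choice] else 0.0
            arms[choice][0] += reward
            arms[choice][1] += 1
            recent[firm].append(reward)
        return [len(r) for r in recent]            # users captured per firm

    print(duopoly(5000, [0.7, 0.5], ['greedy', 'eps_greedy'], seed=1))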