Incentivizing Exploration with Selective Data Disclosure
We study the design of rating systems that incentivize (more) efficient
social learning among self-interested agents. Agents arrive sequentially and
are presented with a set of possible actions, each of which yields a positive
reward with an unknown probability. A disclosure policy sends messages about
the rewards of previously-chosen actions to arriving agents. These messages can
alter agents' incentives towards exploration, taking potentially sub-optimal
actions for the sake of learning more about their rewards. Prior work achieves
much progress with disclosure policies that merely recommend an action to each
user, but relies heavily on standard, yet very strong rationality assumptions.
We study a particular class of disclosure policies that use messages, called
unbiased subhistories, consisting of the actions and rewards from a subsequence
of past agents. Each subsequence is chosen ahead of time, according to a
predetermined partial order on the rounds. We posit a flexible model of
frequentist agent response, which we argue is plausible for this class of
"order-based" disclosure policies. We measure the success of a policy by its
regret, i.e., the difference, over all rounds, between the expected reward of
the best action and the reward induced by the policy. A disclosure policy that
reveals full history in each round risks inducing herding behavior among the
agents, and typically has regret linear in the time horizon $T$. Our main
result is an order-based disclosure policy that obtains regret
$\tilde{O}(\sqrt{T})$. This regret is known to be optimal in the worst case
over reward distributions, even absent incentives. We also exhibit simpler
order-based policies with higher, but still sublinear, regret. These policies
can be interpreted as dividing a sublinear number of agents into constant-sized
focus groups, whose histories are then revealed to future agents.
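The regret measure described in this abstract can be written out explicitly. The block below is a sketch under assumed notation, not the paper's own: $\mu_a$ for the expected reward of action $a$, $a_t$ for the action chosen in round $t$, and $T$ for the time horizon are all symbols introduced here for illustration.

```latex
% Assumed notation: \mu_a = expected reward of action a,
% a_t = action chosen in round t, T = time horizon.
\[
  \mu^{*} = \max_{a} \mu_{a},
  \qquad
  R(T) = \sum_{t=1}^{T} \bigl( \mu^{*} - \mathbb{E}[\mu_{a_t}] \bigr).
\]
% "Regret linear in the time horizon" then reads R(T) = \Omega(T),
% and "sublinear regret" reads R(T) = o(T).
```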
Incentivized Exploration for Multi-Armed Bandits under Reward Drift
We study incentivized exploration for the multi-armed bandit (MAB) problem
where the players receive compensation for exploring arms other than the greedy
choice and may provide biased feedback on reward. We seek to understand the
impact of this drifted reward feedback by analyzing the performance of three
instantiations of the incentivized MAB algorithm: UCB, $\varepsilon$-Greedy,
and Thompson Sampling. Our results show that they all achieve $\mathcal{O}(\log T)$ regret and compensation under the drifted reward, and are therefore
effective in incentivizing exploration. Numerical examples are provided to
complement the theoretical analysis. Comment: 10 pages, 2 figures, AAAI 202
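As a rough illustration of one round of incentivized exploration with a UCB target arm, consider the sketch below. It rests on assumptions that do not come from the abstract: the compensation is taken to be the gap in empirical means between the player's greedy arm and the arm the principal wants pulled, and the feedback drift is taken to be linear in the compensation with a hypothetical rate `drift_rate`. The function names are likewise made up for this example.

```python
import math

def ucb_arm(means, pulls, t):
    """Arm with the largest UCB index (empirical mean plus confidence bonus)."""
    return max(
        range(len(means)),
        key=lambda a: means[a] + math.sqrt(2 * math.log(t) / pulls[a]),
    )

def incentivized_round(means, pulls, t, drift_rate=0.5):
    """One round: pick the principal's target arm and price the incentive.

    ASSUMPTIONS (illustrative only): compensation equals the empirical-mean
    gap between the greedy arm and the target arm; the reported reward is
    inflated by drift_rate * compensation.
    """
    greedy = max(range(len(means)), key=lambda a: means[a])
    target = ucb_arm(means, pulls, t)
    compensation = means[greedy] - means[target]  # zero if target is greedy
    drift = drift_rate * compensation
    return target, compensation, drift

# Arm 1 is under-sampled, so UCB targets it and the player must be paid.
target, compensation, drift = incentivized_round([0.6, 0.5], [100, 2], t=102)
```

In this toy setting the principal pays only when the exploratory arm differs from the player's greedy choice, which is why total compensation can stay small when exploration is rare.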
Bandit Social Learning: Exploration under Myopic Behavior
We study social learning dynamics where the agents collectively follow a
simple multi-armed bandit protocol. Agents arrive sequentially, choose arms and
receive associated rewards. Each agent observes the full history (arms and
rewards) of the previous agents, and there are no private signals. While
collectively the agents face an exploration-exploitation tradeoff, each agent acts
myopically, without regard to exploration. Motivating scenarios concern
reviews and ratings on online platforms.
We allow a wide range of myopic behaviors that are consistent with
(parameterized) confidence intervals, including the "unbiased" behavior as well
as various behavioral biases. While extreme versions of these behaviors
correspond to well-known bandit algorithms, we prove that more moderate
versions lead to stark exploration failures, and consequently to regret rates
that are linear in the number of agents. We provide matching upper bounds on
regret by analyzing "moderately optimistic" agents.
As a special case of independent interest, we obtain a general result on
failure of the greedy algorithm in multi-armed bandits. This is the first such
result in the literature, to the best of our knowledge.
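The kind of exploration failure described here can be observed in a small simulation. The sketch below is not the paper's construction; it assumes a two-armed Bernoulli bandit with made-up means, a greedy rule that samples each arm once and then always picks the best empirical mean, and ties broken toward the lower-indexed arm.

```python
import random

def run_greedy(means, horizon, rng):
    """Sample each arm once, then always pull the best empirical mean."""
    k = len(means)
    pulls = [0] * k
    successes = [0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # one initial sample per arm
        else:
            # max() returns the first maximizer, so ties go to the lower index
            arm = max(range(k), key=lambda a: successes[a] / pulls[a])
        reward = 1 if rng.random() < means[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
    return pulls

def failure_rate(means, horizon, trials, seed=0):
    """Fraction of runs in which the best arm is never pulled after its first sample."""
    rng = random.Random(seed)
    best = max(range(len(means)), key=lambda a: means[a])
    failures = 0
    for _ in range(trials):
        pulls = run_greedy(means, horizon, rng)
        if pulls[best] <= 1:
            failures += 1
    return failures / trials

# Hypothetical instance: arm 1 is better, but one unlucky first sample can bury it.
rate = failure_rate([0.5, 0.6], horizon=200, trials=2000)
```

Under these assumptions, whenever the better arm's single initial sample is a failure, its empirical mean stays at zero and it is never chosen again, so a constant fraction of runs incurs regret that grows linearly with the horizon.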
On Statistical Discrimination as a Failure of Social Learning: A Multi-Armed Bandit Approach
We analyze statistical discrimination in hiring markets using a multi-armed
bandit model. Myopic firms face workers arriving with heterogeneous observable
characteristics. The association between the worker's skill and characteristics
is unknown ex ante; thus, firms need to learn it. Laissez-faire causes
perpetual underestimation: minority workers are rarely hired, and therefore,
underestimation towards them tends to persist. Even a slight population-ratio
imbalance frequently produces perpetual underestimation. We propose two policy
solutions: a novel subsidy rule (the hybrid mechanism) and the Rooney Rule. Our
results indicate that temporary affirmative actions effectively mitigate
discrimination caused by insufficient data.
What Lies Ahead: An Exploration of Future Orientation, Self-Control, and Delinquency
Self-control has been consistently linked to antisocial behavior and though low self-control makes delinquency more likely, neither the findings nor the theory suggests that low self-control necessitates participation in such behavior. There remains a shortage of research on those situational factors or individual characteristics that might lessen the effects of low self-control on antisocial behavior. Future orientation is one such characteristic that can have implications for the control of behavior. The purpose of the current study was to explore the independent and interactive effects of future orientation and low self-control on delinquency using data from Wave 1 of the National Longitudinal Study of Adolescent Health. A series of regressions showed that self-control and future orientation had independent effects on delinquent behavior. Further, future-oriented achievement expectations conditioned the effect of self-control on delinquency such that the effects of self-control were weakened with increases in future orientation. The findings suggest that prevention programs should place more emphasis on helping youth plan for the future. Further, research should more fully explore the other aspects of future orientation (e.g., specificity of planning and change/stability of aspirations), as they relate to self-control and delinquency.
The Impact of Social Impact Bond Financing
Social impact bonds (SIBs), also known as Pay for Success, are an innovation in Payment by Results contracting. Investors finance programs and are repaid based on the “SIB effect,” which includes changes in outcomes attributable to financing. We generate a quantitative estimate of this part of the SIB effect for two active labor market programs in the Netherlands and Switzerland. Comparing program impacts within providers using SIB and non-SIB contracts suggests financing has positive impacts on public benefit receipt, employment, and income. Qualitative research suggests this is because SIB contracts increased pressure for all involved parties, leading to the institutionalization of selection and greater resources for SIB-financed services. Contracts with high pressure, like SIBs, may compromise both performance requirements and the potential to measure performance. We examine the implications of these findings in relation to agency and stewardship theories and highlight the significance of SIBs as multilateral as opposed to bilateral contracts.