3,633 research outputs found
Incentivizing Exploration with Heterogeneous Value of Money
Recently, Frazier et al. proposed a natural model for crowdsourced
exploration of different a priori unknown options: a principal is interested in
the long-term welfare of a population of agents who arrive one by one in a
multi-armed bandit setting. However, each agent is myopic, so in order to
incentivize him to explore options with better long-term prospects, the
principal must offer the agent money. Frazier et al. showed that a simple class
of policies called time-expanded are optimal in the worst case, and
characterized their budget-reward tradeoff.
The previous work assumed that all agents are equally and uniformly
susceptible to financial incentives. In reality, agents may have different
utility for money. We therefore extend the model of Frazier et al. to allow
agents that have heterogeneous and non-linear utilities for money. The
principal is informed of the agent's tradeoff via a signal that could be more
or less informative.
Our main result is to show that a convex program can be used to derive a
signal-dependent time-expanded policy which achieves the best possible
Lagrangian reward in the worst case. The worst-case guarantee is matched by
so-called "Diamonds in the Rough" instances; the proof that the guarantees
match is based on showing that two different convex programs have the same
optimal solution for these specific instances. These results also extend to the
budgeted case as in Frazier et al. We also show that the optimal policy is
monotone with respect to information, i.e., the approximation ratio of the
optimal policy improves as the signals become more informative.Comment: WINE 201
Learning Adaptive Display Exposure for Real-Time Advertising
In E-commerce advertising, where product recommendations and product ads are
presented to users simultaneously, the traditional setting is to display ads at
fixed positions. However, under such a setting, the advertising system loses
the flexibility to control the number and positions of ads, resulting in
sub-optimal platform revenue and user experience. Consequently, major
e-commerce platforms (e.g., Taobao.com) have begun to consider more flexible
ways to display ads. In this paper, we investigate the problem of advertising
with adaptive exposure: can we dynamically determine the number and positions
of ads for each user visit under certain business constraints so that the
platform revenue can be increased? More specifically, we consider two types of
constraints: request-level constraint ensures user experience for each user
visit, and platform-level constraint controls the overall platform monetization
rate. We model this problem as a Constrained Markov Decision Process with
per-state constraint (psCMDP) and propose a constrained two-level reinforcement
learning approach to decompose the original problem into two relatively
independent sub-problems. To accelerate policy learning, we also devise a
constrained hindsight experience replay mechanism. Experimental evaluations on
industry-scale real-world datasets demonstrate the merits of our approach in
both obtaining higher revenue under the constraints and the effectiveness of
the constrained hindsight experience replay mechanism.Comment: accepted by CIKM201
Budgeted Reinforcement Learning in Continuous State Space
A Budgeted Markov Decision Process (BMDP) is an extension of a Markov
Decision Process to critical applications requiring safety constraints. It
relies on a notion of risk implemented in the shape of a cost signal
constrained to lie below an - adjustable - threshold. So far, BMDPs could only
be solved in the case of finite state spaces with known dynamics. This work
extends the state-of-the-art to continuous spaces environments and unknown
dynamics. We show that the solution to a BMDP is a fixed point of a novel
Budgeted Bellman Optimality operator. This observation allows us to introduce
natural extensions of Deep Reinforcement Learning algorithms to address
large-scale BMDPs. We validate our approach on two simulated applications:
spoken dialogue and autonomous driving.Comment: N. Carrara and E. Leurent have equally contribute
The stochastic goodwill problem
Stochastic control problems related to optimal advertising under uncertainty
are considered. In particular, we determine the optimal strategies for the
problem of maximizing the utility of goodwill at launch time and minimizing the
disutility of a stream of advertising costs that extends until the launch time
for some classes of stochastic perturbations of the classical Nerlove-Arrow
dynamics. We also consider some generalizations such as problems with
constrained budget and with discretionary launching
A Deep Reinforcement Learning Framework for Rebalancing Dockless Bike Sharing Systems
Bike sharing provides an environment-friendly way for traveling and is
booming all over the world. Yet, due to the high similarity of user travel
patterns, the bike imbalance problem constantly occurs, especially for dockless
bike sharing systems, causing significant impact on service quality and company
revenue. Thus, it has become a critical task for bike sharing systems to
resolve such imbalance efficiently. In this paper, we propose a novel deep
reinforcement learning framework for incentivizing users to rebalance such
systems. We model the problem as a Markov decision process and take both
spatial and temporal features into consideration. We develop a novel deep
reinforcement learning algorithm called Hierarchical Reinforcement Pricing
(HRP), which builds upon the Deep Deterministic Policy Gradient algorithm.
Different from existing methods that often ignore spatial information and rely
heavily on accurate prediction, HRP captures both spatial and temporal
dependencies using a divide-and-conquer structure with an embedded localized
module. We conduct extensive experiments to evaluate HRP, based on a dataset
from Mobike, a major Chinese dockless bike sharing company. Results show that
HRP performs close to the 24-timeslot look-ahead optimization, and outperforms
state-of-the-art methods in both service level and bike distribution. It also
transfers well when applied to unseen areas
Markov Decision Processes with Applications in Wireless Sensor Networks: A Survey
Wireless sensor networks (WSNs) consist of autonomous and resource-limited
devices. The devices cooperate to monitor one or more physical phenomena within
an area of interest. WSNs operate as stochastic systems because of randomness
in the monitored environments. For long service time and low maintenance cost,
WSNs require adaptive and robust methods to address data exchange, topology
formulation, resource and power optimization, sensing coverage and object
detection, and security challenges. In these problems, sensor nodes are to make
optimized decisions from a set of accessible strategies to achieve design
goals. This survey reviews numerous applications of the Markov decision process
(MDP) framework, a powerful decision-making tool to develop adaptive algorithms
and protocols for WSNs. Furthermore, various solution methods are discussed and
compared to serve as a guide for using MDPs in WSNs
Task-Based Information Compression for Multi-Agent Communication Problems with Channel Rate Constraints
A collaborative task is assigned to a multiagent system (MAS) in which agents
are allowed to communicate. The MAS runs over an underlying Markov decision
process and its task is to maximize the averaged sum of discounted one-stage
rewards. Although knowing the global state of the environment is necessary for
the optimal action selection of the MAS, agents are limited to individual
observations. The inter-agent communication can tackle the issue of local
observability, however, the limited rate of the inter-agent communication
prevents the agent from acquiring the precise global state information. To
overcome this challenge, agents need to communicate their observations in a
compact way such that the MAS compromises the minimum possible sum of rewards.
We show that this problem is equivalent to a form of rate-distortion problem
which we call the task-based information compression. We introduce a scheme for
task-based information compression titled State aggregation for information
compression (SAIC), for which a state aggregation algorithm is analytically
designed. The SAIC is shown to be capable of achieving near-optimal performance
in terms of the achieved sum of discounted rewards. The proposed algorithm is
applied to a rendezvous problem and its performance is compared with several
benchmarks. Numerical experiments confirm the superiority of the proposed
algorithm.Comment: 13 pages, 9 figure
- …