A Deep Reinforcement Learning Framework for Rebalancing Dockless Bike Sharing Systems
Bike sharing provides an environmentally friendly way to travel and is
booming all over the world. Yet, due to the high similarity of user travel
patterns, the bike imbalance problem constantly occurs, especially for dockless
bike sharing systems, causing significant impact on service quality and company
revenue. Thus, it has become a critical task for bike sharing systems to
resolve such imbalance efficiently. In this paper, we propose a novel deep
reinforcement learning framework for incentivizing users to rebalance such
systems. We model the problem as a Markov decision process and take both
spatial and temporal features into consideration. We develop a novel deep
reinforcement learning algorithm called Hierarchical Reinforcement Pricing
(HRP), which builds upon the Deep Deterministic Policy Gradient algorithm.
Different from existing methods that often ignore spatial information and rely
heavily on accurate prediction, HRP captures both spatial and temporal
dependencies using a divide-and-conquer structure with an embedded localized
module. We conduct extensive experiments to evaluate HRP, based on a dataset
from Mobike, a major Chinese dockless bike sharing company. Results show that
HRP performs close to the 24-timeslot look-ahead optimization, and outperforms
state-of-the-art methods in both service level and bike distribution. It also
transfers well when applied to unseen areas.
Deterministic Value-Policy Gradients
Reinforcement learning algorithms such as the deep deterministic policy
gradient algorithm (DDPG) have been widely used in continuous control tasks.
However, the model-free DDPG algorithm suffers from high sample complexity. In
this paper we consider the deterministic value gradients to improve the sample
efficiency of deep reinforcement learning algorithms. Previous works consider
deterministic value gradients with a finite horizon, which is too myopic
compared with the infinite horizon. We first give a theoretical guarantee of the
existence of the value gradients in this infinite setting. Based on this
theoretical guarantee, we propose a class of deterministic value gradient
algorithms (DVG) with an infinite horizon, in which the number of rollout steps
of the analytical gradients through the learned model trades off the variance
of the value gradients against the model bias. Furthermore, to better combine the
model-based deterministic value gradient estimators with the model-free
deterministic policy gradient estimator, we propose the deterministic
value-policy gradient (DVPG) algorithm. We finally conduct extensive
experiments comparing DVPG with state-of-the-art methods on several standard
continuous control benchmarks. Results demonstrate that DVPG substantially
outperforms other baselines.
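The rollout-length trade-off mentioned in the abstract can be illustrated with a generic k-step value estimate: roll a learned model forward for k steps under the policy, then bootstrap with a critic. The sketch below is a toy illustration, not the DVG or DVPG algorithm from the paper; the function `k_step_value`, the 1-D dynamics, the reward, and the closed-form critic are all assumptions chosen so the result can be checked by hand. It computes the value estimate only; a value-gradient method would differentiate such an estimate with respect to the policy parameters.

```python
from typing import Callable

GAMMA = 0.9  # discount factor (assumed for this toy example)


def k_step_value(
    state: float,
    k: int,
    gamma: float,
    policy: Callable[[float], float],
    model: Callable[[float, float], float],
    reward: Callable[[float, float], float],
    critic: Callable[[float, float], float],
) -> float:
    """Sum k discounted model-predicted rewards, then bootstrap with the critic."""
    total = 0.0
    discount = 1.0
    s = state
    for _ in range(k):
        a = policy(s)
        total += discount * reward(s, a)
        s = model(s, a)  # learned dynamics model predicts the next state
        discount *= gamma
    # Tail of the infinite-horizon return is approximated by the critic.
    return total + discount * critic(s, policy(s))


# Hypothetical 1-D problem: the policy halves the state at every step.
def policy(s: float) -> float:
    return -0.5 * s


def model(s: float, a: float) -> float:
    return s + a


def reward(s: float, a: float) -> float:
    return -s * s


def critic(s: float, a: float) -> float:
    # Exact Q-value of the policy above for this toy problem.
    next_s = s + a
    return -s * s - GAMMA * next_s * next_s / (1.0 - 0.25 * GAMMA)


estimates = [
    k_step_value(2.0, k, GAMMA, policy, model, reward, critic) for k in (0, 1, 5)
]
```

Because the toy model and critic are both exact, every rollout length gives the same estimate. With learned approximations, a longer rollout relies more on the model and a shorter one relies more on the critic, which is where the choice of k starts to matter.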
Reinforcement Mechanism Design for E-Commerce
We study the problem of allocating impressions to sellers in e-commerce
websites, such as Amazon, eBay or Taobao, aiming to maximize the total revenue
generated by the platform. We employ a general framework of reinforcement
mechanism design, which uses deep reinforcement learning to design efficient
algorithms, taking the strategic behaviour of the sellers into account.
Specifically, we model the impression allocation problem as a Markov decision
process, where the states encode the history of impressions, prices,
transactions and generated revenue and the actions are the possible impression
allocations in each round. To tackle the problem of continuity and
high-dimensionality of states and actions, we adopt the ideas of the DDPG
algorithm to design an actor-critic policy gradient algorithm which takes
advantage of the problem domain in order to achieve convergence and stability.
We evaluate our proposed algorithm, coined IA(GRU), by comparing it against
DDPG, as well as several natural heuristics, under different rationality models
for the sellers: we assume that sellers follow well-known no-regret type
strategies which may vary in their degree of sophistication. We find that
IA(GRU) outperforms all other algorithms in terms of the total revenue.
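The Markov decision process described in this abstract can be sketched as a data structure: a state that holds a window of past rounds (impressions, prices, transactions, revenue) and an action that splits a fixed impression budget across sellers. This is a minimal sketch under stated assumptions, not the IA(GRU) implementation. The names `RoundRecord`, `AllocationState`, and `to_allocation`, the fixed-length history window, and the softmax mapping from raw actor scores to an allocation are all hypothetical.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class RoundRecord:
    """What the platform observes in one round (one entry per seller)."""
    impressions: List[float]
    prices: List[float]
    transactions: List[float]
    revenue: float


class AllocationState:
    """State = the most recent `window` rounds (window length is an assumption)."""

    def __init__(self, window: int) -> None:
        self.history: Deque[RoundRecord] = deque(maxlen=window)

    def observe(self, record: RoundRecord) -> None:
        self.history.append(record)  # oldest round is dropped automatically

    def features(self) -> List[float]:
        """Flatten the stored rounds into one feature vector for a policy network."""
        flat: List[float] = []
        for r in self.history:
            flat.extend(r.impressions)
            flat.extend(r.prices)
            flat.extend(r.transactions)
            flat.append(r.revenue)
        return flat


def to_allocation(scores: List[float], total_impressions: float) -> List[float]:
    """Map raw actor scores to a non-negative allocation summing to the budget."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [total_impressions * e / z for e in exps]
```

For example, `to_allocation([0.0, 0.0], 100.0)` splits 100 impressions evenly between two sellers. A recurrent encoder could consume the per-round records directly instead of the flattened vector; the flattening here is only for simplicity.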