73,020 research outputs found
Asynchronous Gossip for Averaging and Spectral Ranking
We consider two variants of the classical gossip algorithm. The first variant
is a version of asynchronous stochastic approximation. We highlight a
fundamental difficulty associated with the classical asynchronous gossip
scheme, viz., that it may not converge to a desired average, and suggest an
alternative scheme based on reinforcement learning that has guaranteed
convergence to the desired average. We then discuss a potential application to
a wireless network setting with simultaneous link activation constraints. The
second variant is a gossip algorithm for distributed computation of the
Perron-Frobenius eigenvector of a nonnegative matrix. While the first variant
draws upon a reinforcement learning algorithm for an average cost controlled
Markov decision problem, the second variant draws upon a reinforcement learning
algorithm for risk-sensitive control. We then discuss potential applications of
the second variant to ranking schemes, reputation networks, and principal
component analysis.Comment: 14 pages, 7 figures. Minor revisio
Diagnostic Evaluation of Policy-Gradient-Based Ranking
Learning-to-rank has been intensively studied and has shown significantly increasing values in a wide range of domains, such as web search, recommender systems, dialogue systems, machine translation, and even computational biology, to name a few. In light of recent advances in neural networks, there has been a strong and continuing interest in exploring how to deploy popular techniques, such as reinforcement learning and adversarial learning, to solve ranking problems. However, armed with the aforesaid popular techniques, most studies tend to show how effective a new method is. A comprehensive comparison between techniques and an in-depth analysis of their deficiencies are somehow overlooked. This paper is motivated by the observation that recent ranking methods based on either reinforcement learning or adversarial learning boil down to policy-gradient-based optimization. Based on the widely used benchmark collections with complete information (where relevance labels are known for all items), such as MSLRWEB30K and Yahoo-Set1, we thoroughly investigate the extent to which policy-gradient-based ranking methods are effective. On one hand, we analytically identify the pitfalls of policy-gradient-based ranking. On the other hand, we experimentally compare a wide range of representative methods. The experimental results echo our analysis and show that policy-gradient-based ranking methods are, by a large margin, inferior to many conventional ranking methods. Regardless of whether we use reinforcement learning or adversarial learning, the failures are largely attributable to the gradient estimation based on sampled rankings, which significantly diverge from ideal rankings. In particular, the larger the number of documents per query and the more fine-grained the ground-truth labels, the greater the impact policy-gradient-based ranking suffers. Careful examination of this weakness is highly recommended for developing enhanced methods based on policy gradient
CoRide: Joint Order Dispatching and Fleet Management for Multi-Scale Ride-Hailing Platforms
How to optimally dispatch orders to vehicles and how to tradeoff between
immediate and future returns are fundamental questions for a typical
ride-hailing platform. We model ride-hailing as a large-scale parallel ranking
problem and study the joint decision-making task of order dispatching and fleet
management in online ride-hailing platforms. This task brings unique challenges
in the following four aspects. First, to facilitate a huge number of vehicles
to act and learn efficiently and robustly, we treat each region cell as an
agent and build a multi-agent reinforcement learning framework. Second, to
coordinate the agents from different regions to achieve long-term benefits, we
leverage the geographical hierarchy of the region grids to perform hierarchical
reinforcement learning. Third, to deal with the heterogeneous and variant
action space for joint order dispatching and fleet management, we design the
action as the ranking weight vector to rank and select the specific order or
the fleet management destination in a unified formulation. Fourth, to achieve
the multi-scale ride-hailing platform, we conduct the decision-making process
in a hierarchical way where a multi-head attention mechanism is utilized to
incorporate the impacts of neighbor agents and capture the key agent in each
scale. The whole novel framework is named as CoRide. Extensive experiments
based on multiple cities real-world data as well as analytic synthetic data
demonstrate that CoRide provides superior performance in terms of platform
revenue and user experience in the task of city-wide hybrid order dispatching
and fleet management over strong baselines.Comment: CIKM 201
APRIL: Active Preference-learning based Reinforcement Learning
This paper focuses on reinforcement learning (RL) with limited prior
knowledge. In the domain of swarm robotics for instance, the expert can hardly
design a reward function or demonstrate the target behavior, forbidding the use
of both standard RL and inverse reinforcement learning. Although with a limited
expertise, the human expert is still often able to emit preferences and rank
the agent demonstrations. Earlier work has presented an iterative
preference-based RL framework: expert preferences are exploited to learn an
approximate policy return, thus enabling the agent to achieve direct policy
search. Iteratively, the agent selects a new candidate policy and demonstrates
it; the expert ranks the new demonstration comparatively to the previous best
one; the expert's ranking feedback enables the agent to refine the approximate
policy return, and the process is iterated. In this paper, preference-based
reinforcement learning is combined with active ranking in order to decrease the
number of ranking queries to the expert needed to yield a satisfactory policy.
Experiments on the mountain car and the cancer treatment testbeds witness that
a couple of dozen rankings enable to learn a competent policy
- …