Personalization of playlists is a common feature in music streaming services,
but conventional techniques, such as collaborative filtering, rely on explicit
assumptions regarding content quality to learn how to make recommendations.
Such assumptions often result in misalignment between offline model objectives
and online user satisfaction metrics. In this paper, we present a reinforcement
learning framework that addresses these limitations by directly optimizing
user satisfaction metrics in a simulated playlist-generation
environment. Using this simulator, we develop and train a modified Deep
Q-Network, the action head DQN (AH-DQN), to address the challenges posed by the
large state and action spaces of our RL formulation.
The resulting policy is capable of making recommendations from large and
dynamic sets of candidate items with the expectation of maximizing consumption
metrics. We analyze and evaluate agents offline via simulations that use
environment models trained on both public and proprietary streaming datasets.
We show how these agents lead to better user-satisfaction metrics compared to
baseline methods during online A/B tests. Finally, we demonstrate that
performance assessments produced from our simulator are strongly correlated
with observed online metric results.

Comment: 10 pages. KDD 2