6 research outputs found

    Sales Channel Optimization via Simulations Based on Observational Data with Delayed Rewards: A Case Study at LinkedIn

    Training models on data obtained from randomized experiments is ideal for making good decisions. However, randomized experiments are often time-consuming, expensive, risky, infeasible or unethical to perform, leaving decision makers little choice but to rely on observational data collected under historical policies when training models. This opens questions not only about which decision-making policies would perform best in practice, but also about the impact of different data collection protocols on the performance of policies trained on the data, and about the robustness of policy performance to changes in problem characteristics such as action- or reward-specific delays in observing outcomes. We aim to answer such questions for the problem of optimizing sales channel allocations at LinkedIn, where sales accounts (leads) need to be allocated to one of three channels, with the goal of maximizing the number of successful conversions over a period of time. A key feature of the problem is the presence of stochastic delays in observing allocation outcomes, whose distribution is both channel- and outcome-dependent. We built a discrete-time simulation that can handle these problem features and used it to evaluate: a) a historical rule-based policy; b) a supervised machine learning policy (XGBoost); and c) multi-armed bandit (MAB) policies, under different scenarios involving: i) the data collection used for training (observational vs. randomized); ii) lead conversion scenarios; iii) delay distributions. Our simulation results indicate that LinUCB, a simple MAB policy, consistently outperforms the other policies, achieving an 18-47% lift relative to the rule-based policy. Comment: Accepted at REVEAL'22 Workshop (16th ACM Conference on Recommender Systems - RecSys 2022).
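    The abstract describes evaluating a LinUCB bandit policy inside a discrete-time simulation where conversion outcomes arrive after channel- and outcome-dependent delays. The following is only a minimal sketch of that setup, not the authors' implementation: the feature dimension, conversion model, and exponential delay parameters are hypothetical, chosen purely to illustrate how delayed feedback can be queued and applied to a disjoint LinUCB learner.

        import numpy as np

        rng = np.random.default_rng(0)

        class LinUCB:
            """Disjoint LinUCB: one ridge-regression model per channel (arm)."""
            def __init__(self, n_arms, dim, alpha=1.0):
                self.alpha = alpha
                self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
                self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y per arm

            def select(self, x):
                scores = []
                for A, b in zip(self.A, self.b):
                    A_inv = np.linalg.inv(A)
                    theta = A_inv @ b
                    # point estimate plus exploration bonus (upper confidence bound)
                    scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
                return int(np.argmax(scores))

            def update(self, arm, x, reward):
                self.A[arm] += np.outer(x, x)
                self.b[arm] += reward * x

        # Hypothetical discrete-time simulation with delayed, channel-dependent feedback.
        n_channels, dim, horizon = 3, 5, 2000
        policy = LinUCB(n_channels, dim)
        true_theta = rng.normal(size=(n_channels, dim))   # hypothetical conversion model
        mean_delay = np.array([5.0, 15.0, 30.0])          # hypothetical per-channel delays
        pending = []                                      # (arrival_time, arm, x, reward)
        conversions = 0

        for t in range(horizon):
            # Deliver feedback whose delay has elapsed, then update the policy.
            ready = [p for p in pending if p[0] <= t]
            pending = [p for p in pending if p[0] > t]
            for _, arm, x, r in ready:
                policy.update(arm, x, r)

            x = rng.normal(size=dim)                      # lead features
            arm = policy.select(x)
            p_convert = 1.0 / (1.0 + np.exp(-x @ true_theta[arm]))
            r = float(rng.random() < p_convert)
            conversions += r
            # Delay depends on both the channel and the outcome, as in the abstract.
            delay = rng.exponential(mean_delay[arm] * (0.5 if r else 1.5))
            pending.append((t + delay, arm, x, r))

        print(f"conversions over {horizon} leads: {int(conversions)}")

    The key design point illustrated here is that the learner only sees an outcome once its delay has elapsed, so at any time step the policy is acting on an incomplete picture of past allocations.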

    An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits

    We present an oracle-efficient relaxation for the adversarial contextual bandits problem, where the contexts are sequentially drawn i.i.d. from a known distribution and the cost sequence is chosen by an online adversary. Our algorithm has a regret bound of $O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls per round to an offline optimization oracle, where $K$ denotes the number of actions, $T$ denotes the number of rounds and $\Pi$ denotes the set of policies. This is the first result to improve on the prior best bound of $O((TK)^{\frac{2}{3}}(\log(|\Pi|))^{\frac{1}{3}})$ obtained by Syrgkanis et al. at NeurIPS 2016, and the first to match the original bound of Langford and Zhang at NeurIPS 2007, which was obtained for the stochastic case. Comment: Appears in NeurIPS 202
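    How the two bounds compare can be read off by expanding the expressions quoted above (simple algebra, not a claim from the paper itself):

    \[
    \frac{(TK)^{\frac{2}{3}}\,(\log|\Pi|)^{\frac{1}{3}}}{T^{\frac{2}{3}}\,(K\log|\Pi|)^{\frac{1}{3}}}
    = \frac{T^{\frac{2}{3}}\,K^{\frac{2}{3}}\,(\log|\Pi|)^{\frac{1}{3}}}{T^{\frac{2}{3}}\,K^{\frac{1}{3}}\,(\log|\Pi|)^{\frac{1}{3}}}
    = K^{\frac{1}{3}},
    \]

    i.e. the new bound removes a $K^{\frac{1}{3}}$ factor from the dependence on the number of actions while keeping the same $T^{\frac{2}{3}}$ and $\log|\Pi|$ dependence.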