
    Knowledge Infused Policy Gradients with Upper Confidence Bound for Relational Bandits

    Contextual bandits find important use cases in various real-life scenarios such as online advertising, recommendation systems, and healthcare. However, most algorithms use flat feature vectors to represent the context, whereas in the real world there is a varying number of objects, and relations among them, to model in the context. For example, in a music recommendation system, the user context contains what music they listen to, which artists create this music, the artists' albums, etc. Adding richer relational context representations also introduces a much larger context space, making exploration-exploitation harder. To improve the efficiency of exploration-exploitation, knowledge about the context can be infused to guide the exploration-exploitation strategy. Relational context representations allow a natural way for humans to specify knowledge owing to their descriptive nature. We propose an adaptation of Knowledge Infused Policy Gradients to the contextual bandit setting and a novel Knowledge Infused Policy Gradients Upper Confidence Bound algorithm, and perform an experimental analysis on a simulated music recommendation dataset and various real-life datasets, showing where expert knowledge can drastically reduce the total regret and where it cannot. Comment: Accepted for publication in the research track at ECML-PKDD 202
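    To give a rough sense of how a policy-gradient bandit learner can be combined with a UCB-style exploration bonus, here is a minimal sketch: a softmax policy over discrete arms updated with a REINFORCE-style gradient, plus a count-based confidence bonus added to the policy scores. All quantities (n_arms, d, ucb_c, the simulated linear reward) are illustrative assumptions, the knowledge-infusion term is omitted, and this is not the authors' KIPG-UCB algorithm.

```python
import numpy as np

# Minimal sketch: softmax policy gradient over discrete arms with a
# count-based UCB exploration bonus added to the policy scores.
# All quantities (n_arms, d, ucb_c, the reward simulator) are illustrative.

rng = np.random.default_rng(0)
n_arms, d, T, lr, ucb_c = 5, 8, 2000, 0.05, 1.0
theta = np.zeros((n_arms, d))          # per-arm policy parameters
counts = np.ones(n_arms)               # arm pull counts (init 1 to avoid /0)
true_w = rng.normal(size=(n_arms, d))  # hidden reward model, for simulation only

for t in range(1, T + 1):
    x = rng.normal(size=d)                                        # observed context
    scores = theta @ x + ucb_c * np.sqrt(np.log(t + 1) / counts)  # UCB bonus
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                          # softmax policy
    a = rng.choice(n_arms, p=probs)
    r = true_w[a] @ x + rng.normal(scale=0.1)                     # bandit feedback
    # REINFORCE-style update: raise the log-probability of the chosen arm by reward
    grad = -probs[:, None] * x[None, :]
    grad[a] += x
    theta += lr * r * grad
    counts[a] += 1
```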

    Contextual Linear Bandits under Noisy Features: Towards Bayesian Oracles

    We study contextual linear bandit problems under feature uncertainty: the features are noisy and may have missing entries. To address the challenges arising from the noise, we analyze Bayesian oracles given the observed noisy features. Our Bayesian analysis finds that the optimal hypothesis can be far from the underlying realizability function, depending on the noise characteristics, which is highly non-intuitive and does not occur in the classical noiseless setup. This implies that classical approaches cannot guarantee a non-trivial regret bound. We thus propose an algorithm that aims at the Bayesian oracle from the observed information under this model, achieving an $\tilde{O}(d\sqrt{T})$ regret bound with respect to feature dimension $d$ and time horizon $T$. We demonstrate the proposed algorithm using synthetic and real-world datasets. Comment: 30 page
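    One way to read the "aim at the Bayesian oracle" idea is to denoise each observed context with its posterior mean before running a standard linear-UCB step. The sketch below assumes a Gaussian feature prior N(mu, Sigma) and Gaussian observation noise N(0, Gamma), and does not handle missing entries; it illustrates the setting only and is not the paper's algorithm.

```python
import numpy as np

# Minimal sketch: replace the noisy observed context by its Bayesian posterior
# mean (Gaussian prior and noise assumed), then run an ordinary ridge/UCB step.
# mu, Sigma, Gamma and alpha are illustrative assumptions.

def posterior_mean(z, mu, Sigma, Gamma):
    """E[x | z] for x ~ N(mu, Sigma), z = x + eps, eps ~ N(0, Gamma)."""
    gain = Sigma @ np.linalg.inv(Sigma + Gamma)
    return mu + gain @ (z - mu)

def linucb_step(A, b, contexts, alpha=1.0):
    """Pick the arm maximizing the ridge estimate plus a confidence width.
    A, b are the usual ridge statistics; contexts is one (denoised) feature
    vector per arm."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    scores = [x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x) for x in contexts]
    return int(np.argmax(scores))

# Intended usage per round: denoise each arm's noisy feature vector with
# posterior_mean(...), then call linucb_step(...) on the denoised contexts.
```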

    Reinforcement learning for bandits with continuous actions and large context spaces

    We consider the challenging scenario of contextual bandits with continuous actions and large context spaces. This is an increasingly important application area in personalised healthcare, where an agent is requested to make dosing decisions based on a patient's single image scan. In this paper, we first adapt a reinforcement learning (RL) algorithm for continuous control to outperform contextual bandit algorithms specifically hand-crafted for continuous action spaces. We empirically demonstrate this on a suite of standard benchmark datasets for vector contexts. Secondly, we demonstrate that our RL agent generalises to problems with continuous actions and large context spaces, providing results that outperform previous methods on image contexts. Thirdly, we introduce a new contextual bandits test domain with a multi-dimensional continuous action space and image contexts, which existing tree-based methods cannot handle. We provide initial results with our RL agent.
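    A continuous-control actor-critic reduced to the one-step bandit case (no bootstrapped target, since every episode is a single step) might look roughly as follows. The network sizes, optimizers, and data interface are assumptions; this is a generic DDPG-style sketch, not the paper's agent.

```python
import torch
import torch.nn as nn

# Minimal sketch: a DDPG-style one-step actor-critic for a continuous-action
# contextual bandit. The critic regresses the observed reward; the actor
# ascends the critic. Layer sizes and the update interface are assumptions.

class Actor(nn.Module):
    def __init__(self, ctx_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, x):
        return self.net(x)

class Critic(nn.Module):
    def __init__(self, ctx_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim + act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=-1))

def update(actor, critic, opt_a, opt_c, x, a, r):
    """One gradient step on a batch of (context x, action a, reward r) tensors,
    with r shaped (batch, 1)."""
    # Critic: fit the observed bandit reward (no next state, so no TD target).
    critic_loss = nn.functional.mse_loss(critic(x, a), r)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    # Actor: deterministic policy gradient through the critic.
    actor_loss = -critic(x, actor(x)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```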

    Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

    We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method -- called Policy Optimizer for Exponential Models (POEM) -- for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems, showing substantially improved robustness and generalization performance compared to the state-of-the-art. Comment: 10 page
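    The core CRM quantity is a clipped propensity-weighted (inverse propensity scoring) estimate of the risk, traded off against the empirical variance of that estimator. The sketch below illustrates that objective on logged data; the clipping constant M and trade-off weight lam are illustrative choices, and this is not the POEM optimizer itself.

```python
import numpy as np

# Minimal sketch of the CRM idea: evaluate a new stochastic policy on logged
# bandit data with clipped inverse-propensity scoring, and penalize the
# empirical variance of the estimator. M and lam are illustrative assumptions.

def crm_objective(new_probs, logged_rewards, logging_probs, M=10.0, lam=0.1):
    """new_probs[i]  : pi_w(a_i | x_i) under the policy being evaluated
       logged_rewards: observed feedback for the logged action on example i
       logging_probs : propensity p_i = pi_0(a_i | x_i) of the logging policy"""
    weights = np.minimum(new_probs / logging_probs, M)  # clipped IPS weights
    terms = logged_rewards * weights
    estimate = terms.mean()                             # propensity-weighted estimate
    var_penalty = np.sqrt(terms.var(ddof=1) / len(terms))
    # CRM principle: trade the estimate off against the estimator's variance.
    # With rewards, maximize this quantity; with losses, minimize estimate + lam * var_penalty.
    return estimate - lam * var_penalty
```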

    An Efficient Algorithm for Deep Stochastic Contextual Bandits

    In stochastic contextual bandit (SCB) problems, an agent selects an action based on an observed context to maximize the cumulative reward over iterations. Recently there have been a few studies using a deep neural network (DNN) to predict the expected reward for an action, with the DNN trained by a stochastic gradient based method. However, convergence analysis has been largely neglected, leaving open whether and where these methods converge. In this work, we formulate the SCB problem that uses a DNN reward function as a non-convex stochastic optimization problem, and design a stage-wise stochastic gradient descent algorithm to optimize the problem and determine the action policy. We prove that, with high probability, the action sequence chosen by this algorithm converges to the greedy action policy with respect to a locally optimal reward function. Extensive experiments demonstrate the effectiveness and efficiency of the proposed algorithm on multiple real-world datasets. Comment: Accepted by AAAI 202
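    The overall loop described, a DNN reward model trained by stochastic gradient steps on observed (context, chosen action, reward) triples with (epsilon-)greedy action selection, can be sketched as follows. The environment object env with context()/reward() methods is a hypothetical interface, the StepLR schedule merely stands in for the paper's stage-wise step sizes, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch: a DNN predicts the expected reward of each action given the
# context; actions are chosen (epsilon-)greedily and the network is trained by
# SGD on the reward observed for the chosen action only. `env` is a hypothetical
# environment returning torch tensors; hyperparameters are illustrative.

def make_reward_net(ctx_dim, n_actions):
    return nn.Sequential(nn.Linear(ctx_dim, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

def run(env, ctx_dim, n_actions, T=10_000, eps=0.05):
    net = make_reward_net(ctx_dim, n_actions)
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    # Piecewise-constant, decaying step size as a stand-in for stage-wise SGD.
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=T // 5, gamma=0.5)
    for _ in range(T):
        x = env.context()                           # observe context (1-D tensor)
        with torch.no_grad():
            q = net(x)                              # predicted reward per action
        greedy = torch.rand(()) > eps
        a = int(q.argmax()) if greedy else int(torch.randint(n_actions, ()))
        r = env.reward(a)                           # bandit feedback (float)
        loss = (net(x)[a] - r) ** 2                 # fit only the observed entry
        opt.zero_grad(); loss.backward(); opt.step()
        sched.step()
    return net
```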