539 research outputs found
Knowledge Infused Policy Gradients with Upper Confidence Bound for Relational Bandits
Contextual bandits have important use cases in many real-life scenarios,
such as online advertising, recommendation systems, and healthcare. However,
most algorithms represent the context as a flat feature vector, whereas in
the real world the context contains a varying number of objects and
relations among them. For example, in a music recommendation system, the
user context includes what music they listen to, which artists create this
music, the artists' albums, and so on. Richer relational context
representations, however, also introduce a much larger context space, making
exploration-exploitation harder. To improve the efficiency of
exploration-exploitation, knowledge about the context can be infused to
guide the exploration-exploitation strategy. Relational context
representations offer a natural way for humans to specify such knowledge,
owing to their descriptive nature. We propose an adaptation of Knowledge
Infused Policy Gradients to the contextual bandit setting and a novel
Knowledge Infused Policy Gradients Upper Confidence Bound algorithm, and we
perform an experimental analysis on a simulated music recommendation dataset
and various real-life datasets, identifying where expert knowledge can
drastically reduce the total regret and where it cannot.
Comment: Accepted for publication in the research track at ECML-PKDD 202
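For intuition, here is a minimal sketch of the general idea of pairing a policy-gradient bandit with a UCB-style exploration bonus. The non-contextual setting, the class name, and the simplified update rule are illustrative assumptions, not the paper's KIPG-UCB algorithm.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

class PGUCBBandit:
    """Softmax policy-gradient bandit with a UCB-style bonus (sketch)."""

    def __init__(self, n_arms, lr=0.1, c=1.0):
        self.theta = np.zeros(n_arms)   # learned preference per arm
        self.counts = np.zeros(n_arms)  # pull counts for the UCB bonus
        self.t = 0
        self.lr, self.c = lr, c

    def select(self):
        self.t += 1
        # Optimism: rarely pulled arms get a larger exploration bonus.
        bonus = self.c * np.sqrt(np.log(self.t + 1) / (self.counts + 1))
        probs = softmax(self.theta + bonus)
        arm = np.random.choice(len(probs), p=probs)
        return arm, probs

    def update(self, arm, reward, probs):
        self.counts[arm] += 1
        grad = -probs                   # d log pi(arm) / d theta
        grad[arm] += 1.0
        self.theta += self.lr * reward * grad
```

Arms with few pulls receive a larger bonus in the softmax logits, so the policy keeps sampling them until their preference estimates stabilise.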
Contextual Linear Bandits under Noisy Features: Towards Bayesian Oracles
We study contextual linear bandit problems under feature uncertainty:
features are noisy and may have missing entries. To address the challenges
arising from the noise, we analyze Bayesian oracles given the observed noisy
features. Our Bayesian analysis finds that the optimal hypothesis can be far
from the underlying realizability function, depending on the noise
characteristics, which is highly non-intuitive and does not occur in
classical noiseless setups. This implies that classical approaches cannot
guarantee a non-trivial regret bound. We thus propose an algorithm that aims
at the Bayesian oracle from the observed information under this model,
achieving a regret bound in terms of the feature dimension and time horizon.
We demonstrate the proposed algorithm on synthetic and real-world datasets.
Comment: 30 pages
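The notion of a Bayesian oracle under noisy features can be illustrated with a small sketch under a Gaussian noise model: instead of scoring arms with the raw noisy feature, the oracle scores them with the posterior mean of the true feature given the observation. The Gaussian assumptions and all function names here are illustrative, not the paper's model.

```python
import numpy as np

def posterior_mean_feature(xhat, mu0, Sigma0, Sigma_n):
    """E[x | xhat] under x ~ N(mu0, Sigma0) and xhat = x + N(0, Sigma_n)."""
    gain = Sigma0 @ np.linalg.inv(Sigma0 + Sigma_n)
    return mu0 + gain @ (xhat - mu0)

def bayes_oracle_action(xhats, theta, mu0, Sigma0, Sigma_n):
    """Score each arm by theta . E[x | xhat] instead of theta . xhat."""
    scores = [theta @ posterior_mean_feature(x, mu0, Sigma0, Sigma_n)
              for x in xhats]
    return int(np.argmax(scores))
```

With little noise the posterior mean reduces to the observation itself; with heavy noise it shrinks toward the prior mean, which is exactly why the oracle can differ sharply from the naive noiseless policy.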
Reinforcement learning for bandits with continuous actions and large context spaces
We consider the challenging scenario of contextual bandits with continuous actions and large context spaces. This is an increasingly important application area in personalised healthcare, where an agent is asked to make dosing decisions based on a patient's single image scan. In this paper, we first adapt a reinforcement learning (RL) algorithm for continuous control to outperform contextual bandit algorithms specifically hand-crafted for continuous action spaces. We demonstrate this empirically on a suite of standard benchmark datasets with vector contexts. Secondly, we demonstrate that our RL agent generalises to problems with continuous actions and large context spaces, providing results that outperform previous methods on image contexts. Thirdly, we introduce a new contextual bandit test domain with a multi-dimensional continuous action space and image contexts, which existing tree-based methods cannot handle. We provide initial results with our RL agent in this domain.
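One way to see how an RL algorithm applies here is to treat each bandit round as a one-step episode and run a REINFORCE-style update on a Gaussian policy over the continuous action. This is a generic sketch with a simple running baseline, not the continuous-control algorithm used in the paper; all names and hyperparameters are assumptions.

```python
import numpy as np

class GaussianPolicyBandit:
    """Continuous-action contextual bandit as one-step RL: a linear-Gaussian
    policy a ~ N(w . x, sigma^2) trained by REINFORCE with a baseline."""

    def __init__(self, dim, lr=0.02, sigma=0.3):
        self.w = np.zeros(dim)
        self.baseline = 0.0     # running average of rewards
        self.lr, self.sigma = lr, sigma

    def act(self, x, rng):
        return rng.normal(self.w @ x, self.sigma)

    def update(self, x, a, reward):
        self.baseline += 0.01 * (reward - self.baseline)
        adv = reward - self.baseline
        # grad of log N(a; w.x, sigma^2) with respect to w
        grad = (a - self.w @ x) / self.sigma ** 2 * x
        self.w += self.lr * adv * grad
```

On a toy dosing problem with reward peaked at some target dose, the policy mean drifts toward the target because better-than-baseline actions are reinforced.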
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning
from logged bandit feedback. This learning setting is ubiquitous in online
systems (e.g., ad placement, web search, recommendation), where an algorithm
makes a prediction (e.g., ad ranking) for a given input (e.g., query) and
observes bandit feedback (e.g., user clicks on presented ads). We first address
the counterfactual nature of the learning problem through propensity scoring.
Next, we prove generalization error bounds that account for the variance of the
propensity-weighted empirical risk estimator. These constructive bounds give
rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM
can be used to derive a new learning method -- called Policy Optimizer for
Exponential Models (POEM) -- for learning stochastic linear rules for
structured output prediction. We present a decomposition of the POEM objective
that enables efficient stochastic gradient optimization. POEM is evaluated on
several multi-label classification problems showing substantially improved
robustness and generalization performance compared to the state-of-the-art.
Comment: 10 pages
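The propensity-scoring step and the variance-sensitive CRM objective can be sketched in a few lines. The clipping constant and variance-penalty weight below are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def ips_risk(losses, new_probs, logged_probs, clip=10.0):
    """Clipped inverse-propensity estimate of a new policy's risk from
    logged bandit feedback (losses were observed under logged_probs)."""
    w = np.minimum(new_probs / logged_probs, clip)
    return float(np.mean(w * losses))

def crm_objective(losses, new_probs, logged_probs, lam=0.1, clip=10.0):
    """IPS risk plus a variance penalty, in the spirit of CRM: prefer
    policies whose counterfactual risk estimate is also low-variance."""
    w = np.minimum(new_probs / logged_probs, clip)
    vals = w * losses
    return float(vals.mean() + lam * np.sqrt(vals.var(ddof=1) / len(vals)))
```

When the new policy equals the logging policy, the importance weights are all one and the IPS estimate reduces to the plain empirical risk; the CRM objective then only adds the penalty for the losses' own sampling variance.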
An Efficient Algorithm for Deep Stochastic Contextual Bandits
In stochastic contextual bandit (SCB) problems, an agent selects an action
based on an observed context to maximize the cumulative reward over
iterations. Recently, several studies have used a deep neural network (DNN)
to predict the expected reward of an action, training the DNN with a
stochastic-gradient-based method. However, convergence analysis has been
largely neglected, leaving open whether and where these methods converge. In
this work, we formulate the SCB problem with a DNN reward function as a
non-convex stochastic optimization problem, and design a stage-wise
stochastic gradient descent algorithm to optimize the problem and determine
the action policy. We prove that, with high probability, the action sequence
chosen by this algorithm converges to a greedy action policy with respect to
a locally optimal reward function. Extensive experiments on multiple
real-world datasets demonstrate the effectiveness and efficiency of the
proposed algorithm.
Comment: Accepted by AAAI 202
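As a rough illustration of the setting, here is an epsilon-greedy SCB agent with a one-hidden-layer reward network per arm trained by plain SGD on squared error. This is a simplified sketch of the problem class, not the stage-wise algorithm the paper analyzes; all names and sizes are assumptions.

```python
import numpy as np

class DeepGreedyBandit:
    """Epsilon-greedy stochastic contextual bandit with a one-hidden-layer
    reward network per arm, trained by SGD (simplified sketch)."""

    def __init__(self, dim, n_arms, hidden=16, lr=0.05, eps=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W1 = self.rng.normal(0.0, 0.5, (n_arms, hidden, dim))
        self.w2 = self.rng.normal(0.0, 0.5, (n_arms, hidden))
        self.lr, self.eps = lr, eps

    def predict(self, x):
        h = np.tanh(self.W1 @ x)          # hidden activations, per arm
        return (self.w2 * h).sum(axis=1)  # predicted reward per arm

    def select(self, x):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(self.w2.shape[0]))
        return int(np.argmax(self.predict(x)))

    def update(self, x, arm, reward):
        h = np.tanh(self.W1[arm] @ x)
        err = self.w2[arm] @ h - reward
        # Gradients of 0.5 * err^2 for the chosen arm's network only
        g2 = err * h
        g1 = err * np.outer(self.w2[arm] * (1.0 - h ** 2), x)
        self.w2[arm] -= self.lr * g2
        self.W1[arm] -= self.lr * g1
```

The exploration parameter `eps` keeps every arm's network receiving fresh data; the convergence question the paper studies is precisely whether this kind of SGD-trained reward model leads the greedy policy anywhere meaningful.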