Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning
from logged bandit feedback. This learning setting is ubiquitous in online
systems (e.g., ad placement, web search, recommendation), where an algorithm
makes a prediction (e.g., ad ranking) for a given input (e.g., query) and
observes bandit feedback (e.g., user clicks on presented ads). We first address
the counterfactual nature of the learning problem through propensity scoring.
Next, we prove generalization error bounds that account for the variance of the
propensity-weighted empirical risk estimator. These constructive bounds give
rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM
can be used to derive a new learning method -- called Policy Optimizer for
Exponential Models (POEM) -- for learning stochastic linear rules for
structured output prediction. We present a decomposition of the POEM objective
that enables efficient stochastic gradient optimization. POEM is evaluated on
several multi-label classification problems showing substantially improved
robustness and generalization performance compared to the state-of-the-art.
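The propensity-weighted risk estimator with a variance penalty described above can be sketched as follows. This is a minimal illustration of the CRM principle, not the POEM implementation; the function name, the clipping constant, and the regularization weight `lam` are illustrative assumptions.

```python
import numpy as np

def crm_objective(losses, new_probs, logged_probs, clip=10.0, lam=0.1):
    """Clipped propensity-weighted (IPS) empirical risk plus a
    variance-based penalty, in the spirit of Counterfactual Risk
    Minimization. `losses` are outcomes observed under the logging
    policy, `logged_probs` are the propensities of the logged actions,
    and `new_probs` are the candidate policy's probabilities for the
    same actions."""
    # Importance weights, clipped to control variance
    w = np.minimum(new_probs / logged_probs, clip)
    terms = losses * w
    risk = terms.mean()                 # propensity-weighted empirical risk
    var = terms.var(ddof=1)             # sample variance of the estimator's terms
    n = len(losses)
    # CRM: penalize high-variance policies, following the generalization bound
    return risk + lam * np.sqrt(var / n)
```

Minimizing this objective over a policy class prefers policies whose importance-weighted risk is both low and reliably estimated from the logged data.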
Balanced Linear Contextual Bandits
Contextual bandit algorithms are sensitive to the estimation method of the
outcome model as well as the exploration method used, particularly in the
presence of rich heterogeneity or complex outcome models, which can lead to
difficult estimation problems along the path of learning. We develop
algorithms for contextual bandits with linear payoffs that integrate
balancing methods from the causal inference literature into their
estimation, making them less prone to estimation bias. We provide the
first regret-bound analyses for linear contextual bandits with balancing
and show that our algorithms match state-of-the-art theoretical
guarantees. We demonstrate the strong practical
advantage of balanced contextual bandits on a large number of supervised
learning datasets and on a synthetic example that simulates model
misspecification and prejudice in the initial training data.
Comment: AAAI 2019 Oral Presentation.
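The balancing idea in the estimation step can be sketched as an inverse-propensity-weighted ridge regression of observed payoffs on contexts. This is a hedged illustration of the general technique, not the paper's exact estimator; the function name and the ridge constant are assumptions.

```python
import numpy as np

def balanced_linear_estimate(X, rewards, propensities, ridge=1.0):
    """Balancing-weighted ridge estimate of a linear payoff model.

    Each logged observation (context row of X, observed reward) is
    reweighted by the inverse of the probability with which its arm
    was chosen, so the weighted sample better resembles one where
    arm assignment is independent of context."""
    w = 1.0 / propensities                        # balancing (IPW) weights
    d = X.shape[1]
    Xw = X * w[:, None]
    A = Xw.T @ X + ridge * np.eye(d)              # weighted Gram matrix + ridge
    b = Xw.T @ rewards                            # weighted covariance with rewards
    return np.linalg.solve(A, b)                  # estimated payoff coefficients
```

In a bandit loop, such an estimate per arm would feed the exploration rule (e.g., an upper-confidence or Thompson-sampling step); the reweighting is what mitigates the bias induced by adaptively collected data.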