Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning
from logged bandit feedback. This learning setting is ubiquitous in online
systems (e.g., ad placement, web search, recommendation), where an algorithm
makes a prediction (e.g., ad ranking) for a given input (e.g., query) and
observes bandit feedback (e.g., user clicks on presented ads). We first address
the counterfactual nature of the learning problem through propensity scoring.
Next, we prove generalization error bounds that account for the variance of the
propensity-weighted empirical risk estimator. These constructive bounds give
rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM
can be used to derive a new learning method -- called Policy Optimizer for
Exponential Models (POEM) -- for learning stochastic linear rules for
structured output prediction. We present a decomposition of the POEM objective
that enables efficient stochastic gradient optimization. POEM is evaluated on
several multi-label classification problems, showing substantially improved
robustness and generalization performance compared to the state-of-the-art.
Comment: 10 pages
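The CRM principle described above penalizes the propensity-weighted (inverse propensity scoring) empirical risk by its sample variance. The following is a minimal sketch of that objective; the function name, the penalty weight `lam`, and the toy inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def crm_objective(rewards, logging_probs, target_probs, lam=0.5):
    """Counterfactual Risk Minimization objective (sketch).

    Propensity-weighted (IPS) empirical risk plus a sample-variance
    penalty, in the spirit of the CRM principle. `lam` is a
    hypothetical trade-off hyperparameter, not taken from the paper.
    """
    losses = -np.asarray(rewards, dtype=float)       # risk = negative reward
    weights = np.asarray(target_probs, dtype=float) / np.asarray(
        logging_probs, dtype=float)                  # importance weights
    weighted = losses * weights
    n = len(weighted)
    risk = weighted.mean()                           # IPS risk estimate
    var = weighted.var(ddof=1)                       # sample variance
    return risk + lam * np.sqrt(var / n)             # variance-regularized risk
```

For example, with uniform logging and target probabilities the weights are 1 and the objective reduces to the mean loss plus the variance penalty. A larger `lam` trades off more of the estimated risk for lower estimator variance.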
Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation
The goal of counterfactual learning for statistical machine translation (SMT)
is to optimize a target SMT system from logged data consisting of user
feedback on translations predicted by another, historic SMT system. A
challenge arises from the fact that risk-averse commercial SMT systems
deterministically log the most probable translation. The lack of sufficient
exploration of the SMT output space seemingly contradicts the theoretical
requirements for counterfactual learning. We show that counterfactual learning
from deterministic bandit logs is possible nevertheless by smoothing out
deterministic components in learning. This can be achieved by additive and
multiplicative control variates that avoid degenerate behavior in empirical
risk minimization. Our simulation experiments show improvements of up to 2 BLEU
points by counterfactual learning from deterministic bandit feedback.
Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017, Copenhagen, Denmark
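Under deterministic logging the propensity of every logged translation is 1, so plain inverse-propensity weighting degenerates. One of the smoothing devices the abstract mentions, a multiplicative control variate, amounts to self-normalizing the target-policy probabilities. The sketch below illustrates that idea; the function name and inputs are hypothetical.

```python
import numpy as np

def self_normalized_risk(rewards, target_probs):
    """Reweighted empirical risk for deterministic bandit logs (sketch).

    With deterministic logging all propensities equal 1, so the
    estimator is smoothed by normalizing the target-policy
    probabilities over the log (a multiplicative control variate),
    which avoids the degenerate behavior of unnormalized weighting.
    """
    p = np.asarray(target_probs, dtype=float)
    r = np.asarray(rewards, dtype=float)
    w = p / p.sum()                  # self-normalized importance weights
    return float(np.sum(w * (-r)))   # risk = negative expected reward
```

Because the weights sum to one, uniformly scaling all target probabilities leaves the estimate unchanged, which is what prevents the optimizer from trivially inflating probabilities on high-reward logged items.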
Distributionally Robust Counterfactual Risk Minimization
This manuscript introduces the idea of using Distributionally Robust
Optimization (DRO) for the Counterfactual Risk Minimization (CRM) problem.
Tapping into a rich existing literature, we show that DRO is a principled tool
for counterfactual decision making. We also show that well-established
solutions to the CRM problem like sample variance penalization schemes are
special instances of a more general DRO problem. In this unifying framework, a
variety of distributionally robust counterfactual risk estimators can be
constructed using various probability distances and divergences as uncertainty
measures. We propose the use of Kullback-Leibler divergence as an alternative
way to model uncertainty in CRM and derive a new robust counterfactual
objective. In our experiments, we show that this approach outperforms the
state-of-the-art on four benchmark datasets, validating the relevance of using
other uncertainty measures in practical applications.
Comment: Accepted at AAAI2
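A KL-divergence uncertainty set admits a well-known dual form: the worst-case risk over all distributions within KL radius epsilon of the empirical distribution equals a minimization over a dual temperature variable. The sketch below evaluates that dual by grid search; the grid, the radius `epsilon`, and the function name are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np

def kl_dro_risk(losses, epsilon=0.1, lam_grid=None):
    """KL-robust counterfactual risk via its dual form (sketch).

    sup_{Q : KL(Q || P_n) <= epsilon} E_Q[loss]
      = min_{lam > 0} lam * epsilon + lam * log E_{P_n}[exp(loss / lam)]

    The grid search over the dual variable `lam` is a simplification
    for illustration; for large losses / small lam the exp term can
    overflow, so a log-sum-exp formulation would be preferred.
    """
    losses = np.asarray(losses, dtype=float)
    if lam_grid is None:
        lam_grid = np.logspace(-2, 2, 200)
    vals = [lam * epsilon + lam * np.log(np.mean(np.exp(losses / lam)))
            for lam in lam_grid]
    return float(min(vals))
```

By Jensen's inequality the robust risk upper-bounds the plain empirical risk, and it grows with the radius `epsilon`; setting `epsilon` to zero recovers (approximately, on the grid) the ordinary empirical mean.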