Learning from Logged Implicit Exploration Data
We provide a sound and consistent foundation for the use of nonrandom
exploration data in "contextual bandit" or "partially labeled" settings where
only the value of a chosen action is learned.
The primary challenge in a variety of settings is that the exploration
policy, in which "offline" data is logged, is not explicitly known. Prior
solutions here require either control of the actions during the learning
process, recorded random exploration, or actions chosen obliviously in a
repeated manner. The techniques reported here lift these restrictions, allowing
the learning of a policy for choosing actions given features from historical
data where no randomization occurred or was logged.
We empirically verify our solution on two reasonably sized sets of real-world
data obtained from Yahoo!
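One way to picture learning from logs with an unknown exploration policy is to estimate the logging propensities from the data and then reweight by them. The following is a minimal sketch under stated assumptions, not the authors' implementation: contexts are assumed to be discrete and hashable, the names `estimate_propensities`, `clipped_value_estimate`, and the threshold `tau` are hypothetical, and the empirical-frequency estimator is just one simple choice.

```python
from collections import Counter

def estimate_propensities(logs):
    """Empirical frequency of each action within each context.

    `logs` is a list of (context, action, reward) triples; contexts and
    actions are assumed hashable (an assumption of this sketch).
    """
    pair_counts = Counter((x, a) for x, a, _ in logs)
    context_counts = Counter(x for x, _, _ in logs)
    return {(x, a): c / context_counts[x] for (x, a), c in pair_counts.items()}

def clipped_value_estimate(logs, policy, tau=0.05):
    """Estimate the average reward of `policy` from logged data.

    Records where the policy agrees with the logged action are weighted by
    1 / max(p_hat, tau); `tau` (hypothetical default) caps the weights.
    """
    p_hat = estimate_propensities(logs)
    total = 0.0
    for x, a, r in logs:
        if policy(x) == a:
            total += r / max(p_hat[(x, a)], tau)
    return total / len(logs)
```

Clipping the estimated propensity from below trades some bias for lower variance when an action was rarely logged in a context.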
Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation
The goal of counterfactual learning for statistical machine translation (SMT)
is to optimize a target SMT system from logged data that consist of user
feedback to translations that were predicted by another, historic SMT system. A
challenge arises from the fact that risk-averse commercial SMT systems
deterministically log the most probable translation. The lack of sufficient
exploration of the SMT output space seemingly contradicts the theoretical
requirements for counterfactual learning. We show that counterfactual learning
from deterministic bandit logs is possible nevertheless by smoothing out
deterministic components in learning. This can be achieved by additive and
multiplicative control variates that avoid degenerate behavior in empirical
risk minimization. Our simulation experiments show improvements of up to 2 BLEU
points by counterfactual learning from deterministic bandit feedback.
Comment: Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2017, Copenhagen, Denmark
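To illustrate why deterministic logs can lead to degenerate behavior, and how a multiplicative correction can help, here is a small sketch. It assumes every logged output has propensity 1, so the reweighted objective reduces to reward times the target policy's probability of the logged output. The function names and the toy numbers are hypothetical, and self-normalization is used here as one example of a multiplicative control variate; the abstract does not spell out the exact estimators.

```python
def unnormalized_score(rewards, probs):
    """Mean of reward * policy probability over deterministically logged outputs."""
    return sum(r * p for r, p in zip(rewards, probs)) / len(rewards)

def self_normalized_score(rewards, probs):
    """Same numerator, divided by the total probability mass on logged outputs.

    Assumes sum(probs) > 0.
    """
    return sum(r * p for r, p in zip(rewards, probs)) / sum(probs)

# Toy log (hypothetical values): one well-rated and one poorly rated output.
rewards = [0.9, 0.1]

# Two candidate policies, described by the probability they give each logged output.
raise_everything = [0.9, 0.9]  # boosts every logged output, good or bad
prefer_good = [0.9, 0.1]       # boosts only the well-rated output

print(unnormalized_score(rewards, raise_everything),
      unnormalized_score(rewards, prefer_good))
print(self_normalized_score(rewards, raise_everything),
      self_normalized_score(rewards, prefer_good))
```

With non-negative rewards, the unnormalized score favors the policy that raises the probability of every logged output, while the self-normalized score favors the policy that concentrates probability on the well-rated one.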
Counterfactual Estimation and Optimization of Click Metrics for Search Engines
Optimizing an interactive system against a predefined online metric is
particularly challenging when the metric is computed from user feedback such
as clicks and payments. The key challenge is the counterfactual nature: in the
case of Web search, any change to a component of the search engine may result
in a different search result page for the same query, but we normally cannot
infer reliably from the search log how users would react to the new result page.
Consequently, it appears impossible to accurately estimate online metrics that
depend on user feedback, unless the new engine is run to serve users and
compared with a baseline in an A/B test. This approach, while valid and
successful, is unfortunately expensive and time-consuming. In this paper, we
propose to address this problem using causal inference techniques, under the
contextual-bandit framework. This approach effectively allows one to run
(potentially infinitely) many A/B tests offline from the search log, making it
possible to estimate and optimize online metrics quickly and inexpensively.
Focusing on an important component in a commercial search engine, we show how
these ideas can be instantiated and applied, and obtain very promising results
that suggest the wide applicability of these techniques.
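The idea of running an A/B test offline can be sketched with inverse propensity scoring, a standard contextual-bandit estimator. This is an illustration under assumptions rather than the paper's exact method: the log is assumed to record the probability (`propensity`) with which each action was shown, the target policy is assumed deterministic, and the names `ips_estimate` and `target_policy` are hypothetical.

```python
def ips_estimate(logs, target_policy):
    """Inverse-propensity-scored estimate of a target policy's average reward.

    `logs` is a list of (context, action, propensity, reward) tuples, where
    `propensity` is the probability with which the logging system chose
    `action` in `context` (assumed recorded and strictly positive).
    """
    total = 0.0
    for context, action, propensity, reward in logs:
        # Only records where the target policy would have made the same
        # choice contribute, up-weighted by how rarely that choice was logged.
        if target_policy(context) == action:
            total += reward / propensity
    return total / len(logs)

# Toy log (hypothetical): two result layouts shown uniformly at random,
# with a click (reward 1.0) only on layout "B".
logs = [
    ("q1", "A", 0.5, 0.0),
    ("q1", "B", 0.5, 1.0),
    ("q2", "A", 0.5, 0.0),
    ("q2", "B", 0.5, 1.0),
]
print(ips_estimate(logs, lambda context: "B"))
```

Because the same log can be re-scored against any number of candidate policies, each call plays the role of one offline comparison.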
Estimating the Maximum Expected Value: An Analysis of (Nested) Cross Validation and the Maximum Sample Average
We investigate the accuracy of the two most common estimators for the maximum
expected value of a general set of random variables: a generalization of the
maximum sample average, and cross validation. No unbiased estimator exists and
we show that it is non-trivial to select a good estimator without knowledge
about the distributions of the random variables. We investigate and bound the
bias and variance of the aforementioned estimators and prove consistency. The
variance of cross validation can be significantly reduced, but not without
risking a large bias. The bias and variance of different variants of cross
validation are shown to be very problem-dependent, and a wrong choice can lead
to very inaccurate estimates.
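A small simulation can make the two estimators concrete. The sketch below assumes Gaussian variables that all have true mean 0, so the maximum expected value is 0. It compares the maximum sample average with a simple two-fold cross-validation estimator (select the best variable on one half of the data, evaluate it on the other half). The function names, the two-fold split, and the simulation settings are choices made for this illustration and are not taken from the paper.

```python
import random
import statistics

def max_sample_average(samples):
    """Maximum over variables of each variable's sample mean."""
    return max(statistics.fmean(s) for s in samples)

def two_fold_cv_estimate(samples):
    """Pick the best variable on one half, score it on the other half,
    and average over both directions."""
    halves = [(s[0::2], s[1::2]) for s in samples]
    estimates = []
    for select, evaluate in ((0, 1), (1, 0)):
        best = max(range(len(samples)),
                   key=lambda i: statistics.fmean(halves[i][select]))
        estimates.append(statistics.fmean(halves[best][evaluate]))
    return statistics.fmean(estimates)

def simulate(num_vars=10, n=20, trials=1000, seed=0):
    """Average both estimators over repeated draws; all true means are 0."""
    rng = random.Random(seed)
    msa, cv = [], []
    for _ in range(trials):
        samples = [[rng.gauss(0.0, 1.0) for _ in range(n)]
                   for _ in range(num_vars)]
        msa.append(max_sample_average(samples))
        cv.append(two_fold_cv_estimate(samples))
    return statistics.fmean(msa), statistics.fmean(cv)

msa_mean, cv_mean = simulate()
print(msa_mean, cv_mean)
```

In this equal-means setting the maximum sample average comes out clearly above 0, reflecting its upward bias, while the cross-validation estimate stays near 0. Other configurations of means can behave quite differently, in line with the problem dependence the abstract describes.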