Cascade Model-based Propensity Estimation for Counterfactual Learning to Rank
Unbiased counterfactual learning to rank (CLTR) requires click propensities to
compensate, via inverse propensity scoring (IPS), for the difference between
user clicks and the true relevance of search results. Current propensity
estimation methods assume that user click behavior follows the position-based
model (PBM) and estimate click propensities under this assumption. In reality,
however, user clicks often follow the cascade model (CM), where users scan
search results from top to bottom and each next click depends on the previous
one. In this cascade
scenario, PBM-based estimates of propensities are not accurate, which, in turn,
hurts CLTR performance. In this paper, we propose a propensity estimation
method for the cascade scenario, called CM-IPS. We show that CM-IPS keeps CLTR
performance close to the full-information performance when user clicks follow
the CM, whereas PBM-based CLTR shows a significant gap to the full-information
performance. The opposite holds when user clicks follow the PBM instead of the
CM. Finally, we suggest a way to select between CM- and PBM-based
propensity estimation methods based on historical user clicks.

Comment: 4 pages, 2 figures, 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR '20)
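The contrast between the two propensity models can be sketched in a few lines. The abstract does not give the paper's actual CM-IPS estimator, so this sketch assumes the classic cascade model in which a user scans top to bottom and stops at the first click; all function names and the example numbers are illustrative.

```python
import numpy as np

def pbm_propensities(theta):
    """PBM: examination depends only on rank, so the click propensity at
    rank k is simply the rank's examination probability theta[k]."""
    return np.asarray(theta, dtype=float)

def cm_propensities(click_probs):
    """Cascade-model sketch: the user scans top to bottom and stops at the
    first click, so rank k is examined only if no higher rank was clicked.
    Examination propensity at rank k = prod_{i<k} (1 - P(click at rank i))."""
    p = np.asarray(click_probs, dtype=float)
    survived = np.cumprod(1.0 - p)  # P(no click at ranks 0..k)
    return np.concatenate(([1.0], survived[:-1]))

def ips_weights(propensities):
    """Inverse propensity scores used to debias the observed clicks."""
    return 1.0 / np.asarray(propensities, dtype=float)

# Under the cascade assumption, lower ranks are examined ever more rarely,
# so their clicks receive larger IPS weights than a PBM would assign.
print(cm_propensities([0.5, 0.5, 0.5]))               # [1.   0.5  0.25]
print(ips_weights(cm_propensities([0.5, 0.5, 0.5])))  # [1. 2. 4.]
```

If the true click behavior is cascade-like, the mismatch matters: a PBM fit would attribute the scarcity of low-rank clicks to fixed examination probabilities rather than to earlier clicks, producing inaccurate weights.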
Double Clipping: Less-Biased Variance Reduction in Off-Policy Evaluation
"Clipping" (a.k.a. importance weight truncation) is a widely used
variance-reduction technique for counterfactual off-policy estimators. Like
other variance-reduction techniques, clipping reduces variance at the cost of
increased bias. However, unlike other techniques, the bias introduced by
clipping is always a downward bias (assuming non-negative rewards), yielding a
lower bound on the true expected reward. In this work we propose a simple
extension, called double clipping, which aims to compensate for this
downward bias and thus reduce the overall bias, while maintaining the variance
reduction properties of the original estimator.

Comment: Presented at the CONSEQUENCES '23 workshop at the RecSys 2023
conference in Singapore
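The downward bias that double clipping targets is easy to demonstrate. The abstract does not specify the double-clipping correction itself, so this sketch only shows the plain clipped IPS estimator and the lower-bounding behavior it exhibits; names and the example data are illustrative.

```python
import numpy as np

def ips_estimate(rewards, weights):
    """Plain IPS off-policy estimate: unbiased, but high-variance whenever
    some importance weights are large."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.mean(w * r))

def clipped_ips_estimate(rewards, weights, clip_m):
    """Clipped IPS: truncate weights at clip_m before averaging. Since
    min(w, M) <= w and rewards are non-negative, every term can only shrink,
    so the clipped estimate lower-bounds the plain IPS estimate."""
    r = np.asarray(rewards, dtype=float)
    w = np.minimum(np.asarray(weights, dtype=float), clip_m)
    return float(np.mean(w * r))

# One heavy-tailed weight makes the effect visible: clipping at M = 5 pulls
# the estimate down, trading a downward bias for lower variance.
rewards = np.array([1.0, 0.0, 1.0, 1.0])
weights = np.array([0.5, 2.0, 1.0, 20.0])
print(ips_estimate(rewards, weights))                # 5.375
print(clipped_ips_estimate(rewards, weights, 5.0))   # 1.625
```

With non-negative rewards the clipped value can never exceed the unclipped one, which is exactly the systematic downward bias the proposed extension is designed to compensate.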
Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation
The goal of counterfactual learning for statistical machine translation (SMT)
is to optimize a target SMT system from logged data that consist of user
feedback to translations that were predicted by another, historical SMT
system. A challenge arises from the fact that risk-averse commercial SMT
systems
deterministically log the most probable translation. The lack of sufficient
exploration of the SMT output space seemingly contradicts the theoretical
requirements for counterfactual learning. We show that counterfactual learning
from deterministic bandit logs is possible nevertheless by smoothing out
deterministic components in learning. This can be achieved by additive and
multiplicative control variates that avoid degenerate behavior in empirical
risk minimization. Our simulation experiments show improvements of up to 2 BLEU
points by counterfactual learning from deterministic bandit feedback.

Comment: Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2017, Copenhagen, Denmark
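The smoothing idea can be sketched with a self-normalized objective, one common form of a multiplicative control variate. The abstract does not spell out the paper's exact estimators, so the reweighting below is an illustrative assumption; function names and example values are hypothetical.

```python
import numpy as np

def naive_risk(rewards, target_probs):
    """Naive objective under deterministic logging: average reward scaled by
    the target model's probability of the logged translation. Degenerate,
    because it can be driven up simply by inflating all probabilities."""
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(target_probs, dtype=float)
    return float(np.mean(r * p))

def reweighted_risk(rewards, target_probs):
    """Multiplicative control variate (self-normalization): dividing by the
    summed probabilities removes the incentive to inflate probability mass
    uniformly, smoothing out the deterministic logging component."""
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(target_probs, dtype=float)
    return float(np.sum(r * p) / np.sum(p))

# Doubling every probability doubles the naive objective but leaves the
# reweighted one unchanged -- the degenerate direction is removed.
r = np.array([0.8, 0.2, 0.5])
p = np.array([0.1, 0.3, 0.2])
print(naive_risk(r, p), naive_risk(r, 2 * p))            # 0.08  0.16
print(reweighted_risk(r, p), reweighted_risk(r, 2 * p))  # 0.4   0.4
```

The invariance to a uniform rescaling of the target probabilities is what prevents empirical risk minimization from exploiting the lack of exploration in deterministically logged data.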