Effective Evaluation using Logged Bandit Feedback from Multiple Loggers
Accurately evaluating new policies (e.g. ad-placement models, ranking
functions, recommendation functions) is one of the key prerequisites for
improving interactive systems. While the conventional approach to evaluation
relies on online A/B tests, recent work has shown that counterfactual
estimators can provide an inexpensive and fast alternative, since they can be
applied offline using log data that was collected from a different policy
fielded in the past. In this paper, we address the question of how to estimate
the performance of a new target policy when we have log data from multiple
historic policies. This question is of great relevance in practice, since
policies get updated frequently in most online systems. We show that naively
combining data from multiple logging policies can be highly suboptimal. In
particular, we find that the standard Inverse Propensity Score (IPS) estimator
suffers especially when logging and target policies diverge -- to a point where
throwing away data improves the variance of the estimator. We therefore propose
two alternative estimators which we characterize theoretically and compare
experimentally. We find that the new estimators can provide substantially
improved estimation accuracy.
Comment: KDD 2017
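To make the multi-logger setting concrete, below is a minimal, context-free Python sketch contrasting naive pooled IPS with a "balanced" variant that reweights by the sample-size mixture of the logging policies. The balanced form is only in the spirit of the paper's proposed estimators; the policy parameterization and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Two historic logging policies and one target policy, context-free for brevity.
pi_log1 = softmax(rng.normal(size=n_actions))
pi_log2 = softmax(rng.normal(size=n_actions))
pi_tgt = softmax(rng.normal(size=n_actions))
q = rng.uniform(size=n_actions)            # true expected reward per action

def log_data(pi, n):
    a = rng.choice(n_actions, size=n, p=pi)
    r = rng.binomial(1, q[a]).astype(float)
    return a, r

n1, n2 = 2_000, 8_000
a1, r1 = log_data(pi_log1, n1)
a2, r2 = log_data(pi_log2, n2)
a = np.concatenate([a1, a2])
r = np.concatenate([r1, r2])

# Naive IPS: pool all samples and reweight each by its own logger's propensity.
p_own = np.concatenate([pi_log1[a1], pi_log2[a2]])
ips_naive = np.mean(pi_tgt[a] / p_own * r)

# Balanced IPS: reweight by the mixture of the logging policies; the paper
# argues estimators of this flavor can have substantially lower variance.
pi_mix = (n1 * pi_log1 + n2 * pi_log2) / (n1 + n2)
ips_balanced = np.mean(pi_tgt[a] / pi_mix[a] * r)

print(f"true value   : {pi_tgt @ q:.4f}")
print(f"naive IPS    : {ips_naive:.4f}")
print(f"balanced IPS : {ips_balanced:.4f}")
```

Both estimators are unbiased here; the difference shows up in their variance when the loggers and the target policy diverge.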
Estimating Position Bias without Intrusive Interventions
Presentation bias is one of the key challenges when learning from implicit
feedback in search engines, as it confounds the relevance signal. While it was
recently shown how counterfactual learning-to-rank (LTR) approaches
\cite{Joachims/etal/17a} can provably overcome presentation bias when
observation propensities are known, it remains to show how to effectively
estimate these propensities. In this paper, we propose the first method for
producing consistent propensity estimates without manual relevance judgments,
disruptive interventions, or restrictive relevance modeling assumptions. First,
we show how to harvest a specific type of intervention data from historic
feedback logs of multiple different ranking functions, and show that this data
is sufficient for consistent propensity estimation in the position-based model.
Second, we propose a new extremum estimator that makes effective use of this
data. In an empirical evaluation, we find that the new estimator provides
superior propensity estimates in two real-world systems -- Arxiv Full-text
Search and Google Drive Search. Beyond these two points, we find that the
method is robust to a wide range of settings in simulation studies.
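The following small simulation sketch illustrates the harvesting idea under the position-based model: two historic rankers that place the same documents at different positions yield click-rate ratios that recover relative propensities, without any new intervention. The paper's extremum estimator is more sophisticated than this ratio; all names and values here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pos, n_queries = 5, 100_000
p_true = 1.0 / np.arange(1, n_pos + 1)     # PBM examination probabilities

# Per-query relevance of the n_pos candidate documents.
rel = rng.uniform(0.1, 0.9, size=(n_queries, n_pos))

# Ranker A shows document d at position d; a historic ranker B used a
# different fixed order, so the same documents appear at swapped positions.
perm_b = np.array([2, 0, 4, 1, 3])         # position of document d under B

def simulate_clicks(pos_of_doc):
    # Position-based model: click = examined(position) AND relevant(doc).
    examined = rng.random((n_queries, n_pos)) < p_true[pos_of_doc]
    return examined & (rng.random((n_queries, n_pos)) < rel)

clicks_a = simulate_clicks(np.arange(n_pos))
clicks_b = simulate_clicks(perm_b)

# Interventional-set idea: document slot d is shown at position d by A and
# at position perm_b[d] by B over the same relevance distribution, so
# click-rate ratios estimate propensity ratios.
ctr_a = clicks_a.mean(axis=0)
ctr_b = clicks_b.mean(axis=0)
for d in range(n_pos):
    k, kp = d, perm_b[d]
    print(f"p_{k}/p_{kp}: true={p_true[k] / p_true[kp]:.3f} "
          f"est={ctr_a[d] / ctr_b[d]:.3f}")
```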
Policy-Adaptive Estimator Selection for Off-Policy Evaluation
Off-policy evaluation (OPE) aims to accurately evaluate the performance of
counterfactual policies using only offline logged data. Although many
estimators have been developed, there is no single estimator that dominates the
others, because an estimator's accuracy can vary greatly with the
characteristics of the given OPE task, such as the evaluation policy, the
number of actions, and the noise level.
Thus, the data-driven estimator selection problem is becoming increasingly
important and can have a significant impact on the accuracy of OPE. However,
identifying the most accurate estimator using only the logged data is quite
challenging because the ground-truth estimation accuracy of estimators is
generally unavailable. This paper studies this challenging problem of estimator
selection for OPE for the first time. In particular, we enable estimator
selection that is adaptive to a given OPE task, by appropriately subsampling
available logged data and constructing pseudo policies useful for the
underlying estimator selection task. Comprehensive experiments on both
synthetic and real-world company data demonstrate that the proposed procedure
substantially improves the estimator selection compared to a non-adaptive
heuristic.
Comment: accepted at AAAI'23
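The Python sketch below illustrates the general recipe in a context-free bandit setting: subsample the log to manufacture a pseudo OPE task whose ground truth is known, then rank candidate estimators by their error on that pseudo task. The paper's actual construction of pseudo policies and subsamples is more elaborate; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_actions = 100_000, 10
pi_0 = np.full(n_actions, 1.0 / n_actions)   # behavior policy (uniform)
q = rng.uniform(size=n_actions)              # true expected rewards

a = rng.choice(n_actions, size=n, p=pi_0)
r = rng.binomial(1, q[a]).astype(float)

# Pseudo evaluation policy: a softmax tilt whose divergence from pi_0 can
# be tuned to mimic the real OPE task at hand.
pi_e = np.exp(3.0 * q)
pi_e /= pi_e.sum()

# Split the log, then rejection-sample one half so the accepted records are
# distributed as if collected on-policy by pi_e; their mean reward serves as
# a pseudo ground-truth value for the pseudo task.
idx = rng.permutation(n)
hold, logd = idx[: n // 2], idx[n // 2 :]
ratio = pi_e[a[hold]] / pi_0[a[hold]]
keep = hold[rng.random(len(hold)) < ratio / ratio.max()]
v_pseudo = r[keep].mean()

# Score candidate estimators on the untouched half against the pseudo truth.
al, rl = a[logd], r[logd]
w = pi_e[al] / pi_0[al]
candidates = {
    "IPS":   np.mean(w * rl),
    "SNIPS": np.sum(w * rl) / np.sum(w),
    "Naive": rl.mean(),                      # ignores the policy shift
}
for name, v in candidates.items():
    print(f"{name:5s} estimate={v:.4f}  sq. error vs pseudo truth="
          f"{(v - v_pseudo) ** 2:.6f}")
print(f"pseudo truth={v_pseudo:.4f}  true V(pi_e)={pi_e @ q:.4f}")
```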
Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization
Counterfactual learning to rank (CLTR) relies on exposure-based inverse
propensity scoring (IPS), an LTR-specific adaptation of IPS to correct for
position bias. While IPS can provide unbiased and consistent estimates, it
often suffers from high variance. Especially when little click data is
available, this variance can cause CLTR to learn sub-optimal ranking behavior.
Consequently, existing CLTR methods bring significant risks with them, as
naively deploying their models can result in very negative user experiences. We
introduce a novel risk-aware CLTR method with theoretical guarantees for safe
deployment. We apply a novel exposure-based concept of risk regularization to
IPS estimation for LTR. Our risk regularization penalizes the mismatch between
the ranking behavior of a learned model and a given safe model. Thereby, it
ensures that learned ranking models stay close to a trusted model, when there
is high uncertainty in IPS estimation, which greatly reduces the risks during
deployment. Our experimental results demonstrate the efficacy of our proposed
method, which is effective at avoiding initial periods of bad performance when
little data is available, while also maintaining high performance at
convergence. For the CLTR field, our novel exposure-based risk minimization
method enables practitioners to adopt CLTR methods in a safer manner that
mitigates many of the risks attached to previous methods.
Comment: SIGIR 2023 - Full paper
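As a minimal sketch of the core idea, the snippet below bolts an exposure-based risk term onto an IPS utility for ranking: the penalty keeps a candidate ranker's exposure profile close to that of a trusted model. This mirrors the concept only, not the paper's exact regularizer or its guarantees; all quantities are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n_docs = 8
clicks = rng.binomial(1, 0.3, size=n_docs).astype(float)
prop = 1.0 / np.arange(1, n_docs + 1)   # logged examination propensities

def exposure(scores):
    # Exposure of each document under the ranking induced by the scores,
    # using a DCG-style position weight 1 / log2(rank + 1).
    exp_ = np.empty_like(scores)
    exp_[np.argsort(-scores)] = 1.0 / np.log2(np.arange(2, len(scores) + 2))
    return exp_

safe_scores = rng.normal(size=n_docs)   # trusted production ranker
rho_safe = exposure(safe_scores)

def objective(scores, alpha):
    utility = np.sum(exposure(scores) * clicks / prop)   # IPS utility
    # Exposure-based risk: mismatch with the safe model's exposure, in the
    # spirit of the paper's regularizer (the exact form differs).
    risk = np.sum((exposure(scores) - rho_safe) ** 2)
    return utility - alpha * risk

candidates = {
    "near-safe":  safe_scores + 0.05 * rng.normal(size=n_docs),
    "aggressive": rng.normal(size=n_docs),
}
for name, s in candidates.items():
    for alpha in (0.0, 5.0):
        print(f"{name:10s} alpha={alpha:3.1f}  objective={objective(s, alpha):.3f}")
```

Raising alpha shifts the preference toward rankers whose exposure stays close to the safe model, which is exactly the behavior wanted when click data is scarce and IPS estimates are uncertain.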
On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation
Approaches to recommendation are typically evaluated in one of two ways: (1)
via a (simulated) online experiment, often seen as the gold standard, or (2)
via some offline evaluation procedure, where the goal is to approximate the
outcome of an online experiment. Several offline evaluation metrics have been
adopted in the literature, inspired by ranking metrics prevalent in the field
of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one
such metric that has seen widespread adoption in empirical studies, and higher
(n)DCG values have been used to present new methods as the state-of-the-art in
top-n recommendation for many years.
Our work takes a critical look at this approach, and investigates when we can
expect such metrics to approximate the gold standard outcome of an online
experiment. We formally present the assumptions that are necessary to consider
DCG an unbiased estimator of online reward and provide a derivation for this
metric from first principles, highlighting where we deviate from its
traditional uses in IR. Importantly, we show that normalising the metric
renders it inconsistent, in that even when DCG is unbiased, ranking competing
methods by their normalised DCG can invert their relative order. Through a
correlation analysis between off- and on-line experiments conducted on a
large-scale recommendation platform, we show that our unbiased DCG estimates
strongly correlate with online reward, even when some of the metric's inherent
assumptions are violated. This statement no longer holds for its normalised
variant, suggesting that nDCG's practical utility may be limited.
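A tiny worked example makes the inconsistency claim tangible: even with exact, noise-free metric values, averaging DCG over users can prefer one policy while averaging nDCG prefers the other, because per-user normalisation upweights users with a small ideal DCG. The lists are synthetic, and for brevity nDCG here normalises by the ideal ordering of the shown list.

```python
import numpy as np

def dcg(ranked_rel):
    ranked_rel = np.asarray(ranked_rel, dtype=float)
    return float(np.sum(ranked_rel / np.log2(np.arange(2, len(ranked_rel) + 2))))

def ndcg(ranked_rel):
    return dcg(ranked_rel) / dcg(sorted(ranked_rel, reverse=True))

# Binary relevance of the top-3 items each policy shows, for two users.
policy_A = [[1, 1, 1], [0, 0, 1]]   # great for user 1, poor for user 2
policy_B = [[0, 1, 1], [1, 0, 0]]   # decent for user 1, great for user 2

for name, lists in (("A", policy_A), ("B", policy_B)):
    print(f"policy {name}: mean DCG={np.mean([dcg(l) for l in lists]):.3f}  "
          f"mean nDCG={np.mean([ndcg(l) for l in lists]):.3f}")
# policy A wins on mean DCG (1.315 vs 1.065), while policy B wins on mean
# nDCG (0.847 vs 0.750): normalisation flips the relative order.
```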