70 research outputs found
A Common Misassumption in Online Experiments with Machine Learning Models
Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests
are the bread and butter of modern platforms on the web. They are conducted
continuously to allow platforms to estimate the causal effect of replacing
system variant "A" with variant "B", on some metric of interest. These variants
can differ in many aspects. In this paper, we focus on the common use-case
where they correspond to machine learning models. The online experiment then
serves as the final arbiter to decide which model is superior, and should thus
be shipped.
The statistical literature on causal effect estimation from RCTs has a
substantial history, which contributes deservedly to the level of trust
researchers and practitioners have in this "gold standard" of evaluation
practices. Nevertheless, in the particular case of machine learning
experiments, we remark that certain critical issues remain. Specifically, the
assumptions that are required to ascertain that A/B-tests yield unbiased
estimates of the causal effect are seldom met in practical applications. We
argue that, because variants typically learn using pooled data, a lack of model
interference cannot be guaranteed. This undermines the conclusions we can draw
from online experiments with machine learning models. We discuss the
implications this has for practitioners, and for the research literature.
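The causal estimand at stake can be made concrete with a toy difference-in-means estimate from a simulated A/B-test; this is an illustrative sketch (all numbers hypothetical), not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated A/B-test: each user is independently assigned to variant A or B,
# and we observe a binary reward (e.g. a click) per user.
n = 10_000
assignment = rng.integers(0, 2, size=n)  # 0 = variant A, 1 = variant B
reward = rng.binomial(1, np.where(assignment == 1, 0.12, 0.10))

# Difference-in-means estimate of the average treatment effect.
# This is only unbiased under no interference (SUTVA): one user's outcome
# must not depend on other users' assignments -- exactly the assumption
# that pooled training data between the model variants can violate.
ate_hat = reward[assignment == 1].mean() - reward[assignment == 0].mean()
print(f"estimated lift: {ate_hat:.4f} (true lift: 0.02)")
```

When interference is present, this estimator still produces a number, but it no longer targets the effect of deploying B instead of A in isolation.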
Offline Recommender System Evaluation under Unobserved Confounding
Off-Policy Estimation (OPE) methods allow us to learn and evaluate
decision-making policies from logged data. This makes them an attractive choice
for the offline evaluation of recommender systems, and several recent works
have reported successful adoption of OPE methods to this end. An important
assumption that makes this work is the absence of unobserved confounders:
random variables that influence both actions and rewards at data collection
time. Because the data collection policy is typically under the practitioner's
control, the unconfoundedness assumption is often left implicit, and its
violations are rarely dealt with in the existing literature.
This work aims to highlight the problems that arise when performing
off-policy estimation in the presence of unobserved confounders, specifically
focusing on a recommendation use-case. We focus on policy-based estimators,
where the logging propensities are learned from logged data. We characterise
the statistical bias that arises due to confounding, and show how existing
diagnostics are unable to uncover such cases. Because the bias depends directly
on the true and unobserved logging propensities, it is non-identifiable. As the
unconfoundedness assumption is famously untestable, this becomes especially
problematic. This paper emphasises this common, yet often overlooked issue.
Through synthetic data, we empirically show how naïve propensity estimation
under confounding can lead to severely biased metric estimates that are allowed
to fly under the radar. We aim to cultivate an awareness among researchers and
practitioners of this important problem, and touch upon potential research
directions towards mitigating its effects.
Comment: Accepted at the CONSEQUENCES'23 workshop at RecSys '23.
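The bias described above can be reproduced in a small synthetic sketch (all distributions hypothetical) of inverse-propensity scoring (IPS), where the learned propensities marginalise out an unobserved confounder:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Unobserved confounder u influences both the logged action and the reward.
u = rng.binomial(1, 0.5, size=n)
# True logging propensities depend on u: P(a=1|u=0)=0.2, P(a=1|u=1)=0.8.
p_a1 = np.where(u == 1, 0.8, 0.2)
a = rng.binomial(1, p_a1)
# Click probability depends on both the action and the confounder.
reward = rng.binomial(1, 0.2 + 0.3 * a + 0.3 * u * a)

# Target policy: always play a=1. Its true value, averaging over u:
true_value = 0.5 * 0.5 + 0.5 * 0.8  # = 0.65

# A propensity model "learned" without access to u can only recover the
# marginal action frequency, P(a=1) = 0.5.
p_hat = np.where(a == 1, a.mean(), 1.0 - a.mean())
ips_naive = np.mean((a == 1) * reward / p_hat)

# IPS with the true (but unobservable) propensities is unbiased.
p_true = np.where(a == 1, p_a1, 1.0 - p_a1)
ips_oracle = np.mean((a == 1) * reward / p_true)
# ips_naive overestimates true_value (roughly 0.74 vs 0.65): actions taken
# more often under u=1 also yield higher rewards, which the naive weights miss.
```

Because only the logged actions and rewards are observed, no diagnostic computed from this data alone can distinguish the naive estimate from an unbiased one.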
On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation
Approaches to recommendation are typically evaluated in one of two ways: (1)
via a (simulated) online experiment, often seen as the gold standard, or (2)
via some offline evaluation procedure, where the goal is to approximate the
outcome of an online experiment. Several offline evaluation metrics have been
adopted in the literature, inspired by ranking metrics prevalent in the field
of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one
such metric that has seen widespread adoption in empirical studies, and higher
(n)DCG values have been used to present new methods as the state-of-the-art in
top-n recommendation for many years.
Our work takes a critical look at this approach, and investigates when we can
expect such metrics to approximate the gold standard outcome of an online
experiment. We formally present the assumptions that are necessary to consider
DCG an unbiased estimator of online reward and provide a derivation for this
metric from first principles, highlighting where we deviate from its
traditional uses in IR. Importantly, we show that normalising the metric
renders it inconsistent, in that even when DCG is unbiased, ranking competing
methods by their normalised DCG can invert their relative order. Through a
correlation analysis between off- and on-line experiments conducted on a
large-scale recommendation platform, we show that our unbiased DCG estimates
strongly correlate with online reward, even when some of the metric's inherent
assumptions are violated. This statement no longer holds for its normalised
variant, suggesting that nDCG's practical utility may be limited.
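The inconsistency introduced by normalisation can be demonstrated with a toy example (rankings hypothetical): one method collects more raw DCG, yet normalising per user inverts the comparison.

```python
import numpy as np

def dcg(rels):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    ranks = np.arange(1, len(rels) + 1)
    return float(np.sum(np.asarray(rels) / np.log2(ranks + 1)))

def ndcg(rels):
    """DCG normalised by the ideal (re-sorted) ranking's DCG."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Hypothetical rankings for two users (1 = relevant item at that rank).
# User 1 has three relevant items; user 2 has only one.
method_a = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1]]
method_b = [[0, 0, 1, 0, 1, 1], [1, 0, 0, 0, 0, 0]]

mean_dcg_a = np.mean([dcg(r) for r in method_a])    # ~1.244
mean_dcg_b = np.mean([dcg(r) for r in method_b])    # ~1.122
mean_ndcg_a = np.mean([ndcg(r) for r in method_a])  # ~0.678
mean_ndcg_b = np.mean([ndcg(r) for r in method_b])  # ~0.792

# Method A wins on DCG, but normalisation inverts the ordering:
assert mean_dcg_a > mean_dcg_b and mean_ndcg_b > mean_ndcg_a
```

The inversion arises because normalisation rescales each user's contribution by their ideal DCG, so users with few relevant items dominate the normalised average.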
RecFusion: A Binomial Diffusion Process for 1D Data for Recommendation
In this paper we propose RecFusion, which comprises a set of diffusion models
for recommendation. Unlike image data which contain spatial correlations, a
user-item interaction matrix, commonly utilized in recommendation, lacks
spatial relationships between users and items. We formulate diffusion on a 1D
vector and propose binomial diffusion, which explicitly models binary user-item
interactions with a Bernoulli process. We show that RecFusion approaches the
performance of complex VAE baselines on the core recommendation setting (top-n
recommendation for binary non-sequential feedback) and the most common datasets
(MovieLens and Netflix). Our proposed diffusion models that are specialized for
1D and/or binary setups have implications beyond recommendation systems, such
as in the medical domain with MRI and CT scans.
Comment: code: https://github.com/gabriben/recfusio
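A rough sketch of what a binomial forward-diffusion step over a binary interaction vector can look like, following the Bernoulli formulation of Sohl-Dickstein et al. (2015) that the process described here resembles; the schedule and shapes are illustrative, not RecFusion's exact parameterisation:

```python
import numpy as np

rng = np.random.default_rng(7)

def forward_step(x_prev, beta_t):
    """One binomial-diffusion step: with probability beta_t, a bit is
    resampled from Bernoulli(0.5); otherwise it is kept as-is."""
    p = x_prev * (1.0 - beta_t) + 0.5 * beta_t
    return rng.binomial(1, p)

# A single user's binary interaction vector over a small item catalogue.
x0 = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])

x = x0.copy()
betas = np.linspace(0.05, 0.5, 10)  # illustrative noise schedule
for beta_t in betas:
    x = forward_step(x, beta_t)
# After enough steps, x approaches uniform Bernoulli(0.5) noise; the
# reverse model is trained to denoise back toward the observed x0.
```

The key difference from Gaussian diffusion is that every intermediate state stays binary, matching the binary feedback data directly.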
Offline Evaluation of Reward-Optimizing Recommender Systems: The Case of Simulation
Both in academic and industry-based research, online evaluation methods are
seen as the gold standard for interactive applications like recommender
systems. Naturally, this is because we can directly measure utility
metrics that rely on interventions: the recommendations that are shown to
users. Nevertheless, online evaluation methods are costly for a number
of reasons, and a clear need remains for reliable offline evaluation
procedures. In industry, offline metrics are often used as a first-line
evaluation to generate promising candidate models to evaluate online. In
academic work, limited access to online systems makes offline metrics the de
facto approach to validating novel methods. Two classes of offline metrics
exist: proxy-based methods, and counterfactual methods. The first class is
often poorly correlated with the online metrics we care about, and the latter
class only provides theoretical guarantees under assumptions that cannot be
fulfilled in real-world environments. Here, we make the case that
simulation-based comparisons provide ways forward beyond offline metrics, and
argue that they are a preferable means of evaluation.
Comment: Accepted at the ACM RecSys 2021 Workshop on Simulation Methods for Recommender Systems.
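A minimal sketch of the kind of simulation-based comparison advocated above, with a hypothetical click model (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hidden simulator state: each item's click probability when recommended.
n_items = 20
attractiveness = rng.uniform(0.01, 0.15, size=n_items)

def run(policy, n_rounds=5_000):
    """Interact with the simulator and return the policy's empirical CTR."""
    clicks = 0
    for _ in range(n_rounds):
        item = policy()
        clicks += rng.random() < attractiveness[item]
    return clicks / n_rounds

def uniform_policy():
    return int(rng.integers(n_items))

def oracle_policy():
    # Upper bound: peeks at the simulator's hidden parameters.
    return int(np.argmax(attractiveness))

ctr_uniform = run(uniform_policy)
ctr_oracle = run(oracle_policy)
# Unlike offline proxy metrics, the simulator measures the interventional
# quantity directly: the click-through rate each policy would obtain.
```

The simulator's assumptions are explicit and inspectable, which is precisely what the counterfactual estimators' untestable assumptions are not.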
BLOB : A Probabilistic Model for Recommendation that Combines Organic and Bandit Signals
A common task for recommender systems is to build a profile of the interests
of a user from items in their browsing history and later to recommend items to
the user from the same catalog. The users' behavior consists of two parts: the
sequence of items that they viewed without intervention (the organic part) and
the sequences of items recommended to them and their outcome (the bandit part).
In this paper, we propose Bayesian Latent Organic Bandit model (BLOB), a
probabilistic approach to combine the 'organic' and 'bandit' signals in order
to improve the estimation of recommendation quality. The bandit signal is
valuable as it gives direct feedback of recommendation performance, but the
signal quality is very uneven, as it is highly concentrated on the
recommendations deemed optimal by the past version of the recommender system.
In contrast, the organic signal is typically strong and covers most items, but
is not always relevant to the recommendation task. In order to leverage the
organic signal to efficiently learn the bandit signal in a Bayesian model, we
identify three fundamental types of distances, namely action-history,
action-action and history-history distances. We implement a scalable
approximation of the full model using variational auto-encoders and the local
reparameterization trick. We show, using extensive simulation studies, that our
method outperforms or matches the value of both state-of-the-art organic-based
recommendation algorithms and bandit-based methods (both value- and
policy-based), in organic and bandit-rich environments.
Comment: 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug 2020, San Diego, United States.
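The local reparameterization trick mentioned in the abstract (Kingma et al., 2015) can be sketched as follows; the shapes and the factorised Gaussian weight posterior are illustrative, not BLOB's exact model:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_reparam_linear(x, w_mu, w_logvar):
    """Sample a linear layer's pre-activations directly from their implied
    Gaussian, instead of sampling a weight matrix first: this is the local
    reparameterization trick, which lowers gradient variance."""
    act_mu = x @ w_mu
    act_var = (x ** 2) @ np.exp(w_logvar)
    eps = rng.standard_normal(act_mu.shape)
    return act_mu + np.sqrt(act_var) * eps

# Hypothetical shapes: 4 user histories over 10 items, 3 latent dimensions.
x = rng.binomial(1, 0.3, size=(4, 10)).astype(float)
w_mu = 0.1 * rng.standard_normal((10, 3))
w_logvar = np.full((10, 3), -4.0)
z = local_reparam_linear(x, w_mu, w_logvar)  # shape (4, 3)
```

Sampling at the activation level keeps the per-example noise independent, which is what makes the variational approximation scale to large catalogues.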
Bio-On-Magnetic-Beads (BOMB): Open platform for high-throughput nucleic acid extraction and manipulation
Current molecular biology laboratories rely heavily on the purification and manipulation of nucleic acids. Yet, commonly used centrifuge- and column-based protocols require specialised equipment, often use toxic reagents and are not economically scalable or practical to use in a high-throughput manner. Although it has been known for some time that magnetic beads can provide an elegant answer to these issues, the development of open-source protocols based on beads has been limited. In this article, we provide step-by-step instructions for an easy synthesis of functionalised magnetic beads, and detailed protocols for their use in the high-throughput purification of plasmids, genomic DNA and total RNA from different sources, as well as environmental TNA and PCR amplicons. We also provide a bead-based protocol for bisulfite conversion, and size selection of DNA and RNA fragments. Comparison to other methods highlights the capability, versatility and extreme cost-effectiveness of using magnetic beads. These open-source protocols and the associated webpage (https://bomb.bio) can serve as a platform for further protocol customisation and community engagement.