70 research outputs found

    A Common Misassumption in Online Experiments with Machine Learning Models

    Full text link
    Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests are the bread and butter of modern platforms on the web. They are conducted continuously to allow platforms to estimate the causal effect of replacing system variant "A" with variant "B", on some metric of interest. These variants can differ in many aspects. In this paper, we focus on the common use-case where they correspond to machine learning models. The online experiment then serves as the final arbiter to decide which model is superior, and should thus be shipped. The statistical literature on causal effect estimation from RCTs has a substantial history, which contributes deservedly to the level of trust researchers and practitioners have in this "gold standard" of evaluation practices. Nevertheless, in the particular case of machine learning experiments, we remark that certain critical issues remain. Specifically, the assumptions that are required to ascertain that A/B-tests yield unbiased estimates of the causal effect, are seldom met in practical applications. We argue that, because variants typically learn using pooled data, a lack of model interference cannot be guaranteed. This undermines the conclusions we can draw from online experiments with machine learning models. We discuss the implications this has for practitioners, and for the research literature

    Offline Recommender System Evaluation under Unobserved Confounding

    Full text link
    Off-Policy Estimation (OPE) methods allow us to learn and evaluate decision-making policies from logged data. This makes them an attractive choice for the offline evaluation of recommender systems, and several recent works have reported successful adoption of OPE methods to this end. An important assumption that makes this work is the absence of unobserved confounders: random variables that influence both actions and rewards at data collection time. Because the data collection policy is typically under the practitioner's control, the unconfoundedness assumption is often left implicit, and its violations are rarely dealt with in the existing literature. This work aims to highlight the problems that arise when performing off-policy estimation in the presence of unobserved confounders, specifically focusing on a recommendation use-case. We focus on policy-based estimators, where the logging propensities are learned from logged data. We characterise the statistical bias that arises due to confounding, and show how existing diagnostics are unable to uncover such cases. Because the bias depends directly on the true and unobserved logging propensities, it is non-identifiable. As the unconfoundedness assumption is famously untestable, this becomes especially problematic. This paper emphasises this common, yet often overlooked issue. Through synthetic data, we empirically show how na\"ive propensity estimation under confounding can lead to severely biased metric estimates that are allowed to fly under the radar. We aim to cultivate an awareness among researchers and practitioners of this important problem, and touch upon potential research directions towards mitigating its effects.Comment: Accepted at the CONSEQUENCES'23 workshop at RecSys '2

    On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-nn Recommendation

    Full text link
    Approaches to recommendation are typically evaluated in one of two ways: (1) via a (simulated) online experiment, often seen as the gold standard, or (2) via some offline evaluation procedure, where the goal is to approximate the outcome of an online experiment. Several offline evaluation metrics have been adopted in the literature, inspired by ranking metrics prevalent in the field of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies, and higher (n)DCG values have been used to present new methods as the state-of-the-art in top-nn recommendation for many years. Our work takes a critical look at this approach, and investigates when we can expect such metrics to approximate the gold standard outcome of an online experiment. We formally present the assumptions that are necessary to consider DCG an unbiased estimator of online reward and provide a derivation for this metric from first principles, highlighting where we deviate from its traditional uses in IR. Importantly, we show that normalising the metric renders it inconsistent, in that even when DCG is unbiased, ranking competing methods by their normalised DCG can invert their relative order. Through a correlation analysis between off- and on-line experiments conducted on a large-scale recommendation platform, we show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated. This statement no longer holds for its normalised variant, suggesting that nDCG's practical utility may be limited

    RecFusion: A Binomial Diffusion Process for 1D Data for Recommendation

    Full text link
    In this paper we propose RecFusion, which comprise a set of diffusion models for recommendation. Unlike image data which contain spatial correlations, a user-item interaction matrix, commonly utilized in recommendation, lacks spatial relationships between users and items. We formulate diffusion on a 1D vector and propose binomial diffusion, which explicitly models binary user-item interactions with a Bernoulli process. We show that RecFusion approaches the performance of complex VAE baselines on the core recommendation setting (top-n recommendation for binary non-sequential feedback) and the most common datasets (MovieLens and Netflix). Our proposed diffusion models that are specialized for 1D and/or binary setups have implications beyond recommendation systems, such as in the medical domain with MRI and CT scans.Comment: code: https://github.com/gabriben/recfusio

    Offline Evaluation of Reward-Optimizing Recommender Systems: The Case of Simulation

    Full text link
    Both in academic and industry-based research, online evaluation methods are seen as the golden standard for interactive applications like recommendation systems. Naturally, the reason for this is that we can directly measure utility metrics that rely on interventions, being the recommendations that are being shown to users. Nevertheless, online evaluation methods are costly for a number of reasons, and a clear need remains for reliable offline evaluation procedures. In industry, offline metrics are often used as a first-line evaluation to generate promising candidate models to evaluate online. In academic work, limited access to online systems makes offline metrics the de facto approach to validating novel methods. Two classes of offline metrics exist: proxy-based methods, and counterfactual methods. The first class is often poorly correlated with the online metrics we care about, and the latter class only provides theoretical guarantees under assumptions that cannot be fulfilled in real-world environments. Here, we make the case that simulation-based comparisons provide ways forward beyond offline metrics, and argue that they are a preferable means of evaluation.Comment: Accepted at the ACM RecSys 2021 Workshop on Simulation Methods for Recommender System

    BLOB : A Probabilistic Model for Recommendation that Combines Organic and Bandit Signals

    Full text link
    A common task for recommender systems is to build a pro le of the interests of a user from items in their browsing history and later to recommend items to the user from the same catalog. The users' behavior consists of two parts: the sequence of items that they viewed without intervention (the organic part) and the sequences of items recommended to them and their outcome (the bandit part). In this paper, we propose Bayesian Latent Organic Bandit model (BLOB), a probabilistic approach to combine the 'or-ganic' and 'bandit' signals in order to improve the estimation of recommendation quality. The bandit signal is valuable as it gives direct feedback of recommendation performance, but the signal quality is very uneven, as it is highly concentrated on the recommendations deemed optimal by the past version of the recom-mender system. In contrast, the organic signal is typically strong and covers most items, but is not always relevant to the recommendation task. In order to leverage the organic signal to e ciently learn the bandit signal in a Bayesian model we identify three fundamental types of distances, namely action-history, action-action and history-history distances. We implement a scalable approximation of the full model using variational auto-encoders and the local re-paramerization trick. We show using extensive simulation studies that our method out-performs or matches the value of both state-of-the-art organic-based recommendation algorithms, and of bandit-based methods (both value and policy-based) both in organic and bandit-rich environments.Comment: 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug 2020, San Diego, United State

    Bio-On-Magnetic-Beads (BOMB): Open platform for high-throughput nucleic acid extraction and manipulation

    Get PDF
    Current molecular biology laboratories rely heavily on the purification and manipulation of nucleic acids. Yet, commonly used centrifuge- and column-based protocols require specialised equipment, often use toxic reagents and are not economically scalable or practical to use in a high-throughput manner. Although it has been known for some time that magnetic beads can provide an elegant answer to these issues, the development of open-source protocols based on beads has been limited. In this article, we provide step-by-step instructions for an easy synthesis of functionalised magnetic beads, and detailed protocols for their use in the high-throughput purification of plasmids, genomic DNA and total RNA from different sources, as well as environmental TNA and PCR amplicons. We also provide a bead-based protocol for bisulfite conversion, and size selection of DNA and RNA fragments. Comparison to other methods highlights the capability, versatility and extreme cost-effectiveness of using magnetic beads. These open source protocols and the associated webpage (https://bomb.bio) can serve as a platform for further protocol customisation and community engagement
    • …
    corecore