6 research outputs found

    Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix

    Surrogate index approaches have recently become a popular method of estimating longer-term impact from shorter-term outcomes. In this paper, we leverage 1098 test arms from 200 A/B tests at Netflix to empirically investigate to what degree decisions made using a surrogate index built on 14 days of data would align with those made using direct measurement of day-63 treatment effects. Focusing specifically on linear "auto-surrogate" models that utilize the shorter-term observations of the long-term outcome of interest, we find that the statistical inferences that we would draw from using the surrogate index are ~95% consistent with those from directly measuring the long-term treatment effect. Moreover, when we restrict ourselves to the set of tests that would be "launched" (i.e. positive and statistically significant) based on the 63-day directly measured treatment effects, we find that relying instead on the surrogate index achieves 79% and 65% recall

    Continuous Experimentation for Automotive Software on the Example of a Heavy Commercial Vehicle in Daily Operation

    As the automotive industry focuses its attention more and more on the software functionality of vehicles, techniques to deliver new software value at a fast pace are needed. Continuous Experimentation, a practice coming from the web-based systems world, is one such technique. It enables researchers and developers to use real-world data to verify their hypotheses and steer the software evolution based on performance and user preferences, reducing the reliance on simulations and guesswork. Several challenges prevent the verbatim adoption of this practice on automotive cyber-physical systems, e.g., safety concerns and limitations from computational resources; nonetheless, the automotive field is starting to take interest in this technique. This work aims at demonstrating and evaluating a prototypical Continuous Experimentation infrastructure, implemented on a distributed computational system housed in a commercial truck tractor that is used in daily operations by a logistics company on public roads. The system comprises computing units and sensors, and software deployment and data retrieval are only possible remotely via a mobile data connection due to the commercial interests of the logistics company. This study shows that the proposed experimentation process resulted in the development team being able to base software development choices on the real-world data collected during the experimental procedure. Additionally, a set of previously identified design criteria to enable Continuous Experimentation on automotive systems was discussed and their validity confirmed in the light of the presented work. Comment: Paper accepted to the 14th European Conference on Software Architecture (ECSA 2020). 16 pages, 5 figures

    Estimating Effects of Long-Term Treatments

    Estimating the effects of long-term treatments in A/B testing presents a significant challenge. Such treatments, including updates to product functions, user interface designs, and recommendation algorithms, are intended to remain in the system for a long period after their launches. On the other hand, given the constraints of conducting long-term experiments, practitioners often rely on short-term experimental results to make product launch decisions. It remains an open question how to accurately estimate the effects of long-term treatments using short-term experimental data. To address this question, we introduce a longitudinal surrogate framework. We show that, under standard assumptions, the effects of long-term treatments can be decomposed into a series of functions, which depend on the user attributes, the short-term intermediate metrics, and the treatment assignments. We describe the identification assumptions, the estimation strategies, and the inference technique under this framework. Empirically, we show that our approach outperforms existing solutions by leveraging two real-world experiments, each involving millions of users on WeChat, one of the world's largest social networking platforms.