Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix
Surrogate index approaches have recently become a popular method of
estimating longer-term impact from shorter-term outcomes. In this paper, we
leverage 1098 test arms from 200 A/B tests at Netflix to empirically
investigate to what degree decisions made using a surrogate index
utilizing 14 days of data would align with those made using direct measurement
of day 63 treatment effects. Focusing specifically on linear "auto-surrogate"
models that utilize the shorter-term observations of the long-term outcome of
interest, we find that the statistical inferences that we would draw from using
the surrogate index are ~95% consistent with those from directly measuring the
long-term treatment effect. Moreover, when we restrict ourselves to the set of
tests that would be "launched" (i.e. positive and statistically significant)
based on the 63-day directly measured treatment effects, we find that relying
instead on the surrogate index achieves 79% and 65% recall
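
As a rough illustration of the linear "auto-surrogate" idea described in this abstract, the sketch below fits a least-squares model that predicts a long-term outcome from 14 daily observations of that same outcome, applies it to a new experiment, and computes recall against a set of direct launch decisions. It is a minimal sketch on simulated data, not the paper's implementation: the function names, the simulated coefficients, the 1.96 launch threshold, and the simple difference-in-means standard error (which ignores uncertainty in the fitted model) are all assumptions made for illustration.

```python
import numpy as np

def fit_auto_surrogate(short_term, long_term):
    """Least-squares fit of long_term ~ intercept + short_term @ coef."""
    X = np.column_stack([np.ones(len(short_term)), short_term])
    beta, *_ = np.linalg.lstsq(X, long_term, rcond=None)
    return beta

def surrogate_index(beta, short_term):
    """Predicted long-term outcome for each unit (the surrogate index)."""
    X = np.column_stack([np.ones(len(short_term)), short_term])
    return X @ beta

def effect_estimate(values, treated):
    """Difference in means between arms, with a Welch standard error."""
    t, c = values[treated], values[~treated]
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return diff, se

def recall(direct_launch, surrogate_launch):
    """Share of directly measured launches that the surrogate also launches."""
    direct = np.asarray(direct_launch, dtype=bool)
    surrogate = np.asarray(surrogate_launch, dtype=bool)
    return (direct & surrogate).sum() / direct.sum()

rng = np.random.default_rng(0)
n = 5000

# Hypothetical historical data: 14 daily values and the long-term outcome.
true_coef = np.linspace(0.1, 0.5, 14)
hist_short = rng.normal(size=(n, 14))
hist_long = 1.0 + hist_short @ true_coef + rng.normal(scale=0.1, size=n)
beta = fit_auto_surrogate(hist_short, hist_long)

# Hypothetical new experiment in which only 14 days have been observed.
treated = rng.random(n) < 0.5
exp_short = rng.normal(size=(n, 14)) + 0.05 * treated[:, None]
index = surrogate_index(beta, exp_short)
diff, se = effect_estimate(index, treated)
launch = diff / se > 1.96  # positive and statistically significant

# Recall over a toy set of four test arms (made-up decisions).
toy_recall = recall([True, True, False, True], [True, False, True, True])
```

Here `launch` plays the role of the surrogate-based decision for a single test arm; comparing such decisions with those from the directly measured long-term effect across many arms gives the consistency and recall figures the abstract reports.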
Continuous Experimentation for Automotive Software on the Example of a Heavy Commercial Vehicle in Daily Operation
As the automotive industry focuses its attention more and more towards the
software functionality of vehicles, techniques to deliver new software value at
a fast pace are needed. Continuous Experimentation, a practice coming from the
web-based systems world, is one such technique. It enables researchers and
developers to use real-world data to verify their hypotheses and steer the
software evolution based on performance and user preferences, reducing the
reliance on simulations and guesswork. Several challenges prevent the verbatim
adoption of this practice on automotive cyber-physical systems, e.g., safety
concerns and limitations from computational resources; nonetheless, the
automotive field is starting to take interest in this technique. This work aims
at demonstrating and evaluating a prototypical Continuous Experimentation
infrastructure, implemented on a distributed computational system housed in a
commercial truck tractor that is used in daily operations by a logistics company
on public roads. The system comprises computing units and sensors, and software
deployment and data retrieval are only possible remotely via a mobile data
connection due to the commercial interests of the logistics company. This study
shows that the proposed experimentation process resulted in the development
team being able to base software development choices on the real-world data
collected during the experimental procedure. Additionally, a set of previously
identified design criteria to enable Continuous Experimentation on automotive
systems was discussed and their validity confirmed in the light of the
presented work.
Comment: Paper accepted to the 14th European Conference on Software
Architecture (ECSA 2020). 16 pages, 5 figures
Estimating Effects of Long-Term Treatments
Estimating the effects of long-term treatments in A/B testing presents a
significant challenge. Such treatments -- including updates to product
functions, user interface designs, and recommendation algorithms -- are
intended to remain in the system for a long period after their launches. On the
other hand, given the constraints of conducting long-term experiments,
practitioners often rely on short-term experimental results to make product
launch decisions. It remains an open question how to accurately estimate the
effects of long-term treatments using short-term experimental data. To address
this question, we introduce a longitudinal surrogate framework. We show that,
under standard assumptions, the effects of long-term treatments can be
decomposed into a series of functions, which depend on the user attributes, the
short-term intermediate metrics, and the treatment assignments. We describe the
identification assumptions, the estimation strategies, and the inference
technique under this framework. Empirically, we show that our approach
outperforms existing solutions by leveraging two real-world experiments, each
involving millions of users on WeChat, one of the world's largest social
networking platforms.