Suppose an online platform wants to compare a treatment and a control policy,
e.g., two different matching algorithms in a ridesharing system, or two
different inventory management algorithms in an online retail site. Standard
randomized controlled trials are typically not feasible, since the goal is to
estimate policy performance on the entire system. Instead, current practice
typically alternates dynamically between the two policies for fixed lengths of
time and uses the difference in each policy's average performance over the
intervals in which it was run as an estimate of the treatment effect.
However, this approach suffers from *temporal interference*: one algorithm
alters the state of the system as seen by the second algorithm, biasing
estimates of the treatment effect. Further, the simple non-adaptive nature of
such designs implies they are not sample efficient.
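
The bias from temporal interference can be illustrated with a small exact calculation. The sketch below is not taken from the paper: the two-state transition matrices `P_a` and `P_b`, the reward vector `r`, and the helper names are hypothetical, chosen only to show how a fixed-interval alternating design can misestimate the steady-state difference when each policy inherits the state left behind by the other.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a transition matrix P (assumed irreducible)."""
    n = P.shape[0]
    # Solve pi P = pi together with sum(pi) = 1 as one least-squares system.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def block_average(mu_start, P, r, L):
    """Expected average reward over an L-step block started from mu_start.

    Also returns the state distribution at the end of the block.
    """
    mu, total = mu_start, 0.0
    for _ in range(L):
        total += mu @ r   # expected reward in the current state
        mu = mu @ P       # advance the state distribution one step
    return total / L, mu

def switchback_estimate(P_a, P_b, r, L):
    """Long-run value of the naive estimate when alternating A and B in blocks of length L."""
    # In the long run, the state distribution at the start of each A block
    # is stationary for one full cycle: L steps of A followed by L steps of B.
    cycle = np.linalg.matrix_power(P_a, L) @ np.linalg.matrix_power(P_b, L)
    mu_a = stationary(cycle)
    avg_a, mu_b = block_average(mu_a, P_a, r, L)
    avg_b, _ = block_average(mu_b, P_b, r, L)
    return avg_a - avg_b

# Hypothetical example: policy A pushes the system toward state 1, policy B toward state 0.
P_a = np.array([[0.5, 0.5], [0.1, 0.9]])
P_b = np.array([[0.9, 0.1], [0.5, 0.5]])
r = np.array([0.0, 1.0])  # reward 1 in state 1, reward 0 in state 0

true_effect = stationary(P_a) @ r - stationary(P_b) @ r
for L in (1, 5, 50):
    print(L, switchback_estimate(P_a, P_b, r, L), true_effect)
```

In this toy instance, switching every step makes each policy collect rewards in the states produced by the other, so the naive estimate even has the wrong sign; longer intervals shrink the bias but spend more samples per comparison.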
We develop a benchmark theoretical model in which to study optimal
experimental design for this setting. We view testing the two policies as the
problem of estimating the steady-state difference in reward between two unknown
Markov chains (i.e., policies). We assume estimation of the steady-state reward
for each chain proceeds via nonparametric maximum likelihood, and search for
consistent (i.e., asymptotically unbiased) experimental designs that are
efficient (i.e., asymptotically minimum variance). Characterizing such designs
is equivalent to a Markov decision problem with a minimum variance objective;
such problems generally do not admit tractable solutions. Remarkably, in our
setting, using a novel application of classical martingale analysis of Markov
chains via Poisson's equation, we characterize efficient designs via a succinct
convex optimization problem. We use this characterization to propose a
consistent, efficient online experimental design that adaptively samples the
two Markov chains.
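
As a rough illustration of the estimation step, the following is a minimal sketch of a plug-in estimate of steady-state reward from a single observed trajectory on a finite state space. It assumes the nonparametric maximum likelihood estimate takes the usual form of empirical transition frequencies and per-state average rewards; the function name and the data layout are hypothetical. It does not implement the adaptive sampling design itself.

```python
import numpy as np

def plugin_steady_state_reward(states, rewards, n_states):
    """Plug-in estimate of long-run average reward for one chain.

    states  : visited states x_0, ..., x_T (ints in [0, n_states))
    rewards : rewards[t] is the reward observed in state x_t, t = 0, ..., T-1
    Assumes every state appears at least once as the source of a transition.
    """
    states = np.asarray(states)
    rewards = np.asarray(rewards, dtype=float)
    src, dst = states[:-1], states[1:]

    # Empirical transition counts and per-state visit counts.
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (src, dst), 1.0)
    visits = counts.sum(axis=1)
    if np.any(visits == 0):
        raise ValueError("every state must be observed as a transition source")

    P_hat = counts / visits[:, None]  # empirical transition matrix
    r_hat = np.bincount(src, weights=rewards, minlength=n_states) / visits

    # Stationary distribution of P_hat: pi P = pi and sum(pi) = 1.
    # Uniqueness is assumed here; it can fail if P_hat is reducible.
    A = np.vstack([P_hat.T - np.eye(n_states), np.ones((1, n_states))])
    b = np.concatenate([np.zeros(n_states), [1.0]])
    pi_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi_hat @ r_hat)

# Toy trajectories, one per policy; the treatment effect estimate is the difference.
est_a = plugin_steady_state_reward([0, 1, 0, 1, 0], [1, 3, 1, 3], n_states=2)
est_b = plugin_steady_state_reward([0, 0, 1, 0, 0, 1, 0], [0, 0, 3, 0, 0, 3], n_states=2)
print(est_a - est_b)
```

Each chain is estimated only from the transitions observed while its own policy was running, which is what lets the estimate target each policy's own steady state rather than a blend of the two.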