We propose a class of nonparametric two-sample tests with a cost linear in
the sample size. Two tests are given, both based on an ensemble of distances
between analytic functions representing each of the distributions. The first
test uses smoothed empirical characteristic functions to represent the
distributions; the second uses distribution embeddings in a reproducing kernel
Hilbert space. Analyticity implies that differences in the distributions may be
detected almost surely at a finite number of randomly chosen
locations/frequencies. The new tests are consistent against a larger class of
alternatives than the previous linear-time tests based on the (non-smoothed)
empirical characteristic functions, while being much faster than the current
state-of-the-art quadratic-time kernel-based or energy distance-based tests.
Experiments on artificial benchmarks and on challenging real-world testing
problems demonstrate that our tests give a better power/time tradeoff than
competing approaches, and in some cases, better outright power than even the
most expensive quadratic-time tests. This performance advantage is retained
even in high dimensions, and in cases where the difference in distributions is
not observable with low-order statistics.
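
As a rough illustration of the second test's idea (comparing kernel embeddings of the two samples at a small number of randomly chosen locations, at a cost linear in the sample size), a minimal sketch follows. The Gaussian kernel, the bandwidth, the small ridge term `reg`, the Hotelling-type normalization by the sample covariance, and all function names are assumptions made for this example and are not taken from the text above.

```python
import math
import random


def gaussian_kernel(x, t, bandwidth=1.0):
    """Gaussian kernel between a sample point x and a test location t (assumed kernel)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, t))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mean_embedding_statistic(xs, ys, locations, bandwidth=1.0, reg=1e-8):
    """Hypothetical statistic: n * mean(z)^T cov(z)^{-1} mean(z), where
    z[i][j] = k(x_i, t_j) - k(y_i, t_j). One pass over the data, so the cost
    is linear in n for a fixed number of locations J."""
    n = min(len(xs), len(ys))
    num_loc = len(locations)
    z = [
        [gaussian_kernel(xs[i], t, bandwidth) - gaussian_kernel(ys[i], t, bandwidth)
         for t in locations]
        for i in range(n)
    ]
    mean = [sum(row[j] for row in z) / n for j in range(num_loc)]
    # Sample covariance of the feature differences, with a small ridge (assumed)
    # on the diagonal for numerical stability.
    cov = [
        [sum((row[p] - mean[p]) * (row[q] - mean[q]) for row in z) / (n - 1)
         + (reg if p == q else 0.0)
         for q in range(num_loc)]
        for p in range(num_loc)
    ]
    w = solve(cov, mean)
    return n * sum(mu * v for mu, v in zip(mean, w))


# Toy usage: two Gaussian samples in 2 dimensions, 3 random test locations.
rng = random.Random(0)


def sample(n, d, shift=0.0):
    return [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n)]


locations = sample(3, 2)
stat_same = mean_embedding_statistic(sample(400, 2), sample(400, 2), locations)
stat_shifted = mean_embedding_statistic(sample(400, 2), sample(400, 2, shift=2.0), locations)
print(stat_same, stat_shifted)
```

Under the normalization assumed here, a statistic of this Hotelling type would typically be compared against a chi-squared quantile with as many degrees of freedom as there are test locations; the toy usage only contrasts the statistic for equal and for mean-shifted samples.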