FastMMD: Ensemble of Circular Discrepancy for Efficient Two-Sample Test
The maximum mean discrepancy (MMD) is a recently proposed test statistic for
the two-sample test. Its quadratic time complexity, however, greatly hampers its
applicability to large-scale problems. To accelerate the MMD calculation, in
this study we propose an efficient method called FastMMD. The core idea of
FastMMD is to equivalently transform the MMD with shift-invariant kernels into
the amplitude expectation of a linear combination of sinusoid components based
on Bochner's theorem and Fourier transform (Rahimi & Recht, 2007). Taking
advantage of sampling in the frequency domain of the Fourier transform, FastMMD
decreases the time complexity of the MMD calculation from $O(N^2 d)$ to $O(LNd)$,
where $N$ and $d$ are the size and dimension of the sample set, respectively.
Here $L$ is the number of basis functions for approximating kernels, which
determines the approximation accuracy. For kernels that are spherically
invariant, the computation can be further accelerated to $O(LN \log d)$ by using
the Fastfood technique (Le et al., 2013). The uniform convergence of our method has also
been theoretically proved for both unbiased and biased estimates. We have
further provided a geometric explanation of our method, namely ensemble of
circular discrepancy, which helps clarify the insight underlying MMD and may
inspire further metrics for assessing the two-sample test. Experimental results
substantiate that FastMMD attains accuracy similar to exact MMD, while offering
faster computation and lower variance than existing MMD approximation methods.
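To make the random-feature idea above concrete, here is a minimal sketch of approximating the squared MMD for a Gaussian kernel with random Fourier features (Rahimi & Recht, 2007), the construction the abstract builds on. This is an illustration, not the authors' FastMMD implementation; the function name `rff_mmd2` and the parameter defaults are assumptions:

```python
import numpy as np

def rff_mmd2(X, Y, sigma=1.0, L=256, seed=0):
    """Approximate squared MMD for a Gaussian kernel with L random
    Fourier features: O(L*N*d) work instead of the exact O(N^2*d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Bochner's theorem: the Gaussian kernel is the Fourier transform of a
    # Gaussian spectral density, so sample frequencies w ~ N(0, sigma^-2 I).
    W = rng.normal(scale=1.0 / sigma, size=(d, L))
    b = rng.uniform(0.0, 2.0 * np.pi, size=L)
    # phi(z) = sqrt(2/L) * cos(z @ W + b), so E[phi(x) . phi(y)] ~ k(x, y).
    phi = lambda Z: np.sqrt(2.0 / L) * np.cos(Z @ W + b)
    mu_X, mu_Y = phi(X).mean(axis=0), phi(Y).mean(axis=0)  # mean embeddings
    return float(np.sum((mu_X - mu_Y) ** 2))  # biased MMD^2 estimate

# Example: two Gaussians with shifted means give a clearly positive MMD.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(1000, 5))
Y = rng.normal(0.5, 1.0, size=(1000, 5))
print(rff_mmd2(X, Y, sigma=1.0))
```

The accuracy of the approximation improves as the number of features `L` grows, which is the trade-off the abstract's $O(LNd)$ complexity expresses.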
Equivalence of distance-based and RKHS-based statistics in hypothesis testing
We provide a unifying framework linking two classes of statistics used in
two-sample and independence testing: on the one hand, the energy distances and
distance covariances from the statistics literature; on the other, maximum mean
discrepancies (MMD), that is, distances between embeddings of distributions to
reproducing kernel Hilbert spaces (RKHS), as established in machine learning.
In the case where the energy distance is computed with a semimetric of negative
type, a positive definite kernel, termed distance kernel, may be defined such
that the MMD corresponds exactly to the energy distance. Conversely, for any
positive definite kernel, we can interpret the MMD as energy distance with
respect to some negative-type semimetric. This equivalence readily extends to
distance covariance using kernels on the product space. We determine the class
of probability distributions for which the test statistics are consistent
against all alternatives. Finally, we investigate the performance of the family
of distance kernels in two-sample and independence tests: we show in particular
that the energy distance most commonly employed in statistics is just one
member of a parametric family of kernels, and that other choices from this
family can yield more powerful tests.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by
the Institute of Mathematical Statistics (http://www.imstat.org);
http://dx.doi.org/10.1214/13-AOS1140
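As a concrete check of the equivalence described above, the sketch below builds the distance-induced kernel $k(x,y) = \tfrac{1}{2}(\|x - z_0\| + \|y - z_0\| - \|x - y\|)$ and verifies numerically that twice the biased MMD$^2$ estimate matches the plug-in energy distance. This is an illustration under that construction, not the authors' code; the base point `z0` and the sample sizes are arbitrary choices:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(X, Y):
    """V-statistic estimate of E(P,Q) = 2E||X-Y|| - E||X-X'|| - E||Y-Y'||."""
    return 2 * cdist(X, Y).mean() - cdist(X, X).mean() - cdist(Y, Y).mean()

def distance_kernel_mmd2(X, Y, z0):
    """Biased MMD^2 with the distance-induced kernel
    k(x, y) = (||x - z0|| + ||y - z0|| - ||x - y||) / 2."""
    k = lambda A, B: 0.5 * (cdist(A, z0) + cdist(B, z0).T - cdist(A, B))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 3))
Y = rng.normal(0.3, 1.0, size=(200, 3))
z0 = np.zeros((1, 3))  # any base point yields the same MMD
# The two quantities agree up to floating-point error: energy = 2 * MMD^2.
print(energy_distance(X, Y), 2 * distance_kernel_mmd2(X, Y, z0))
```

The agreement holds for empirical distributions exactly, which is why the identity can be checked on finite samples rather than only in expectation.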