251 research outputs found
Global and Local Two-Sample Tests via Regression
Two-sample testing is a fundamental problem in statistics. Despite its long
history, there has been renewed interest in this problem with the advent of
high-dimensional and complex data. Specifically, in the machine learning
literature, there have been recent methodological developments such as
classification accuracy tests. The goal of this work is to present a regression
approach to comparing multivariate distributions of complex data. Depending on
the chosen regression model, our framework can efficiently handle different
types of variables and various structures in the data, with competitive power
under many practical scenarios. Whereas previous work has been largely limited
to global tests which conceal much of the local information, our approach
naturally leads to a local two-sample testing framework in which we identify
local differences between multivariate distributions with statistical
confidence. We demonstrate the efficacy of our approach both theoretically and
empirically, under some well-known parametric and nonparametric regression
methods. Our proposed methods are applied to simulated data as well as a
challenging astronomy data set to assess their practical usefulness.
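The regression idea in this abstract can be illustrated with a minimal sketch: label the two samples 0 and 1, regress the label on the covariates, and calibrate the fit statistic with permutations. The linear model, sample sizes, and R-squared statistic below are our own illustrative choices, not the paper's method.

```python
import numpy as np

# Illustrative sketch of a regression-based two-sample test:
# regress the 0/1 group label on the covariates and use a
# permutation null for the fit statistic.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 3))   # sample from P
Y = rng.normal(0.8, 1.0, size=(100, 3))   # sample from Q (shifted mean)
Z = np.vstack([X, Y])
labels = np.r_[np.zeros(100), np.ones(100)]

def regression_stat(Z, labels):
    # R^2 of a least-squares regression of the group label on the
    # features; a large R^2 means the covariates separate the samples
    A = np.c_[np.ones(len(Z)), Z]
    coef, *_ = np.linalg.lstsq(A, labels, rcond=None)
    resid = labels - A @ coef
    return 1.0 - resid.var() / labels.var()

obs = regression_stat(Z, labels)
null = [regression_stat(Z, rng.permutation(labels)) for _ in range(500)]
pval = (1 + sum(s >= obs for s in null)) / (len(null) + 1)
```

Swapping in a nonparametric regression for the linear fit is what would allow the local version described in the abstract: the fitted regression function estimates where the two distributions differ.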
Remember the Curse of Dimensionality: The Case of Goodness-of-Fit Testing in Arbitrary Dimension
Despite a substantial literature on nonparametric two-sample goodness-of-fit
testing in arbitrary dimensions spanning decades, that literature makes no
mention of any curse of dimensionality. Only recently have Ramdas et al. (2015)
discussed this issue in the context of kernel methods, showing that their
performance degrades with the dimension even when the underlying distributions
are isotropic Gaussians. We take a minimax perspective and follow in the
footsteps of Ingster (1987) to derive the minimax rate in arbitrary dimension
when the discrepancy is measured in the L2 metric. That rate is revealed to be
nonparametric and to exhibit a prototypical curse of dimensionality. We further
extend Ingster's work to show that the chi-squared test achieves the minimax
rate. Moreover, we show that the test can be made to work when the
distributions have support of low intrinsic dimension. Finally, inspired by
Ingster (2000), we consider a multiscale version of the chi-squared test which
can adapt to unknown smoothness and/or unknown intrinsic dimensionality without
much loss in power.
Comment: This version comes after the publication of the paper in the Journal
of Nonparametric Statistics. The main change is to cite the work of Ramdas et
al. Some very minor typos were also corrected.
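The chi-squared test the abstract builds on can be sketched in its simplest form: partition the sample space into cells, compare the two samples' cell counts, and calibrate by permutation. The one-dimensional setting, equal-frequency binning, and sample sizes below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Minimal sketch of a binned chi-squared two-sample test with a
# permutation null, in one dimension for simplicity.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500)
y = rng.normal(0.5, 1.0, 500)
pooled = np.r_[x, y]

# equal-frequency bin edges from the pooled sample
edges = np.quantile(pooled, np.linspace(0.0, 1.0, 11))
edges[0], edges[-1] = -np.inf, np.inf

def chi2_stat(a, b):
    ca, _ = np.histogram(a, edges)
    cb, _ = np.histogram(b, edges)
    tot = ca + cb
    ea = tot * len(a) / (len(a) + len(b))   # expected counts under H0
    eb = tot - ea
    return ((ca - ea) ** 2 / ea).sum() + ((cb - eb) ** 2 / eb).sum()

obs = chi2_stat(x, y)
null = []
for _ in range(300):
    perm = rng.permutation(pooled)
    null.append(chi2_stat(perm[:500], perm[500:]))
pval = (1 + sum(s >= obs for s in null)) / (len(null) + 1)
```

The multiscale version described in the abstract would run such a test over a family of bin widths and adjust for the multiplicity, which is what yields adaptation to unknown smoothness.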
Generalized Kernel Two-Sample Tests
Kernel two-sample tests have been widely used for multivariate data in
testing the equality of distributions. However, existing tests based on mapping
distributions into a reproducing kernel Hilbert space are mainly targeted at
specific alternatives and do not work well in some scenarios when the
dimension of the data is moderate to high, due to the curse of dimensionality.
We propose a new test statistic that makes use of a common pattern under
moderate and high dimensions and achieves substantial power improvements over
existing kernel two-sample tests for a wide range of alternatives. We also
propose alternative testing procedures that maintain high power with low
computational cost, offering easy off-the-shelf tools for large datasets. The
new approaches are compared to other state-of-the-art tests under various
settings and show good performance. The new approaches are illustrated on two
applications: the comparison of musks and non-musks using the shape of
molecules, and the comparison of taxi trips starting from John F. Kennedy
Airport in consecutive months. All proposed methods are implemented in the R
package kerTests.
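For context, a standard kernel two-sample test of the kind this abstract generalizes is the Gaussian-kernel MMD with a permutation null. The sketch below is that baseline, written in Python for illustration; the paper's generalized statistics and its R package kerTests differ from this plain MMD, and the data and bandwidth choices here are our own assumptions.

```python
import numpy as np

# Baseline Gaussian-kernel MMD permutation test (not the paper's
# generalized statistic): biased squared MMD between two samples.
rng = np.random.default_rng(2)
n = 60
X = rng.normal(0.0, 1.0, size=(n, 5))
Y = rng.normal(0.7, 1.0, size=(n, 5))   # mean-shifted alternative
Z = np.vstack([X, Y])

d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
bw2 = np.median(d2[d2 > 0])                          # median heuristic
K = np.exp(-d2 / bw2)

def mmd2(idx):
    # biased squared MMD between the groups idx[:n] and idx[n:]
    a, b = idx[:n], idx[n:]
    return (K[np.ix_(a, a)].mean() + K[np.ix_(b, b)].mean()
            - 2 * K[np.ix_(a, b)].mean())

obs = mmd2(np.arange(2 * n))
null = [mmd2(rng.permutation(2 * n)) for _ in range(300)]
pval = (1 + sum(s >= obs for s in null)) / (len(null) + 1)
```

Precomputing the kernel matrix once and permuting only the index split keeps each permutation cheap, which is the same practical concern the abstract's low-cost procedures address at larger scale.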