Global and Local Two-Sample Tests via Regression
Two-sample testing is a fundamental problem in statistics. Despite its long
history, there has been renewed interest in this problem with the advent of
high-dimensional and complex data. Specifically, in the machine learning
literature, there have been recent methodological developments such as
classification accuracy tests. The goal of this work is to present a regression
approach to comparing multivariate distributions of complex data. Depending on
the chosen regression model, our framework can efficiently handle different
types of variables and various structures in the data, with competitive power
under many practical scenarios. Whereas previous work has been largely limited
to global tests which conceal much of the local information, our approach
naturally leads to a local two-sample testing framework in which we identify
local differences between multivariate distributions with statistical
confidence. We demonstrate the efficacy of our approach both theoretically and
empirically, under some well-known parametric and nonparametric regression
methods. Our proposed methods are applied to simulated data as well as a
challenging astronomy data set to assess their practical usefulness.
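The regression idea in this abstract can be illustrated with a small sketch: label the two samples 0 and 1, regress the label on the observations, and measure how far the fitted values stray from the pooled label proportion. The abstract does not specify the regression model, the test statistic, or the calibration method, so the k-nearest-neighbour regression, the squared-gap statistic, the permutation calibration, and all function names below are assumptions made for illustration on one-dimensional data.

```python
import random

def nearest_neighbours(xs, k):
    """For each point, the indices of its k nearest points (the point itself included)."""
    return [sorted(range(len(xs)), key=lambda j: abs(xs[j] - x))[:k] for x in xs]

def global_statistic(neighbours, labels):
    """Average squared gap between the k-NN regression fit and the pooled label mean."""
    pi_hat = sum(labels) / len(labels)
    fitted = [sum(labels[j] for j in nbrs) / len(nbrs) for nbrs in neighbours]
    return sum((m - pi_hat) ** 2 for m in fitted) / len(fitted)

def permutation_p_value(sample_a, sample_b, k=10, n_perm=200, seed=0):
    """Permutation p-value for the hypothesis that both samples share one distribution."""
    rng = random.Random(seed)
    xs = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    # The neighbour structure depends only on xs, so it is computed once.
    neighbours = nearest_neighbours(xs, k)
    observed = global_statistic(neighbours, labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # break any link between label and location
        if global_statistic(neighbours, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = random.Random(1)
    a = [rng.gauss(0.0, 1.0) for _ in range(40)]
    b = [rng.gauss(3.0, 1.0) for _ in range(40)]
    print(permutation_p_value(a, b))
```

If the two samples come from the same distribution, the fitted regression should stay close to the pooled proportion everywhere, which is why a large statistic counts as evidence against the null in this sketch.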
Two sample tests for high-dimensional covariance matrices
We propose two tests for the equality of covariance matrices between two
high-dimensional populations. One test is on the whole variance-covariance
matrices, and the other is on off-diagonal sub-matrices, which define the
covariance between two nonoverlapping segments of the high-dimensional random
vectors. The tests are applicable (i) when the data dimension is much larger
than the sample sizes, namely the "large p, small n" situations, and (ii)
without assuming parametric distributions for the two populations. These two
aspects surpass the capability of the conventional likelihood ratio test. The
proposed tests can be used to test on covariances associated with gene ontology
terms.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by
the Institute of Mathematical Statistics (http://www.imstat.org) at
http://dx.doi.org/10.1214/12-AOS993
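To make the hypothesis of equal covariance matrices concrete, the following is a minimal sketch of a distribution-free comparison. It is not the pair of tests proposed in this abstract: it swaps in a plain permutation test whose statistic is the squared Frobenius distance between the two sample covariance matrices, and it is only meant for small dimensions. Each sample is centred first so that mean differences do not drive the result; that step, the statistic, and all function names are assumptions made for illustration.

```python
import random

def center(sample):
    """Subtract the column means of a sample (a list of equal-length rows)."""
    d = len(sample[0])
    means = [sum(row[j] for row in sample) / len(sample) for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in sample]

def sample_covariance(rows):
    """Unbiased sample covariance matrix of a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def frobenius_gap(rows_a, rows_b):
    """Squared Frobenius distance between the two sample covariance matrices."""
    cov_a, cov_b = sample_covariance(rows_a), sample_covariance(rows_b)
    return sum((x - y) ** 2 for ra, rb in zip(cov_a, cov_b) for x, y in zip(ra, rb))

def covariance_permutation_test(sample_a, sample_b, n_perm=200, seed=0):
    """Return the observed statistic and its permutation p-value."""
    rng = random.Random(seed)
    pooled = center(sample_a) + center(sample_b)
    n = len(sample_a)
    observed = frobenius_gap(pooled[:n], pooled[n:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign rows to the two groups at random
        if frobenius_gap(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = random.Random(2)
    a = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]
    b = [[rng.gauss(0.0, 3.0) for _ in range(3)] for _ in range(40)]
    print(covariance_permutation_test(a, b))
```

Permuting rows tests exchangeability of the centred observations, which is a stronger condition than equal covariances, so this sketch is best read as an illustration of the hypothesis and not as a substitute for a test designed for the high-dimensional regime.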
Generalized Kernel Two-Sample Tests
Kernel two-sample tests have been widely used for multivariate data to
test equality of distributions. However, existing tests based on mapping
distributions into a reproducing kernel Hilbert space are mainly targeted at
specific alternatives and do not work well for some scenarios when the
dimension of the data is moderate to high due to the curse of dimensionality.
We propose a new test statistic that makes use of a common pattern under
moderate and high dimensions and achieves substantial power improvements over
existing kernel two-sample tests for a wide range of alternatives. We also
propose alternative testing procedures that maintain high power with low
computational cost, offering easy off-the-shelf tools for large datasets. The
new approaches are compared to other state-of-the-art tests under various
settings and show good performance. The new approaches are illustrated on two
applications: the comparison of musks and non-musks using the shape of
molecules, and the comparison of taxi trips starting from John F. Kennedy airport
in consecutive months. All proposed methods are implemented in the R package
kerTests.
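For readers unfamiliar with the existing kernel tests that this abstract takes as its baseline, here is a minimal sketch of a classical maximum mean discrepancy (MMD) permutation test. It is not the new statistic proposed in the abstract and does not reproduce the kerTests interface. The Gaussian kernel, the fixed bandwidth, the biased MMD estimate, and all function names are assumptions made for illustration.

```python
import math
import random

def gaussian_kernel(u, v, bandwidth):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(gram, idx_a, idx_b):
    """Biased (V-statistic) estimate of squared MMD from a precomputed Gram matrix."""
    def mean_block(rows, cols):
        return sum(gram[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return (mean_block(idx_a, idx_a) + mean_block(idx_b, idx_b)
            - 2.0 * mean_block(idx_a, idx_b))

def mmd_permutation_test(sample_a, sample_b, bandwidth=1.0, n_perm=200, seed=0):
    """Return the observed squared MMD and its permutation p-value."""
    rng = random.Random(seed)
    pooled = list(sample_a) + list(sample_b)
    n = len(sample_a)
    # The Gram matrix is computed once; permutations only reshuffle indices.
    gram = [[gaussian_kernel(u, v, bandwidth) for v in pooled] for u in pooled]
    indices = list(range(len(pooled)))
    observed = mmd_squared(gram, indices[:n], indices[n:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(indices)
        if mmd_squared(gram, indices[:n], indices[n:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = random.Random(3)
    a = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(30)]
    b = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(30)]
    print(mmd_permutation_test(a, b))
```

The bandwidth is fixed at 1.0 here purely for simplicity; in practice it is usually chosen from the data, and the quadratic cost of the Gram matrix is one reason cheaper procedures are attractive for large datasets.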