On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives
Nonparametric two sample testing deals with the question of consistently
deciding if two distributions are different, given samples from both, without
making any parametric assumptions about the form of the distributions. The
current literature is split into two kinds of tests - those which are
consistent without any assumptions about how the distributions may differ
(\textit{general} alternatives), and those which are designed to specifically
test easier alternatives, like a difference in means (\textit{mean-shift}
alternatives).
The main contribution of this paper is to explicitly characterize the power
of a popular nonparametric two sample test, designed for general alternatives,
under a mean-shift alternative in the high-dimensional setting. Specifically,
we explicitly derive the power of the linear-time Maximum Mean Discrepancy
statistic using the Gaussian kernel, where the dimension and sample size can
both tend to infinity at any rate, and the two distributions differ in their
means. As a corollary, we find that if the signal-to-noise ratio is held
constant, then the test's power goes to one if the number of samples increases
faster than the dimension increases. This is the first explicit power
derivation for a general nonparametric test in the high-dimensional setting,
and also the first analysis of how tests designed for general alternatives
perform when faced with easier ones.
Comment: 25 pages, 5 figures
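The linear-time statistic the abstract analyzes can be sketched as follows: consecutive sample pairs each contribute one kernel-based term, and the statistic is the average, so the cost is linear in the sample size. This is a minimal illustrative sketch; the function names, bandwidth default, and test setup below are our choices, not the paper's exact construction.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)), evaluated row-wise
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * bandwidth ** 2))

def linear_time_mmd(X, Y, bandwidth=1.0):
    # Pair up consecutive samples (x_{2i-1}, x_{2i}) and (y_{2i-1}, y_{2i});
    # each pair contributes one term
    #   h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1),
    # and the statistic is the average of h over the pairs.
    n = (min(len(X), len(Y)) // 2) * 2
    x1, x2 = X[0:n:2], X[1:n:2]
    y1, y2 = Y[0:n:2], Y[1:n:2]
    h = (gaussian_kernel(x1, x2, bandwidth)
         + gaussian_kernel(y1, y2, bandwidth)
         - gaussian_kernel(x1, y2, bandwidth)
         - gaussian_kernel(x2, y1, bandwidth))
    return h.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
Y_same = rng.normal(size=(4000, 5))                 # same distribution
Y_shift = rng.normal(loc=1.0, size=(4000, 5))       # mean-shift alternative
mmd_null = linear_time_mmd(X, Y_same)
mmd_shift = linear_time_mmd(X, Y_shift)
```

Under the null the statistic concentrates near zero, while under a mean-shift alternative it stays bounded away from zero, which is the behavior whose power the paper characterizes as dimension and sample size grow.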
Global and Local Two-Sample Tests via Regression
Two-sample testing is a fundamental problem in statistics. Despite its long
history, there has been renewed interest in this problem with the advent of
high-dimensional and complex data. Specifically, in the machine learning
literature, there have been recent methodological developments such as
classification accuracy tests. The goal of this work is to present a regression
approach to comparing multivariate distributions of complex data. Depending on
the chosen regression model, our framework can efficiently handle different
types of variables and various structures in the data, with competitive power
under many practical scenarios. Whereas previous work has been largely limited
to global tests which conceal much of the local information, our approach
naturally leads to a local two-sample testing framework in which we identify
local differences between multivariate distributions with statistical
confidence. We demonstrate the efficacy of our approach both theoretically and
empirically, under some well-known parametric and nonparametric regression
methods. Our proposed methods are applied to simulated data as well as a
challenging astronomy data set to assess their practical usefulness.
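The regression idea can be illustrated in its simplest classification-accuracy form: label each point by which sample it came from, regress the label on the covariates, and check whether held-out predictions beat chance. The plain least-squares model, the split scheme, and all names below are our illustrative assumptions, not the paper's framework.

```python
import numpy as np

def regression_two_sample_accuracy(X, Y, rng):
    # Pool the samples, label X-points 0 and Y-points 1, fit a linear
    # least-squares regression of the label on the covariates on one half,
    # and report held-out accuracy on the other half. Under H0 (equal
    # distributions) the label is unpredictable, so accuracy hovers near 0.5.
    Z = np.vstack([X, Y])
    labels = np.concatenate([np.zeros(len(X)), np.ones(len(Y))])
    order = rng.permutation(len(Z))
    Z, labels = Z[order], labels[order]
    half = len(Z) // 2
    design = np.column_stack([Z[:half], np.ones(half)])          # add intercept
    w, *_ = np.linalg.lstsq(design, labels[:half], rcond=None)
    test_design = np.column_stack([Z[half:], np.ones(len(Z) - half)])
    preds = (test_design @ w) > 0.5
    return np.mean(preds == labels[half:].astype(bool))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
Y_same = rng.normal(size=(1000, 3))
Y_shift = rng.normal(size=(1000, 3)) + np.array([2.0, 0.0, 0.0])
acc_null = regression_two_sample_accuracy(X, Y_same, rng)
acc_shift = regression_two_sample_accuracy(X, Y_shift, rng)
```

Swapping in a richer regression model is what lets the framework handle mixed variable types, and the fitted regression function itself indicates *where* the two distributions differ, which is the local-testing angle the abstract emphasizes.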
Fast Two-Sample Testing with Analytic Representations of Probability Measures
We propose a class of nonparametric two-sample tests with a cost linear in
the sample size. Two tests are given, both based on an ensemble of distances
between analytic functions representing each of the distributions. The first
test uses smoothed empirical characteristic functions to represent the
distributions, the second uses distribution embeddings in a reproducing kernel
Hilbert space. Analyticity implies that differences in the distributions may be
detected almost surely at a finite number of randomly chosen
locations/frequencies. The new tests are consistent against a larger class of
alternatives than the previous linear-time tests based on the (non-smoothed)
empirical characteristic functions, while being much faster than the current
state-of-the-art quadratic-time kernel-based or energy distance-based tests.
Experiments on artificial benchmarks and on challenging real-world testing
problems demonstrate that our tests give a better power/time tradeoff than
competing approaches, and in some cases, better outright power than even the
most expensive quadratic-time tests. This performance advantage is retained
even in high dimensions, and in cases where the difference in distributions is
not observable with low-order statistics.
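The second test's core idea — comparing kernel mean embeddings at a handful of randomly chosen locations — can be sketched as below. The Hotelling-type normalization, the location distribution, and the bandwidth are our assumptions for illustration, not the paper's exact statistic.

```python
import numpy as np

def mean_embedding_stat(X, Y, locations, bandwidth=1.0):
    # Evaluate the two empirical kernel mean embeddings at J random
    # locations and form a Hotelling-type statistic on the differences;
    # under H0 it is approximately chi-squared with J degrees of freedom.
    # Cost is linear in the sample size for fixed J.
    def features(Z):
        sq_dists = ((Z[:, None, :] - locations[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    D = features(X) - features(Y)              # assumes len(X) == len(Y)
    n, J = D.shape
    mean = D.mean(axis=0)
    cov = np.cov(D, rowvar=False) + 1e-8 * np.eye(J)  # ridge for stability
    return n * mean @ np.linalg.solve(cov, mean)

rng = np.random.default_rng(2)
d, J = 5, 3
locations = rng.normal(size=(J, d))            # random test locations
X = rng.normal(size=(2000, d))
Y_same = rng.normal(size=(2000, d))
Y_shift = rng.normal(loc=0.5, size=(2000, d))
stat_null = mean_embedding_stat(X, Y_same, locations)
stat_shift = mean_embedding_stat(X, Y_shift, locations)
```

Analyticity of the Gaussian-kernel embedding is what justifies evaluating at only finitely many random locations: two distinct analytic functions differ almost surely at a randomly drawn point.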
Generalized spectral tests for the martingale difference hypothesis
This article proposes a test for the Martingale Difference Hypothesis (MDH) using dependence measures related to the characteristic function. The MDH typically has been tested using the sample autocorrelations or in the spectral domain using the periodogram. Tests based on these statistics are inconsistent against uncorrelated non-martingale processes. Here, we generalize the spectral test of Durlauf (1991) for testing the MDH, taking into account linear and nonlinear dependence. Our test considers dependence at all lags and is consistent against general pairwise nonparametric Pitman's local alternatives converging at the parametric rate n^(-1/2), with n the sample size. Furthermore, with our methodology there is no need to choose a lag order, to smooth the data or to formulate a parametric alternative. Our approach can be easily extended to specification testing of the conditional mean of possibly nonlinear models. The asymptotic null distribution of our test depends on the data generating process, so a bootstrap procedure is proposed and theoretically justified. Our bootstrap test is robust to higher order dependence, in particular to conditional heteroskedasticity. A Monte Carlo study examines the finite sample performance of our test and shows that it is more powerful than some competing tests. Finally, an application to the S&P 500 stock index and exchange rates highlights the merits of our approach.
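The abstract's two ingredients — a cumulative statistic over all lags and a bootstrap that remains valid under conditional heteroskedasticity — can be illustrated with a much-simplified toy version. The 1/j lag weighting, the Rademacher multiplier scheme, and every name below are our simplifications for illustration, not the paper's generalized spectral construction.

```python
import numpy as np

def mdh_stat(x, max_lag):
    # Cramer-von Mises-type statistic from weighted sample autocorrelations
    # (1/j weights mimic a spectral-distribution accumulation over lags) --
    # a toy stand-in, not the characteristic-function-based statistic.
    n = len(x)
    xc = x - x.mean()
    var = np.mean(xc ** 2)
    rhos = np.array([np.sum(xc[j:] * xc[:n - j]) / (n * var)
                     for j in range(1, max_lag + 1)])
    weights = 1.0 / np.arange(1, max_lag + 1)
    return n * np.sum((weights * rhos) ** 2)

def wild_bootstrap_pvalue(x, max_lag=20, n_boot=200, seed=0):
    # Rademacher multipliers destroy serial dependence in the mean while
    # preserving the marginal scale, so the resamples mimic the null even
    # under conditional heteroskedasticity.
    rng = np.random.default_rng(seed)
    observed = mdh_stat(x, max_lag)
    boot = np.array([mdh_stat(x * rng.choice([-1.0, 1.0], size=len(x)), max_lag)
                     for _ in range(n_boot)])
    return (1 + np.sum(boot >= observed)) / (1 + n_boot)

rng = np.random.default_rng(3)
iid = rng.normal(size=500)                 # a martingale difference sequence
ar = np.empty(500)                         # AR(1): a clear non-martingale
ar[0] = rng.normal()
for t in range(1, 500):
    ar[t] = 0.5 * ar[t - 1] + rng.normal()
p_iid = wild_bootstrap_pvalue(iid)
p_ar = wild_bootstrap_pvalue(ar)
```

The key point the sketch shares with the paper is resampling the null distribution rather than relying on a fixed asymptotic law, since the limit depends on the data generating process.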