
    On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

    Nonparametric two-sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests: those which are consistent without any assumptions about how the distributions may differ (general alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (mean-shift alternatives). The main contribution of this paper is to explicitly characterize the power of a popular nonparametric two-sample test, designed for general alternatives, under a mean-shift alternative in the high-dimensional setting. Specifically, we explicitly derive the power of the linear-time Maximum Mean Discrepancy statistic using the Gaussian kernel, where the dimension and sample size can both tend to infinity at any rate, and the two distributions differ in their means. As a corollary, we find that if the signal-to-noise ratio is held constant, then the test's power goes to one if the number of samples increases faster than the dimension increases. This is the first explicit power derivation for a general nonparametric test in the high-dimensional setting, and also the first analysis of how tests designed for general alternatives perform when faced with easier ones. (25 pages, 5 figures.)
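
    A minimal sketch of the linear-time MMD statistic with a Gaussian kernel is given below. The bandwidth choice, the normal-approximation p-value, and the mean-shift example at the end are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np
from math import erf, sqrt

def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between corresponding rows of a and b."""
    sq_dists = np.sum((a - b) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def linear_time_mmd(X, Y, bandwidth=1.0):
    """Linear-time MMD^2 estimate with a normal-approximation p-value.

    X, Y: (n, d) arrays of samples from the two distributions.
    Samples are consumed in non-overlapping pairs, so the cost is O(n d).
    """
    n = (min(len(X), len(Y)) // 2) * 2        # use an even number of samples
    x1, x2 = X[0:n:2], X[1:n:2]
    y1, y2 = Y[0:n:2], Y[1:n:2]

    # h_i = k(x, x') + k(y, y') - k(x, y') - k(x', y), one term per pair
    h = (gaussian_kernel(x1, x2, bandwidth)
         + gaussian_kernel(y1, y2, bandwidth)
         - gaussian_kernel(x1, y2, bandwidth)
         - gaussian_kernel(x2, y1, bandwidth))

    mmd2 = h.mean()
    # Under H0 the studentized statistic is approximately standard normal.
    se = h.std(ddof=1) / np.sqrt(len(h))
    z = mmd2 / se
    p_value = 0.5 * (1.0 - erf(z / sqrt(2.0)))  # one-sided normal tail
    return mmd2, p_value

# Mean-shift alternative in d = 100 dimensions (illustrative only).
rng = np.random.default_rng(0)
d, n = 100, 2000
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d))
Y[:, 0] += 0.5                                # shift the mean of one coordinate
print(linear_time_mmd(X, Y, bandwidth=np.sqrt(d)))  # rough bandwidth choice
```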

    Global and Local Two-Sample Tests via Regression

    Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.
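
    The paper's own regression framework is not reproduced here; the sketch below shows the general flavor of regression-based two-sample testing, with logistic regression as the assumed regression model and a permutation test as the assumed calibration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def regression_two_sample_test(X, Y, n_permutations=200, random_state=0):
    """Generic regression/classification two-sample test, permutation calibrated.

    Regress the group label on the covariates; if the two distributions are
    equal, held-out accuracy should be no better than chance.
    """
    rng = np.random.default_rng(random_state)
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X)), np.ones(len(Y))]

    def heldout_accuracy(y):
        Z_tr, Z_te, y_tr, y_te = train_test_split(
            Z, y, test_size=0.5, random_state=random_state, stratify=y)
        clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
        return clf.score(Z_te, y_te)

    observed = heldout_accuracy(labels)
    # Permute the labels to approximate the null distribution of the accuracy.
    perm_stats = np.array([heldout_accuracy(rng.permutation(labels))
                           for _ in range(n_permutations)])
    p_value = (1 + np.sum(perm_stats >= observed)) / (1 + n_permutations)
    return observed, p_value
```

    One way to get a local analogue, roughly in the spirit of the paper's local framework, is to flag regions of the covariate space where the fitted regression function deviates markedly from the pooled label proportion.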

    Fast Two-Sample Testing with Analytic Representations of Probability Measures

    We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low-order statistics.
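
    As a rough illustration of the second construction (distribution embeddings in an RKHS compared at a finite set of locations), the sketch below draws the test locations at random and uses a chi-squared calibration; the smoothed characteristic-function variant and any optimisation of the locations are not shown, and these simplifications are assumptions of the sketch rather than the paper's procedure.

```python
import numpy as np
from scipy.stats import chi2

def mean_embedding_test(X, Y, num_locations=5, bandwidth=1.0, random_state=0):
    """Two-sample test comparing kernel mean embeddings at J random locations.

    Cost is O(n * J) for J test locations. Under H0 the statistic is
    approximately chi-squared with J degrees of freedom.
    """
    rng = np.random.default_rng(random_state)
    n = min(len(X), len(Y))
    X, Y = X[:n], Y[:n]
    d = X.shape[1]

    # Random test locations drawn from a standard normal (an assumption made
    # for brevity; locations could also be chosen to maximise power).
    T = rng.normal(size=(num_locations, d))

    def features(Z):
        # Gaussian-kernel evaluations k(z, t), one column per test location.
        sq = ((Z[:, None, :] - T[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))

    D = features(X) - features(Y)               # (n, J) feature differences
    mean_d = D.mean(axis=0)
    cov_d = np.cov(D, rowvar=False) + 1e-8 * np.eye(num_locations)
    stat = n * mean_d @ np.linalg.solve(cov_d, mean_d)
    p_value = chi2.sf(stat, df=num_locations)
    return stat, p_value
```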

    Generalized spectral tests for the martingale difference hypothesis

    This article proposes a test for the Martingale Difference Hypothesis (MDH) using dependence measures related to the characteristic function. The MDH typically has been tested using the sample autocorrelations or in the spectral domain using the periodogram. Tests based on these statistics are inconsistent against uncorrelated non-martingale processes. Here, we generalize the spectral test of Durlauf (1991) for testing the MDH, taking into account linear and nonlinear dependence. Our test considers dependence at all lags and is consistent against general pairwise nonparametric Pitman's local alternatives converging at the parametric rate n^(-1/2), with n the sample size. Furthermore, with our methodology there is no need to choose a lag order, to smooth the data, or to formulate a parametric alternative. Our approach can be easily extended to specification testing of the conditional mean of possibly nonlinear models. The asymptotic null distribution of our test depends on the data-generating process, so a bootstrap procedure is proposed and theoretically justified. Our bootstrap test is robust to higher-order dependence, in particular to conditional heteroskedasticity. A Monte Carlo study examines the finite-sample performance of our test and shows that it is more powerful than some competing tests. Finally, an application to the S&P 500 stock index and exchange rates highlights the merits of our approach.
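
    A minimal sketch of a characteristic-function-based MDH test in this spirit is shown below. The Monte Carlo integration grid, the 1/(j*pi)^2 lag weights, the lag truncation, and the simplified wild-bootstrap calibration are assumptions made for brevity rather than the authors' exact construction; the quadratic cost per evaluation also makes this illustrative rather than practical for long series.

```python
import numpy as np

def mdh_spectral_statistic(y, x_grid, weights=None, max_lag=None):
    """CvM-type statistic built from characteristic-function dependence measures.

    For lag j, gamma_j(x) = (1/(n-j)) * sum_t w_t * (y_t - ybar) * exp(i*x*y_{t-j}),
    and |gamma_j(x)|^2 is aggregated over all lags j and a grid of x values,
    with a 1/(j*pi)^2 lag weighting.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    ybar = y.mean()
    max_lag = max_lag if max_lag is not None else n - 1
    stat = 0.0
    for j in range(1, max_lag + 1):
        e = w[j:] * (y[j:] - ybar)                     # (possibly perturbed) innovations
        phase = np.exp(1j * np.outer(y[:-j], x_grid))  # exp(i * x * y_{t-j})
        gamma_j = e @ phase / (n - j)                  # one value per grid point
        stat += (n - j) * (1.0 / (j * np.pi)) ** 2 * np.mean(np.abs(gamma_j) ** 2)
    return stat

def mdh_test(y, n_boot=200, n_grid=50, random_state=0):
    """Wild-bootstrap p-value for the martingale difference hypothesis."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y, dtype=float)
    x_grid = rng.normal(size=n_grid)                   # Monte Carlo integration grid
    observed = mdh_spectral_statistic(y, x_grid)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Wild bootstrap: perturb the centred "innovation" part with random
        # signs while keeping the conditioning values y_{t-j} fixed.
        w = rng.choice([-1.0, 1.0], size=len(y))
        boot[b] = mdh_spectral_statistic(y, x_grid, weights=w)
    p_value = (1 + np.sum(boot >= observed)) / (1 + n_boot)
    return observed, p_value
```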
