    Global and Local Two-Sample Tests via Regression

    Two-sample testing is a fundamental problem in statistics. Despite its long history, the problem has attracted renewed interest with the advent of high-dimensional and complex data; in the machine learning literature in particular, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests, which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically under several well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.
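    The classification accuracy tests mentioned above can be sketched in a few lines: label the two samples 0 and 1, fit any classifier on a training split, and check whether its held-out accuracy exceeds the chance level of 1/2. The sketch below is illustrative only (the function name and the choice of a 1-nearest-neighbour classifier are assumptions, not the paper's regression framework):

    ```python
    import numpy as np

    def classifier_two_sample_test(X, Y, seed=0):
        """Classification-accuracy two-sample test (illustrative sketch).

        Pools the samples with group labels, fits a 1-nearest-neighbour
        classifier on half the data, and compares held-out accuracy to the
        chance level 0.5 via a normal approximation.
        """
        rng = np.random.default_rng(seed)
        Z = np.vstack([X, Y])
        labels = np.r_[np.zeros(len(X)), np.ones(len(Y))]
        perm = rng.permutation(len(Z))
        Z, labels = Z[perm], labels[perm]
        n_train = len(Z) // 2
        Ztr, ytr = Z[:n_train], labels[:n_train]
        Zte, yte = Z[n_train:], labels[n_train:]
        # 1-NN prediction: copy the label of the closest training point
        d = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(-1)
        pred = ytr[d.argmin(axis=1)]
        acc = (pred == yte).mean()
        n_test = len(yte)
        # Under H0, held-out accuracy is roughly Binomial(n_test, 1/2)/n_test
        z = (acc - 0.5) / np.sqrt(0.25 / n_test)
        return acc, z
    ```

    A large z-score indicates the classifier can distinguish the samples, i.e. evidence against equality of the two distributions.
    
    
    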

    Remember the Curse of Dimensionality: The Case of Goodness-of-Fit Testing in Arbitrary Dimension

    Despite a substantial literature on nonparametric two-sample goodness-of-fit testing in arbitrary dimension spanning decades, there is no mention there of any curse of dimensionality. Only recently have Ramdas et al. (2015) discussed this issue in the context of kernel methods, by showing that their performance degrades with the dimension even when the underlying distributions are isotropic Gaussians. We take a minimax perspective and follow in the footsteps of Ingster (1987) to derive the minimax rate in arbitrary dimension when the discrepancy is measured in the L2 metric. That rate is revealed to be nonparametric and exhibits a prototypical curse of dimensionality. We further extend Ingster's work to show that the chi-squared test achieves the minimax rate. Moreover, we show that the test can be made to work when the distributions have support of low intrinsic dimension. Finally, inspired by Ingster (2000), we consider a multiscale version of the chi-squared test which can adapt to unknown smoothness and/or unknown intrinsic dimensionality without much loss in power.
    Comment: This version comes after the publication of the paper in the Journal of Nonparametric Statistics. The main change is to cite the work of Ramdas et al. Some very minor typos were also corrected.
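    The chi-squared test at the heart of this abstract can be illustrated by binning the sample space and comparing observed cell counts with the counts expected under the null. The sketch below tests a sample against the uniform distribution on [0,1]^d; the function name, the fixed binning, and the normal approximation to the chi-squared null are assumptions for illustration, not the paper's minimax-optimal construction:

    ```python
    import numpy as np

    def binned_chi_squared_gof(X, bins_per_dim=4):
        """Chi-squared goodness-of-fit test against Uniform([0,1]^d) via binning.

        Partitions [0,1]^d into a regular grid, counts points per cell, and
        forms the classical chi-squared statistic against equal expected mass.
        """
        n, d = X.shape
        k = bins_per_dim ** d                       # total number of cells
        # Map each point to its cell's multi-index, then to a flat index
        cell = np.clip((X * bins_per_dim).astype(int), 0, bins_per_dim - 1)
        flat = np.ravel_multi_index(cell.T, (bins_per_dim,) * d)
        counts = np.bincount(flat, minlength=k)
        expected = n / k                            # uniform null: equal mass
        T = ((counts - expected) ** 2 / expected).sum()
        # Normal approximation to chi2 with k-1 degrees of freedom
        z = (T - (k - 1)) / np.sqrt(2 * (k - 1))
        return T, z
    ```

    The curse of dimensionality is visible here directly: with bins_per_dim fixed, the number of cells k grows exponentially in d, so each cell receives exponentially fewer points.
    
    
    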

    Generalized Kernel Two-Sample Tests

    Kernel two-sample tests have been widely used for testing equality of distributions of multivariate data. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space are mainly targeted at specific alternatives and do not work well in some scenarios when the dimension of the data is moderate to high, due to the curse of dimensionality. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power with low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared to other state-of-the-art tests under various settings and show good performance. The new approaches are illustrated on two applications: the comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips starting from John F. Kennedy airport in consecutive months. All proposed methods are implemented in an R package, kerTests.
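    For context, the classical kernel two-sample test that this work builds on is the maximum mean discrepancy (MMD) with a permutation null. The sketch below shows the standard unbiased MMD statistic with a Gaussian kernel; it is a generic baseline, not the new statistics of this paper or the kerTests implementation, and the function names and default bandwidth are assumptions:

    ```python
    import numpy as np

    def mmd2_unbiased(X, Y, bandwidth):
        """Unbiased estimate of squared MMD with a Gaussian kernel."""
        def gram(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bandwidth ** 2))
        Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
        n, m = len(X), len(Y)
        # Drop diagonal terms for the unbiased within-sample averages
        np.fill_diagonal(Kxx, 0.0)
        np.fill_diagonal(Kyy, 0.0)
        return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
                - 2 * Kxy.mean())

    def mmd_permutation_test(X, Y, bandwidth=1.0, n_perm=200, seed=0):
        """Permutation p-value for the MMD two-sample statistic."""
        rng = np.random.default_rng(seed)
        observed = mmd2_unbiased(X, Y, bandwidth)
        Z = np.vstack([X, Y])
        n = len(X)
        null = []
        for _ in range(n_perm):
            perm = rng.permutation(len(Z))
            null.append(mmd2_unbiased(Z[perm[:n]], Z[perm[n:]], bandwidth))
        # Add-one correction keeps the p-value valid under permutation
        p = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
        return observed, p
    ```

    The permutation loop is the main computational cost, which is why the abstract emphasizes alternative procedures with low computational cost for large datasets.
    
    
    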

    Adaptive Sampling in Particle Image Velocimetry
