3,255 research outputs found

    Global and Local Two-Sample Tests via Regression

    Full text link
    Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness

    Two sample tests for high-dimensional covariance matrices

    Get PDF
    We propose two tests for the equality of covariance matrices between two high-dimensional populations. One test is on the whole variance--covariance matrices, and the other is on off-diagonal sub-matrices, which define the covariance between two nonoverlapping segments of the high-dimensional random vectors. The tests are applicable (i) when the data dimension is much larger than the sample sizes, namely the "large pp, small nn" situations and (ii) without assuming parametric distributions for the two populations. These two aspects surpass the capability of the conventional likelihood ratio test. The proposed tests can be used to test on covariances associated with gene ontology terms.Comment: Published in at http://dx.doi.org/10.1214/12-AOS993 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Generalized Kernel Two-Sample Tests

    Full text link
    Kernel two-sample tests have been widely used for multivariate data in testing equal distribution. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space are mainly targeted at specific alternatives and do not work well for some scenarios when the dimension of the data is moderate to high due to the curse of dimensionality. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power with low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared to other state-of-the-art tests under various settings and show good performance. The new approaches are illustrated on two applications: The comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips started from John F.Kennedy airport in consecutive months. All proposed methods are implemented in an R package kerTests
    • …
    corecore