Tests based on U-statistics
There are many test statistics that are based on U-statistics. This thesis deals with tests based on U-statistics for the general two-sample problem. After describing two-sample tests based on U-statistics from a general viewpoint, a presentation of some particular test statistics follows. All the test statistics considered are described in connection with the theory of U-statistics, with emphasis on their asymptotic properties. Special attention is given to tests based on the difference between two one-sample U-statistics and to tests based on empirical characteristic functions. Test statistics based on the difference of two empirical characteristic functions have the form of a V-statistic. A related U-statistic is derived and its properties are studied. An example of applying the bootstrap method to these test statistics is included. (Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics)
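As a concrete illustration (not taken from the thesis itself), the Mann-Whitney statistic is perhaps the best-known two-sample U-statistic: it averages the kernel h(x, y) = 1{x < y} over all pairs drawn across the two samples. A minimal sketch, with function names of my own choosing:

```python
import numpy as np

def two_sample_u(x, y, h):
    """Two-sample U-statistic: U = (1 / (m*n)) * sum_{i,j} h(x_i, y_j)."""
    m, n = len(x), len(y)
    return sum(h(xi, yj) for xi in x for yj in y) / (m * n)

# Mann-Whitney kernel: h(x, y) = 1 if x < y, else 0
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.5, 4.0])
u = two_sample_u(x, y, lambda a, b: float(a < b))  # 5 of 6 pairs have x < y
```

Under the null hypothesis of equal distributions, U concentrates around 1/2; values far from 1/2 indicate a stochastic ordering between the samples.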
Fast Two-Sample Testing with Analytic Representations of Probability Measures
We propose a class of nonparametric two-sample tests with a cost linear in
the sample size. Two tests are given, both based on an ensemble of distances
between analytic functions representing each of the distributions. The first
test uses smoothed empirical characteristic functions to represent the
distributions, the second uses distribution embeddings in a reproducing kernel
Hilbert space. Analyticity implies that differences in the distributions may be
detected almost surely at a finite number of randomly chosen
locations/frequencies. The new tests are consistent against a larger class of
alternatives than the previous linear-time tests based on the (non-smoothed)
empirical characteristic functions, while being much faster than the current
state-of-the-art quadratic-time kernel-based or energy distance-based tests.
Experiments on artificial benchmarks and on challenging real-world testing
problems demonstrate that our tests give a better power/time tradeoff than
competing approaches, and in some cases, better outright power than even the
most expensive quadratic-time tests. This performance advantage is retained
even in high dimensions, and in cases where the difference in distributions is
not observable with low-order statistics.
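The random-frequency idea above can be sketched as follows: evaluate both empirical characteristic functions at a few random frequencies and form a normalized quadratic statistic from the differences. This is a simplified sketch of the general approach, assuming Gaussian-drawn frequencies; the paper's exact smoothing and normalization differ, and the function names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def ecf_features(x, freqs):
    # Empirical characteristic function of x at each frequency t:
    # phi_hat(t) = mean_j exp(i t . x_j); return real and imaginary parts.
    proj = x @ freqs.T                                # (n, J)
    return np.hstack([np.cos(proj), np.sin(proj)])    # (n, 2J)

def ecf_test_statistic(x, y, num_freqs=3):
    d = x.shape[1]
    T = rng.standard_normal((num_freqs, d))           # random test frequencies
    zx, zy = ecf_features(x, T), ecf_features(y, T)
    diff = zx.mean(0) - zy.mean(0)
    n = min(len(x), len(y))
    # covariance of paired feature differences (a simplified
    # Hotelling T^2-style normalization), with a small ridge
    S = np.cov(zx[:n] - zy[:n], rowvar=False) + 1e-8 * np.eye(2 * num_freqs)
    return n * diff @ np.linalg.solve(S, diff)
```

The cost is linear in the sample size, since only J fixed frequencies are evaluated; under the null the statistic is approximately chi-squared with 2J degrees of freedom.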
Testing and Learning on Distributions with Symmetric Noise Invariance
Kernel embeddings of distributions and the Maximum Mean Discrepancy (MMD),
the resulting distance between distributions, are useful tools for fully
nonparametric two-sample testing and learning on distributions. However, it is
rarely that all possible differences between samples are of interest --
discovered differences can be due to different types of measurement noise, data
collection artefacts or other irrelevant sources of variability. We propose
distances between distributions which encode invariance to additive symmetric
noise, aimed at testing whether the assumed true underlying processes differ.
Moreover, we construct invariant features of distributions, leading to learning
algorithms robust to the impairment of the input distributions with symmetric
additive noise.
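For reference, the MMD that these invariant distances build on has a standard unbiased quadratic-time estimator. A self-contained sketch with a Gaussian kernel (the bandwidth choice here is purely illustrative):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel matrix k(a_i, b_j) = exp(-|a_i - b_j|^2 / (2 sigma^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of squared MMD between samples x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    # exclude diagonal terms so the within-sample averages are unbiased
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()
```

The estimate is near zero when the samples come from the same distribution (and may dip slightly negative, being unbiased) and clearly positive when they differ.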
Interpretable Distribution Features with Maximum Testing Power
Two semimetrics on probability distributions are proposed, given as the sum
of differences of expectations of analytic functions evaluated at spatial or
frequency locations (i.e., features). The features are chosen so as to maximize
the distinguishability of the distributions, by optimizing a lower bound on
test power for a statistical test using these features. The result is a
parsimonious and interpretable indication of how and where two distributions
differ locally. An empirical estimate of the test power criterion converges
with increasing sample size, ensuring the quality of the returned features. In
real-world benchmarks on high-dimensional text and image data, linear-time
tests using the proposed semimetrics achieve comparable performance to the
state-of-the-art quadratic-time maximum mean discrepancy test, while returning
human-interpretable features that explain the test results.
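The interpretability claim can be illustrated with the kernel mean-embedding witness function: the difference of the two embeddings evaluated at a spatial location v, which is large exactly where the distributions differ. The paper optimizes locations against a test-power criterion; the simpler sketch below just scans a grid, and the names and bandwidth are my own assumptions:

```python
import numpy as np

def witness(x, y, v, sigma=1.0):
    """Difference of Gaussian-kernel mean embeddings at location v.
    A large |witness(v)| indicates the two samples differ near v."""
    kx = np.exp(-((x - v) ** 2).sum(-1) / (2 * sigma ** 2)).mean()
    ky = np.exp(-((y - v) ** 2).sum(-1) / (2 * sigma ** 2)).mean()
    return kx - ky

# pick the candidate location where the two samples differ most
rng = np.random.default_rng(1)
x = rng.standard_normal((300, 1))          # N(0, 1)
y = rng.standard_normal((300, 1)) + 2.0    # N(2, 1)
candidates = np.linspace(-4.0, 6.0, 101)[:, None]
best = candidates[np.argmax([abs(witness(x, y, v)) for v in candidates])]
```

For the mean-shifted example above, the selected location lands between the two modes, giving a direct, human-readable answer to "where do the distributions differ?".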
A nonparametric two-sample hypothesis testing problem for random dot product graphs
We consider the problem of testing whether two finite-dimensional random dot
product graphs have generating latent positions that are independently drawn
from the same distribution, or distributions that are related via scaling or
projection. We propose a test statistic that is a kernel-based function of the
adjacency spectral embedding for each graph. We obtain a limiting distribution
for our test statistic under the null and we show that our test procedure is
consistent across a broad range of alternatives.
Generalized spectral tests for the martingale difference hypothesis
This article proposes a test for the Martingale Difference Hypothesis (MDH) using dependence measures related to the characteristic function. The MDH has typically been tested using the sample autocorrelations or, in the spectral domain, using the periodogram. Tests based on these statistics are inconsistent against uncorrelated non-martingale processes. Here, we generalize the spectral test of Durlauf (1991) for testing the MDH, taking into account both linear and nonlinear dependence. Our test considers dependence at all lags and is consistent against general pairwise nonparametric Pitman local alternatives converging at the parametric rate n^(-1/2), with n the sample size. Furthermore, with our methodology there is no need to choose a lag order, to smooth the data, or to formulate a parametric alternative. Our approach can easily be extended to specification testing of the conditional mean of possibly nonlinear models. The asymptotic null distribution of our test depends on the data generating process, so a bootstrap procedure is proposed and theoretically justified. Our bootstrap test is robust to higher-order dependence, in particular to conditional heteroskedasticity. A Monte Carlo study examines the finite-sample performance of our test and shows that it is more powerful than some competing tests. Finally, an application to the S&P 500 stock index and exchange rates highlights the merits of our approach.
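The classic autocorrelation route that this abstract contrasts against can be sketched with a Box-Pierce-style statistic. This is the baseline approach, not the paper's characteristic-function test; it illustrates why correlation-only checks are inconsistent against uncorrelated non-martingale processes, since they see only second moments:

```python
import numpy as np

def box_pierce(x, max_lag=10):
    """Box-Pierce-style check of the MDH: Q = n * sum_{k=1}^{K} rho_k^2.
    Approximately chi-squared(K) under iid data; it detects only linear
    (autocorrelation) dependence, missing nonlinear martingale violations."""
    n = len(x)
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    q = 0.0
    for k in range(1, max_lag + 1):
        rho = (xc[k:] * xc[:-k]).sum() / denom   # lag-k sample autocorrelation
        q += rho ** 2
    return n * q
```

An uncorrelated but dependent process (e.g. a GARCH sequence) passes this check while violating the MDH, which is exactly the gap the characteristic-function-based generalization closes.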