The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing
Distance-based tests, also called "energy statistics", are leading methods
for two-sample and independence tests from the statistics community.
Kernel-based tests, developed from "kernel mean embeddings", are leading
methods for two-sample and independence tests from the machine learning
community. A fixed-point transformation was previously proposed to connect
distance methods and kernel methods at the level of population statistics. In
this paper, we propose a new bijective transformation between metrics and
kernels.
It simplifies the fixed-point transformation, inherits similar theoretical
properties, makes distance methods exactly equivalent to kernel methods for
both the sample statistics and the resulting p-values, and better preserves
the data structure under transformation. Our results further advance the
understanding of distance- and kernel-based tests, streamline the code base
for implementing them, and enable the rich literatures on distance-based and
kernel-based methodologies to communicate directly with each other.
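The population-level identity behind such an equivalence is classical: a metric of negative type induces a positive semi-definite kernel via k(x, y) = (d(x, o) + d(y, o) - d(x, y))/2 for an arbitrary anchor point o, and the sample energy distance then equals twice the biased squared MMD computed with that induced kernel. The Python sketch below numerically checks this background identity; it illustrates the classical metric-to-kernel correspondence, not the new bijective transformation proposed in the paper, and all names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(30, 2))  # sample from P
Y = rng.normal(0.5, 1.0, size=(40, 2))  # sample from Q

def dist(A, B):
    """Pairwise Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def energy_distance(X, Y):
    """Biased (V-statistic) sample energy distance."""
    return 2 * dist(X, Y).mean() - dist(X, X).mean() - dist(Y, Y).mean()

def induced_kernel(A, B, o):
    """Kernel induced by the metric: k(x, y) = (d(x,o) + d(y,o) - d(x,y)) / 2."""
    return 0.5 * (dist(A, o[None, :]) + dist(B, o[None, :]).T - dist(A, B))

def mmd2(X, Y, o):
    """Biased (V-statistic) squared MMD computed with the induced kernel."""
    return (induced_kernel(X, X, o).mean()
            + induced_kernel(Y, Y, o).mean()
            - 2 * induced_kernel(X, Y, o).mean())

o = np.zeros(2)  # arbitrary anchor point; the identity holds for any choice
print(energy_distance(X, Y), 2 * mmd2(X, Y, o))  # equal up to float error
```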
Training Support Vector Machines Using Frank-Wolfe Optimization Methods
Training a Support Vector Machine (SVM) requires the solution of a quadratic
programming problem (QP) whose computational cost becomes prohibitive for
large-scale datasets. Traditional optimization methods cannot be
directly applied in these cases, mainly due to memory restrictions.
By adopting a slightly different objective function and under mild conditions
on the kernel used within the model, efficient algorithms to train SVMs have
been devised under the name of Core Vector Machines (CVMs). This framework
exploits the equivalence of the resulting learning problem with the task of
solving a Minimal Enclosing Ball (MEB) problem in a feature space into which
the data are implicitly embedded by a kernel function.
In this paper, we improve on the CVM approach by proposing two novel methods
to build SVMs based on the Frank-Wolfe algorithm, recently revisited as a fast
method to approximate the solution of an MEB problem. In contrast to CVMs,
our algorithms do not require solving a sequence of increasingly complex QPs
and are defined using only analytic optimization steps. Experiments on a
large collection of datasets show that our methods scale better than CVMs in
most cases, sometimes at the price of a slightly lower accuracy. Like CVMs,
the proposed methods can be easily extended to machine learning problems
other than binary classification. However, effective classifiers are also
obtained with kernels that do not satisfy the condition required by CVMs, so
our methods can be applied to a wider set of problems.
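For background on how Frank-Wolfe applies here: the MEB dual is the concave quadratic program max_a diag(K)^T a - a^T K a over the probability simplex, and Frank-Wolfe optimizes it with purely analytic steps, moving toward the best simplex vertex with an exact line search. The sketch below is a minimal illustration of this textbook scheme under those assumptions, not the two specific algorithms proposed in the paper; all names are hypothetical.

```python
import numpy as np

def frank_wolfe_meb(K, n_iters=500, tol=1e-10):
    """Frank-Wolfe on the MEB dual: maximize diag(K) @ a - a @ K @ a
    over the probability simplex. K is a kernel (Gram) matrix; the
    returned weights define the ball center sum_i a_i phi(x_i)."""
    n = K.shape[0]
    d = np.diag(K)
    a = np.full(n, 1.0 / n)  # feasible starting point: uniform weights
    Ka = K @ a               # maintain K @ a incrementally
    for _ in range(n_iters):
        grad = d - 2.0 * Ka                 # gradient of the concave objective
        j = int(np.argmax(grad))            # best vertex e_j of the simplex
        gap = grad[j] - grad @ a            # directional derivative at t = 0
        if gap <= tol:
            break                           # no ascent direction left
        curv = K[j, j] - 2.0 * Ka[j] + a @ Ka  # (e_j - a)^T K (e_j - a)
        t = 1.0 if curv <= tol else min(gap / (2.0 * curv), 1.0)  # line search
        a *= 1.0 - t
        a[j] += t
        Ka = (1.0 - t) * Ka + t * K[:, j]
    return a

# toy usage with an RBF kernel on random data (purely illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
a = frank_wolfe_meb(K)
print(a.sum(), (a > 1e-6).sum())  # weights sum to 1; few support points
```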
Continuous testing for Poisson process intensities: A new perspective on scanning statistics
We propose a novel continuous testing framework to test the intensities of
Poisson processes. This framework allows a rigorous definition of the complete
testing procedure, from an infinite number of hypotheses to joint error rates.
Our work extends traditional procedures based on scanning windows by
controlling the family-wise error rate and the false discovery rate in a
non-asymptotic manner and in a continuous way. The decision rule is based on a
p-value process that can be estimated by a Monte Carlo procedure. We also
propose new kernel-based test statistics. Our method is applied in
neuroscience and genomics through the standard test of homogeneity and the
two-sample test.
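To make the scanning-window baseline concrete, the sketch below implements a traditional discretized scan test of homogeneity with Monte Carlo calibration: conditional on the number of events, a homogeneous Poisson process places its points i.i.d. uniformly on [0, T], so the null distribution of the maximum window count can be simulated exactly, and calibrating against that maximum controls the family-wise error rate across windows. This is the classical discrete setup that a continuous framework extends; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def max_scan_stat(events, T, width, n_windows=200):
    """Maximum event count over sliding windows of fixed width on [0, T]."""
    events = np.sort(events)
    starts = np.linspace(0.0, T - width, n_windows)
    counts = (np.searchsorted(events, starts + width)
              - np.searchsorted(events, starts))
    return counts.max()

def scan_test(events, T, width, n_mc=999):
    """Monte Carlo p-value for homogeneity of a Poisson process on [0, T].
    Under H0, conditional on n = len(events), the points are i.i.d.
    Uniform(0, T), so the null scan statistic can be simulated exactly."""
    n = len(events)
    obs = max_scan_stat(events, T, width)
    null = np.array([max_scan_stat(rng.uniform(0, T, n), T, width)
                     for _ in range(n_mc)])
    return (1 + (null >= obs).sum()) / (n_mc + 1)

# toy usage: homogeneous background plus a burst of extra events near t = 5
T = 10.0
events = np.concatenate([rng.uniform(0, T, 100), rng.normal(5.0, 0.1, 25)])
events = events[(events >= 0) & (events <= T)]
print(scan_test(events, T, width=0.5))  # small p-value flags the burst
```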
Statistical modelling of summary values leads to accurate Approximate Bayesian Computations
Approximate Bayesian Computation (ABC) methods rely on asymptotic arguments,
implying that parameter inference can be systematically biased even when
sufficient statistics are available. We propose to construct the ABC
accept/reject step from decision-theoretic arguments on a suitable auxiliary
space. This framework, referred to as ABC*, fully specifies which test
statistics to use, how to combine them, how to set the tolerances, and how
long to simulate in order to obtain accuracy properties on the auxiliary
space. Akin to maximum-likelihood indirect inference, regularity conditions
establish when the ABC* approximation to the posterior density is accurate on
the original parameter space, in terms of both the Kullback-Leibler
divergence and the maximum a
posteriori point estimate. Fundamentally, escaping asymptotic arguments
requires knowledge of the distribution of test statistics, which we obtain
through modelling the distribution of summary values, i.e. data points at the
summary level. Synthetic examples and an application to time series data of
influenza A (H3N2) infections in the Netherlands illustrate ABC* in action.
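For contrast with the calibrated ABC* procedure, the sketch below shows the textbook ABC rejection step that such work refines: draw a parameter from the prior, simulate data, and accept when the simulated summaries fall within a fixed tolerance of the observed summaries. Unlike ABC*, it leaves the choice of summaries and tolerance ad hoc; the model, prior, and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# observed data (hypothetical): draws from a Normal with unknown mean
obs = rng.normal(1.5, 1.0, size=100)
obs_summary = np.array([obs.mean(), obs.std()])

def abc_rejection(n_draws=50_000, eps=0.1):
    """Plain ABC rejection: keep prior draws whose simulated summary
    values land within eps (Euclidean distance) of the observed ones."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)          # prior on the unknown mean
        sim = rng.normal(theta, 1.0, size=100)  # simulate from the model
        s = np.array([sim.mean(), sim.std()])
        if np.linalg.norm(s - obs_summary) < eps:
            accepted.append(theta)
    return np.array(accepted)

post = abc_rejection()
print(len(post), post.mean())  # crude approximate posterior, mean near 1.5
```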