5 research outputs found
Testing and Learning on Distributions with Symmetric Noise Invariance
Kernel embeddings of distributions and the Maximum Mean Discrepancy (MMD),
the resulting distance between distributions, are useful tools for fully
nonparametric two-sample testing and learning on distributions. However, it is
rarely the case that all possible differences between samples are of interest --
discovered differences can be due to different types of measurement noise, data
collection artefacts or other irrelevant sources of variability. We propose
distances between distributions which encode invariance to additive symmetric
noise, aimed at testing whether the assumed true underlying processes differ.
Moreover, we construct invariant features of distributions, leading to learning
algorithms robust to the impairment of the input distributions with symmetric
additive noise.
Comment: 22 pages
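The distances above build on the MMD; as a point of reference, a minimal sketch of the standard unbiased MMD^2 estimator underlying kernel two-sample tests is shown below (without the paper's noise-invariance construction). The Gaussian kernel, bandwidth, and data are illustrative choices, not taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 bw^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimator of squared MMD (within-sample diagonals removed)."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
# Near zero when both samples come from the same distribution...
same = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
# ...and clearly positive under a mean shift.
diff = mmd2_unbiased(rng.normal(size=(200, 2)),
                     rng.normal(2.0, 1.0, size=(200, 2)))
```

A test then rejects equality of distributions when the estimate exceeds a threshold calibrated, e.g., by permutation.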
Two-stage Sampled Learning Theory on Distributions
We focus on the distribution regression problem: regressing to a real-valued
response from a probability distribution. Although there exist a large number
of similarity measures between distributions, very little is known about their
generalization performance in specific learning tasks. Learning problems
formulated on distributions have an inherent two-stage sampled difficulty: in
practice only samples from sampled distributions are observable, and one has to
build an estimate on similarities computed between sets of points. To the best
of our knowledge, the only existing method with consistency guarantees for
distribution regression requires kernel density estimation as an intermediate
step (which suffers from slow convergence issues in high dimensions), and the
domain of the distributions to be compact Euclidean. In this paper, we provide
theoretical guarantees for a remarkably simple algorithmic alternative to solve
the distribution regression problem: embed the distributions to a reproducing
kernel Hilbert space, and learn a ridge regressor from the embeddings to the
outputs. Our main contribution is to prove the consistency of this technique in
the two-stage sampled setting under mild conditions (on separable, topological
domains endowed with kernels). For a given total number of observations, we
derive convergence rates as an explicit function of the problem difficulty. As
a special case, we answer a 15-year-old open question: we establish the
consistency of the classical set kernel [Haussler, 1999; Gartner et al., 2002]
in regression, and cover more recent kernels on distributions, including those
due to [Christmann and Steinwart, 2010].
Comment: v6: accepted at AISTATS-2015 for oral presentation; final version;
code: https://bitbucket.org/szzoli/ite/; extension to the misspecified and
vector-valued case: http://arxiv.org/abs/1411.206
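The "remarkably simple" pipeline the abstract describes can be sketched as follows: embed each sample set by its empirical mean embedding, whose pairwise inner products reduce to averages of kernel values between sets, then fit a (kernel) ridge regressor on those inner products. The toy task (regressing to the mean of a Gaussian from a bag of samples), the Gaussian kernel, and all parameter values are illustrative assumptions, not from the paper.

```python
import numpy as np

def rbf(X, Y, bw=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bw**2))

def mean_embedding_gram(bags):
    """G[i, j] = <mu_i, mu_j> = average of k(x, y) over x in bag i, y in bag j."""
    n = len(bags)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = rbf(bags[i], bags[j]).mean()
    return G

rng = np.random.default_rng(0)
# Two-stage sampling: first draw distributions (here, Gaussian means),
# then observe only finitely many points from each.
means = rng.uniform(-2, 2, size=50)
bags = [rng.normal(m, 1.0, size=(30, 1)) for m in means]
y = means  # regression target: the (unobserved) true mean

G = mean_embedding_gram(bags)
lam = 1e-2  # ridge regularization parameter (illustrative)
alpha = np.linalg.solve(G + lam * np.eye(len(bags)), y)

# Predict for a held-out bag drawn from a distribution with mean 1.5.
test_bag = rng.normal(1.5, 1.0, size=(30, 1))
k_test = np.array([rbf(test_bag, b).mean() for b in bags])
pred = k_test @ alpha
```

The consistency result in the paper concerns exactly this estimator as both the number of bags and the points per bag grow.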
Learning Theory for Distribution Regression
We focus on the distribution regression problem: regressing to vector-valued
outputs from probability measures. Many important machine learning and
statistical tasks fit into this framework, including multi-instance learning
and point estimation problems without analytical solution (such as
hyperparameter or entropy estimation). Despite the large number of available
heuristics in the literature, the inherent two-stage sampled nature of the
problem makes the theoretical analysis quite challenging, since in practice
only samples from sampled distributions are observable, and the estimates have
to rely on similarities computed between sets of points. To the best of our
knowledge, the only existing technique with consistency guarantees for
distribution regression requires kernel density estimation as an intermediate
step (which often performs poorly in practice), and the domain of the
distributions to be compact Euclidean. In this paper, we study a simple,
analytically computable, ridge regression-based alternative to distribution
regression, where we embed the distributions to a reproducing kernel Hilbert
space, and learn the regressor from the embeddings to the outputs. Our main
contribution is to prove that this scheme is consistent in the two-stage
sampled setup under mild conditions (on separable topological domains enriched
with kernels): we present an exact computational-statistical efficiency
trade-off analysis showing that our estimator is able to match the one-stage
sampled minimax optimal rate [Caponnetto and De Vito, 2007; Steinwart et al.,
2009]. This result answers a 17-year-old open question, establishing the
consistency of the classical set kernel [Haussler, 1999; Gaertner et al., 2002]
in regression. We also cover consistency for more recent kernels on
distributions, including those due to [Christmann and Steinwart, 2010].
Comment: Final version appeared at JMLR, with supplement. Code:
https://bitbucket.org/szzoli/ite/. arXiv admin note: text overlap with
arXiv:1402.175
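The link between the set kernel and mean embeddings can be seen directly with a kernel whose feature map is explicit: averaging the feature map over each set and taking a dot product gives exactly the average of pairwise kernel values. The sketch below checks this for the quadratic kernel k(x, y) = (x . y)^2, whose feature map is vec(x x^T); the data is illustrative.

```python
import numpy as np

def quad_kernel(x, y):
    return float(x @ y) ** 2

def phi(x):
    """Explicit feature map of the quadratic kernel: vec(x x^T)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # a set of 5 points in R^3
Y = rng.normal(size=(7, 3))   # a set of 7 points in R^3

# Set kernel: average of pairwise kernel evaluations between the sets.
set_kernel = np.mean([[quad_kernel(x, y) for y in Y] for x in X])

# Same value via empirical mean embeddings in the explicit feature space.
mu_X = np.mean([phi(x) for x in X], axis=0)
mu_Y = np.mean([phi(y) for y in Y], axis=0)
via_embeddings = float(mu_X @ mu_Y)
# The two agree up to floating-point error.
```

This identity is why regression with the set kernel is an instance of the mean-embedding ridge regression scheme whose consistency the paper establishes.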
Linear-Time Learning on Distributions with Approximate Kernel Embeddings
Many interesting machine learning problems are best posed by considering instances that are distributions, or sample sets drawn from distributions. Most previous work devoted to machine learning tasks with distributional inputs has done so through pairwise kernel evaluations between pdfs (or sample sets). While such an approach is fine for smaller datasets, the computation of an N × N Gram matrix is prohibitive in large datasets. Recent scalable estimators that work over pdfs have done so only with kernels that use Euclidean metrics, like the L2 distance. However, there are a myriad of other useful metrics available, such as total variation, Hellinger distance, and the Jensen-Shannon divergence. This work develops the first random features for pdfs whose dot product approximates kernels using these non-Euclidean metrics. These random features allow estimators to scale to large datasets by working in a primal space, without computing large Gram matrices. We provide an analysis of the approximation error in using our proposed random features, and show empirically the quality of our approximation both in estimating a Gram matrix and in solving learning tasks in real-world and synthetic data.
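The random-feature mechanism the abstract builds on is easiest to see in its best-known form, random Fourier features for the Gaussian kernel [Rahimi and Recht, 2007], where z(x) . z(y) approximates k(x, y) without ever forming the full Gram matrix. The paper's contribution is analogous features for non-Euclidean metrics such as total variation and Jensen-Shannon; the Gaussian sketch below only illustrates the mechanism, and its dimensions and bandwidth are illustrative.

```python
import numpy as np

def rff(X, n_features=2000, bandwidth=1.0, seed=0):
    """Random Fourier features: z(x) . z(y) ~= exp(-||x - y||^2 / (2 bw^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density, plus random phases.
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))

# Exact N x N Gram matrix (the object random features let us avoid at scale).
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_exact = np.exp(-sq / 2.0)

# Approximate Gram from a primal-space inner product of the features.
Z = rff(X)
K_approx = Z @ Z.T
max_err = np.abs(K_exact - K_approx).max()
```

Downstream estimators (e.g., ridge regression) then operate directly on Z, at cost linear in the number of instances.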