Non-I.I.D. Multi-Instance Dimensionality Reduction by Learning a Maximum Bag Margin Subspace
Multi-instance learning, like other machine learning tasks, suffers from the curse of dimensionality. Although dimensionality reduction methods have been investigated for many years, multi-instance dimensionality reduction methods remain unexplored. Moreover, most algorithms in the multi-instance framework treat the instances in each bag as independently and identically distributed (i.i.d.) samples, which fails to exploit the structural information conveyed by the instances in a bag. In this paper, we propose a multi-instance dimensionality reduction method that treats the instances in each bag as non-i.i.d. samples. We regard every bag as a whole entity and define a bag margin objective function. By maximizing the margin between positive and negative bags, we learn a subspace that yields a more salient representation of the original data. Experiments demonstrate the effectiveness of the proposed method.
Two-stage Sampled Learning Theory on Distributions
We focus on the distribution regression problem: regressing to a real-valued
response from a probability distribution. Although there exist a large number
of similarity measures between distributions, very little is known about their
generalization performance in specific learning tasks. Learning problems
formulated on distributions have an inherent two-stage sampled difficulty: in
practice only samples from sampled distributions are observable, and one has to
build an estimate on similarities computed between sets of points. To the best
of our knowledge, the only existing method with consistency guarantees for
distribution regression requires kernel density estimation as an intermediate
step (which suffers from slow convergence issues in high dimensions), and the
domain of the distributions to be compact Euclidean. In this paper, we provide
theoretical guarantees for a remarkably simple algorithmic alternative to solve
the distribution regression problem: embed the distributions into a reproducing
kernel Hilbert space, and learn a ridge regressor from the embeddings to the
outputs. Our main contribution is to prove the consistency of this technique in
the two-stage sampled setting under mild conditions (on separable, topological
domains endowed with kernels). For a given total number of observations, we
derive convergence rates as an explicit function of the problem difficulty. As
a special case, we answer a 15-year-old open question: we establish the
consistency of the classical set kernel [Haussler, 1999; Gaertner et al., 2002]
in regression, and cover more recent kernels on distributions, including those
due to [Christmann and Steinwart, 2010].
Comment: v6: accepted at AISTATS-2015 for oral presentation; final version;
code: https://bitbucket.org/szzoli/ite/; extension to the misspecified and
vector-valued case: http://arxiv.org/abs/1411.206
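
The two-step recipe described in the abstract above (embed each sampled distribution into an RKHS, then fit a ridge regressor on the embeddings) can be illustrated with a minimal sketch. This is not the authors' code from the linked toolbox. It assumes a Gaussian base kernel, a linear kernel on the empirical mean embeddings (the set kernel), and a synthetic task where each bag is drawn from a 1-D Gaussian and the response is that Gaussian's mean. All function names and parameter values (`gamma`, `lam`, bag sizes) are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def set_kernel(bags_a, bags_b, gamma):
    """Inner product of empirical mean embeddings: average of k over all cross pairs."""
    G = np.empty((len(bags_a), len(bags_b)))
    for i, a in enumerate(bags_a):
        for j, b in enumerate(bags_b):
            G[i, j] = gaussian_gram(a, b, gamma).mean()
    return G

def fit_ridge(bags, y, gamma, lam):
    """Solve (G + l * lam * I) alpha = y, the kernel ridge regression normal equations."""
    G = set_kernel(bags, bags, gamma)
    l = len(bags)
    return np.linalg.solve(G + l * lam * np.eye(l), y)

def predict(train_bags, alpha, test_bags, gamma):
    """Predict responses for new bags from the fitted coefficients."""
    return set_kernel(test_bags, train_bags, gamma) @ alpha

def make_bags(rng, n_bags, n_points):
    """Toy two-stage sampling: draw a mean per bag, then points around that mean."""
    means = rng.uniform(-2.0, 2.0, size=n_bags)
    bags = [rng.normal(m, 1.0, size=(n_points, 1)) for m in means]
    return bags, means

rng = np.random.default_rng(0)
train_bags, y_train = make_bags(rng, 100, 50)
test_bags, y_test = make_bags(rng, 30, 50)

alpha = fit_ridge(train_bags, y_train, gamma=0.5, lam=1e-3)
y_pred = predict(train_bags, alpha, test_bags, gamma=0.5)

rmse = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
# Baseline: always predict the mean training response.
baseline = float(np.sqrt(np.mean((y_train.mean() - y_test) ** 2)))
print(f"set-kernel ridge RMSE: {rmse:.3f}, constant baseline RMSE: {baseline:.3f}")
```

The regressor never sees the underlying distributions, only the sampled points in each bag, which is the two-stage sampled setting the abstract refers to. The pairwise loop in `set_kernel` is quadratic in the number of bags and is written for clarity rather than speed.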
Learning Theory for Distribution Regression
We focus on the distribution regression problem: regressing to vector-valued
outputs from probability measures. Many important machine learning and
statistical tasks fit into this framework, including multi-instance learning
and point estimation problems without analytical solution (such as
hyperparameter or entropy estimation). Despite the large number of available
heuristics in the literature, the inherent two-stage sampled nature of the
problem makes the theoretical analysis quite challenging, since in practice
only samples from sampled distributions are observable, and the estimates have
to rely on similarities computed between sets of points. To the best of our
knowledge, the only existing technique with consistency guarantees for
distribution regression requires kernel density estimation as an intermediate
step (which often performs poorly in practice), and the domain of the
distributions to be compact Euclidean. In this paper, we study a simple,
analytically computable, ridge regression-based alternative to distribution
regression, where we embed the distributions into a reproducing kernel Hilbert
space, and learn the regressor from the embeddings to the outputs. Our main
contribution is to prove that this scheme is consistent in the two-stage
sampled setup under mild conditions (on separable topological domains enriched
with kernels): we present an exact computational-statistical efficiency
trade-off analysis showing that our estimator is able to match the one-stage
sampled minimax optimal rate [Caponnetto and De Vito, 2007; Steinwart et al.,
2009]. This result answers a 17-year-old open question, establishing the
consistency of the classical set kernel [Haussler, 1999; Gaertner et al., 2002]
in regression. We also cover consistency for more recent kernels on
distributions, including those due to [Christmann and Steinwart, 2010].
Comment: Final version appeared at JMLR, with supplement. Code:
https://bitbucket.org/szzoli/ite/. arXiv admin note: text overlap with
arXiv:1402.175
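
For reference, the "analytically computable" estimator mentioned in the abstract above can be written out as follows. The notation is an assumption introduced here for illustration and is not taken from the listing: $k$ is the base kernel on the domain, $\hat{x}_i = \{x_{i,1}, \dots, x_{i,N_i}\}$ is the sample observed from the $i$-th distribution, $K$ is a kernel on the embeddings with RKHS $\mathcal{H}(K)$, $\ell$ is the number of training bags, and $\lambda > 0$ is the regularization parameter. The closed form is stated for real-valued outputs and is the standard kernel ridge regression solution.

```latex
% Mean embedding of a distribution x, and its empirical estimate from the sample \hat{x}_i
\mu_x = \int k(\cdot, u)\,\mathrm{d}x(u),
\qquad
\mu_{\hat{x}_i} = \frac{1}{N_i} \sum_{n=1}^{N_i} k(\cdot, x_{i,n})

% Set kernel: the linear kernel on the empirical mean embeddings
K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})
  = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle
  = \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} k(x_{i,n}, x_{j,m})

% Ridge regression from the embeddings to the outputs
f^{\lambda} = \arg\min_{f \in \mathcal{H}(K)}
  \frac{1}{\ell} \sum_{i=1}^{\ell} \bigl\| f(\mu_{\hat{x}_i}) - y_i \bigr\|^2
  + \lambda \, \| f \|_{\mathcal{H}(K)}^2

% Closed-form prediction for a new sampled distribution \hat{t} (real-valued outputs)
f^{\lambda}(\mu_{\hat{t}})
  = \mathbf{k}^{\top} (\mathbf{G} + \ell \lambda \mathbf{I})^{-1} \mathbf{y},
\quad
\mathbf{G}_{ij} = K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}),
\quad
\mathbf{k}_i = K(\mu_{\hat{x}_i}, \mu_{\hat{t}})
```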