Inverse Density as an Inverse Problem: The Fredholm Equation Approach
In this paper we address the problem of estimating the ratio $\frac{q}{p}$,
where $p$ is a density function and $q$ is another density, or, more generally,
an arbitrary function. Knowing or approximating this ratio is needed in various
problems of inference and integration, in particular, when one needs to average
a function with respect to one probability distribution, given a sample from
another. It is often referred to as {\it importance sampling} in statistical
inference and is also closely related to the problem of {\it covariate shift}
in transfer learning as well as to various MCMC methods. It may also be useful
for separating the underlying geometry of a space, say a manifold, from the
density function defined on it.
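To make the averaging task concrete, recall the standard importance-sampling
identity (background material, not specific to this paper), where $g$ is the
function to be averaged:
\[
\mathbb{E}_{q}[g] \;=\; \int g(x)\, q(x)\, dx \;=\; \int g(x)\, \frac{q(x)}{p(x)}\, p(x)\, dx \;=\; \mathbb{E}_{p}\!\Big[ g \cdot \frac{q}{p} \Big],
\]
so an estimate of the ratio $\frac{q}{p}$ turns a sample drawn from $p$ into a
Monte Carlo estimate of an expectation under $q$.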
Our approach is based on reformulating the problem of estimating $\frac{q}{p}$
as an inverse problem in terms of an integral operator
corresponding to a kernel, and thus reducing it to an integral equation, known
as the Fredholm problem of the first kind. This formulation, combined with the
techniques of regularization and kernel methods, leads to a principled
kernel-based framework for constructing algorithms and for analyzing them
theoretically.
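In outline, and in my own notation (the paper's precise operator definitions
may differ), the reformulation rests on the identity
\[
(\mathcal{K}_{p} f)(x) \;=\; \int k(x, y)\, f(y)\, p(y)\, dy \;=\; \int k(x, y)\, q(y)\, dy
\qquad \text{for } f = \frac{q}{p},
\]
a Fredholm equation of the first kind in $f$: the left-hand side can be
estimated from a sample drawn from $p$, the right-hand side from a sample drawn
from $q$, and regularization controls the ill-posedness of inverting
$\mathcal{K}_{p}$.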
The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized
Estimator) is flexible, simple and easy to implement.
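To illustrate how little machinery a basic version needs, here is a minimal
sketch of a FIRE-style estimator (my own simplification, not the paper's exact
algorithm: it discretizes the Fredholm equation at the sample points and uses
plain Tikhonov/ridge regularization in $\ell_2$ rather than an RKHS norm; the
function name, bandwidth, and regularization values are illustrative):

import numpy as np

def gaussian_kernel(A, B, h):
    # k(a, b) = exp(-||a - b||^2 / (2 h^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def fire_sketch(Xp, Xq, h=0.5, lam=1e-3):
    # Estimate f = q/p at the points Xp ~ p, given a second sample Xq ~ q,
    # by solving the discretized Fredholm equation K_p f = K_q 1 with
    # Tikhonov (ridge) regularization.
    n, m = len(Xp), len(Xq)
    A = gaussian_kernel(Xp, Xp, h) / n              # empirical K_p
    b = gaussian_kernel(Xp, Xq, h).sum(axis=1) / m  # empirical K_q 1
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Toy check: p = N(0, 1), q = N(1, 1), so the true ratio is exp(x - 1/2).
rng = np.random.default_rng(0)
Xp = rng.normal(0.0, 1.0, size=(200, 1))
Xq = rng.normal(1.0, 1.0, size=(200, 1))
ratio_at_Xp = fire_sketch(Xp, Xq)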
We provide detailed theoretical analysis including concentration bounds and
convergence rates for the Gaussian kernel in the case of densities defined on
$\mathbb{R}^d$, compact domains in $\mathbb{R}^d$, and smooth $d$-dimensional sub-manifolds of
the Euclidean space.
We also show experimental results including applications to classification
and semi-supervised learning within the covariate shift framework and
demonstrate some encouraging experimental comparisons. We also show how the
parameters of our algorithms can be chosen in a completely unsupervised manner.
Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering
Hierarchical clustering is a popular method for analyzing data which
associates a tree to a dataset. Hartigan consistency has been used extensively
as a framework to analyze such clustering algorithms from a statistical point
of view. Still, as we show in the paper, a tree which is Hartigan consistent
with a given density can look very different from the correct limit tree.
Specifically, Hartigan consistency permits two types of undesirable
configurations which we term over-segmentation and improper nesting. Moreover,
Hartigan consistency is a limit property and does not directly quantify
the difference between trees.
In this paper we identify two limit properties, separation and minimality,
which address both over-segmentation and improper nesting and together imply
(but are not implied by) Hartigan consistency. We proceed to introduce a merge
distortion metric between hierarchical clusterings and show that convergence in
our distance implies both separation and minimality. We also prove that uniform
separation and minimality imply convergence in the merge distortion metric.
Furthermore, we show that our merge distortion metric is stable under
perturbations of the density.
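One way to make this concrete (my paraphrase of the construction the abstract
describes; the paper's precise definitions may differ in details): for a density
$f$, define the merge height of two points as
\[
m_{f}(x, y) \;=\; \sup\,\{\lambda \,:\, x \text{ and } y \text{ lie in the same connected component of } \{f \ge \lambda\}\},
\]
and measure the distance between two hierarchical clusterings by the sup-norm
discrepancy of their merge heights,
\[
d(T, T') \;=\; \sup_{x, y}\, \big| m_{T}(x, y) - m_{T'}(x, y) \big|,
\]
so that convergence in this distance bounds, uniformly over pairs of points,
how far an estimated tree's merge structure is from that of the true density.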
Finally, we demonstrate applicability of these concepts by proving
convergence results for two clustering algorithms. First, we show convergence
(and hence separation and minimality) of the recent robust single linkage
algorithm of Chaudhuri and Dasgupta (2010). Second, we provide convergence
results on manifolds for topological split tree clustering
…