Efficient Classification for Metric Data
Recent advances in large-margin classification of data residing in general
metric spaces (rather than Hilbert spaces) enable classification under various
natural metrics, such as string edit and earthmover distance. A general
framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004]
left open the questions of computational efficiency and of providing direct
bounds on generalization error.
We design a new algorithm for classification in general metric spaces, whose
runtime and accuracy depend on the doubling dimension of the data points, and
can thus achieve superior classification performance in many common scenarios.
The algorithmic core of our approach is an approximate (rather than exact)
solution to the classical problems of Lipschitz extension and of Nearest
Neighbor Search. The algorithm's generalization performance is guaranteed via
the fat-shattering dimension of Lipschitz classifiers, and we present
experimental evidence of its superiority to some common kernel methods. As a
by-product, we offer a new perspective on the nearest neighbor classifier,
which yields significantly sharper risk asymptotics than the classic analysis
of Cover and Hart [IEEE Trans. Info. Theory, 1967].
Comment: This is the full version of an extended abstract that appeared in
Proceedings of the 23rd COLT, 201
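To make the Lipschitz-extension idea in the abstract above concrete, here is a minimal Python sketch of a classifier on strings under edit distance. It is an illustration only: it evaluates the classical McShane-Whitney extension exactly by brute force, whereas the paper describes an approximate solution whose cost depends on the doubling dimension. The names `edit_distance`, `lipschitz_classify`, and `sample` are hypothetical, and the sketch assumes labels in {-1, +1}, both classes present, and no two oppositely labeled points at distance zero.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def lipschitz_classify(x, sample, dist):
    """Sign of the McShane-Whitney Lipschitz extension of the labels.

    sample is a list of (point, label) pairs with label in {-1, +1}.
    Assumption: brute-force exact evaluation, not the paper's algorithm.
    """
    # Smallest Lipschitz constant consistent with the labels:
    # 2 divided by the minimum distance between oppositely labeled points.
    lip = max(2 / dist(p, q)
              for p, yp in sample for q, yq in sample if yp != yq)
    upper = min(y + lip * dist(x, p) for p, y in sample)
    lower = max(y - lip * dist(x, p) for p, y in sample)
    return 1 if 0.5 * (upper + lower) >= 0 else -1


# Hypothetical toy data, for illustration only.
sample = [("kitten", 1), ("mitten", 1), ("sitting", 1),
          ("dog", -1), ("dot", -1), ("cog", -1)]
print(lipschitz_classify("bitten", sample, edit_distance))
print(lipschitz_classify("dig", sample, edit_distance))
```

Each query here costs one distance evaluation per training point; avoiding that linear scan is presumably where approximate nearest neighbor search comes in.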
Weighted dependency graphs
The theory of dependency graphs is a powerful toolbox to prove asymptotic
normality of sums of random variables. In this article, we introduce a more
general notion of weighted dependency graphs and give normality criteria in
this context. We also provide generic tools to prove that some weighted graph
is a weighted dependency graph for a given family of random variables.
To illustrate the power of the theory, we give applications to the following
objects: uniform random pair partitions, the random graph model $G(n,M)$,
uniform random permutations, the symmetric simple exclusion process and
multilinear statistics on Markov chains. The application to random permutations
gives a bivariate extension of a functional central limit theorem of Janson and
Barbour. On Markov chains, we answer positively an open question of Bourdon and
Vallée on the asymptotic normality of subword counts in random texts
generated by a Markovian source.
Comment: 57 pages. Third version: minor modifications, after review process
APPROXIMATION ALGORITHMS FOR POINT PATTERN MATCHING AND SEARCHING
Point pattern matching is a fundamental problem in computational geometry.
Given a reference set and a pattern set, the problem is to find a
geometric transformation applied to the pattern set that minimizes some
given distance measure with respect to the reference set. This problem has
been heavily researched under various distance measures and error models.
Point set similarity searching is a variation of this problem in which a
large database of point sets is given, and the task is to preprocess
this database into a data structure so that, given a query point set,
it is possible to rapidly find the nearest point set among elements of
the database. Here, the term nearest is understood in the
above sense of pattern matching, where the elements of the database may be
transformed to match the given query set. The approach presented here is
to compute a low distortion embedding of the pattern matching problem into
an (ideally) low dimensional metric space and then apply any standard
algorithm for nearest neighbor searching over this metric space.
The main focus of this dissertation is on two problems
in the area of point pattern matching and searching algorithms:
(i) improving the accuracy of alignment-based point pattern matching and
(ii) computing low-distortion embeddings of point sets into vector spaces.
For the first problem, new methods are presented for matching point sets
based on alignments of small subsets of points. It is shown that these methods
lead to better approximation bounds for alignment-based planar point pattern
matching algorithms under the Hausdorff distance. Furthermore, it is shown
that these approximation bounds are nearly the best achievable by alignment-based
methods.
For the second problem, results are presented for two different distance
measures. First, point pattern similarity search under translation for point sets
in multidimensional integer space is considered, where the distance function is
the symmetric difference. A randomized embedding into real space under the L1
metric is given. The algorithm achieves an expected distortion of O(log^2 n).
Second, an algorithm is given for embedding R^d under the Earth Mover's
Distance (EMD) into multidimensional integer space under the symmetric difference
distance. This embedding achieves a distortion of O(log D), where D is
the diameter of the point set. Combining this with the above result implies that
point pattern similarity search with translation under the EMD can be embedded
into real space in the L1 metric with an expected distortion of O(log^2 n log D).
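As a rough illustration of alignment-based matching under the Hausdorff distance, as described in the abstract above, the following Python sketch tries every translation that maps one pattern point onto one reference point and keeps the best. This is the textbook baseline under translation only, not the improved methods of the dissertation. The function names are hypothetical, and the brute-force cost is an assumption made for clarity.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite planar point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


def best_translation_by_alignment(pattern, reference):
    """Try each translation aligning a pattern point with a reference point.

    Returns (distance, translation) for the best candidate found.
    Assumption: translations only, with no rotation or scaling.
    """
    best = None
    for px, py in pattern:
        for qx, qy in reference:
            t = (qx - px, qy - py)
            moved = [(x + t[0], y + t[1]) for x, y in pattern]
            d = hausdorff(moved, reference)
            if best is None or d < best[0]:
                best = (d, t)
    return best


# Hypothetical example: the pattern is the reference shifted by (3, -1).
reference = [(0, 0), (1, 0), (0, 2)]
pattern = [(3, -1), (4, -1), (3, 1)]
print(best_translation_by_alignment(pattern, reference))
```

Only candidate translations induced by point pairs are examined, so the result is in general an approximation of the optimal translation rather than the optimum itself.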
Euler Characteristic Tools For Topological Data Analysis
In this article, we study Euler characteristic techniques in topological data
analysis. Pointwise computing the Euler characteristic of a family of
simplicial complexes built from data gives rise to the so-called Euler
characteristic profile. We show that this simple descriptor achieves
state-of-the-art performance in supervised tasks at a very low computational
cost. Inspired by signal analysis, we compute hybrid transforms of Euler
characteristic profiles. These integral transforms mix Euler characteristic
techniques with Lebesgue integration to provide highly efficient compressors of
topological signals. As a consequence, they show remarkable performances in
unsupervised settings. On the qualitative side, we provide numerous heuristics
on the topological and geometric information captured by Euler profiles and
their hybrid transforms. Finally, we prove stability results for these
descriptors as well as asymptotic guarantees in random settings.
Comment: 39 pages
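A small Python sketch may help picture the Euler characteristic profile mentioned in the abstract above. It assumes a Vietoris-Rips filtration in which a simplex enters at the largest pairwise distance among its vertices, and it truncates the complex at `max_dim`; both choices, along with the name `euler_profile`, are assumptions for illustration and not taken from the article.

```python
from itertools import combinations
import math


def euler_profile(points, scales, max_dim=2):
    """Euler characteristic of a truncated Rips complex at each scale.

    A simplex is present at scale r when all of its pairwise distances
    are at most r. Simplices above dimension max_dim are ignored.
    """
    entries = []  # (scale at which the simplex appears, its dimension)
    for k in range(1, max_dim + 2):  # k vertices span a (k-1)-simplex
        for simplex in combinations(range(len(points)), k):
            diam = max((math.dist(points[i], points[j])
                        for i, j in combinations(simplex, 2)), default=0.0)
            entries.append((diam, k - 1))
    # Alternating sum of simplex counts, evaluated pointwise in r.
    return [sum((-1) ** dim for diam, dim in entries if diam <= r)
            for r in scales]


# Hypothetical example: the four corners of the unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(euler_profile(square, [0, 1, 1.5]))             # [4, 0, 2]
print(euler_profile(square, [0, 1, 1.5], max_dim=3))  # [4, 0, 1]
```

At scale 1 the four sides form a cycle (4 vertices minus 4 edges), and at scale 1.5 the diagonals and triangles appear; including the tetrahedron with `max_dim=3` brings the value down to 1, that of a contractible space. Enumerating all simplices is exponential in `max_dim`, so this is only suitable for tiny inputs.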
Stable Vectorization of Multiparameter Persistent Homology using Signed Barcodes as Measures
Persistent homology (PH) provides topological descriptors for geometric data,
such as weighted graphs, which are interpretable, stable to perturbations, and
invariant under, e.g., relabeling. Most applications of PH focus on the
one-parameter case -- where the descriptors summarize the changes in topology
of data as it is filtered by a single quantity of interest -- and there is now
a wide array of methods enabling the use of one-parameter PH descriptors in
data science, which rely on the stable vectorization of these descriptors as
elements of a Hilbert space. Although the multiparameter PH (MPH) of data that
is filtered by several quantities of interest encodes much richer information
than its one-parameter counterpart, the scarceness of stability results for MPH
descriptors has so far limited the available options for the stable
vectorization of MPH. In this paper, we aim to bring together the best of both
worlds by showing how the interpretation of signed barcodes -- a recent family
of MPH descriptors -- as signed measures leads to natural extensions of
vectorization strategies from one parameter to multiple parameters. The
resulting feature vectors are easy to define and to compute, and provably
stable. While, as a proof of concept, we focus on simple choices of signed
barcodes and vectorizations, we already see notable performance improvements
when comparing our feature vectors to state-of-the-art topology-based methods
on various types of data.
Comment: 23 pages, 3 figures, 8 tables