Statistical topological data analysis using persistence landscapes
We define a new topological summary for data that we call the persistence
landscape. Since this summary lies in a vector space, it is easy to combine
with tools from statistics and machine learning, in contrast to the standard
topological summaries. Viewed as a random variable with values in a Banach
space, this summary obeys a strong law of large numbers and a central limit
theorem. We show how a number of standard statistical tests can be used for
statistical inference using this summary. We also prove that this summary is
stable and that it can be used to provide lower bounds for the bottleneck and
Wasserstein distances.
Comment: 26 pages, final version, to appear in Journal of Machine Learning
Research; includes two additional examples not in the journal version: random
geometric complexes and Erdos-Renyi random clique complexes.
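As a concrete illustration of the summary described above: the k-th landscape function lambda_k(t) is the k-th largest value, over the bars (b, d) of a persistence diagram, of the "tent" function max(0, min(t - b, d - t)). A minimal NumPy sketch of this construction (the function name and grid discretization are illustrative choices, not taken from the paper):

```python
import numpy as np

def landscape(diagram, ks, grid):
    """Evaluate the first `ks` persistence landscape functions on `grid`.

    diagram: iterable of (birth, death) pairs; grid: 1-D array of t values.
    Returns an array of shape (ks, len(grid)) holding lambda_k(t).
    """
    diagram = np.asarray(diagram, dtype=float)
    b = diagram[:, 0][:, None]                 # births, as a column
    d = diagram[:, 1][:, None]                 # deaths, as a column
    t = np.asarray(grid, dtype=float)[None, :]
    # tent function of each bar: max(0, min(t - b, d - t))
    tents = np.maximum(0.0, np.minimum(t - b, d - t))
    # lambda_k(t) is the k-th largest tent value at t (0 when fewer bars)
    tents = np.sort(tents, axis=0)[::-1]
    out = np.zeros((ks, t.shape[1]))
    m = min(ks, tents.shape[0])
    out[:m] = tents[:m]
    return out
```

Because the output is an ordinary array, the usual vector-space operations (averaging, norms, inner products) apply directly, which is exactly the advantage over working with diagrams.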
Multiple testing with persistent homology
Multiple hypothesis testing requires a control procedure. Simply increasing
simulations or permutations to meet a Bonferroni-style threshold is
prohibitively expensive. In this paper we propose a null model based approach
to testing for acyclicity, coupled with a Family-Wise Error Rate (FWER) control
method that does not suffer from these computational costs. We adapt a False
Discovery Rate (FDR) control approach to the topological setting, and show it
to be compatible both with our null model approach and with previous approaches
to hypothesis testing in persistent homology. By extending a limit theorem for
persistent homology on samples from point processes, we provide theoretical
validation for our FWER and FDR control methods.
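FDR control in this setting is typically built on the standard Benjamini-Hochberg step-up procedure. A minimal NumPy sketch of that generic procedure (not the paper's topological adaptation or null model):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean mask of rejected hypotheses, controlling the
    false discovery rate at level alpha (under independence).
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # compare the i-th smallest p-value against alpha * i / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= alpha*i/m
        reject[order[:k + 1]] = True       # reject all hypotheses up to rank k
    return reject
```

In the topological setting the p-values would come from tests on persistence features; the step-up logic itself is unchanged.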
Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study
In this paper, we propose a method for measuring the similarity of low-sample
tabular data to synthetically generated data with a larger number of samples
than the original; this process is also known as data augmentation. However,
significance levels obtained from non-parametric tests are suspect when the
sample size is small. Our method uses a combination of geometry, topology and robust
statistics for hypothesis testing in order to compare the validity of generated
data. We also compare the results with common global metric methods available
in the literature for large sample size data.
The persistence landscape and some of its properties
Persistence landscapes map persistence diagrams into a function space, which
may often be taken to be a Banach space or even a Hilbert space. In the latter
case, it is a feature map and there is an associated kernel. The main advantage
of this summary is that it allows one to apply tools from statistics and
machine learning. Furthermore, the mapping from persistence diagrams to
persistence landscapes is stable and invertible. We introduce a weighted
version of the persistence landscape and define a one-parameter family of
Poisson-weighted persistence landscape kernels that may be useful for learning.
We also demonstrate some additional properties of the persistence landscape.
First, the persistence landscape may be viewed as a tropical rational function.
Second, in many cases it is possible to exactly reconstruct all of the
component persistence diagrams from an average persistence landscape. It
follows that the persistence landscape kernel is characteristic for certain
generic empirical measures. Finally, the persistence landscape distance may be
arbitrarily small compared to the interleaving distance.
Comment: 18 pages, to appear in the Proceedings of the 2018 Abel Symposium.
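When landscapes are taken in a Hilbert space, the associated kernel is simply the inner product of landscape functions. A sketch assuming both landscapes are sampled on a common grid, with trapezoid quadrature weights (the discretization is an illustrative choice; this is the plain kernel, not the paper's Poisson-weighted family):

```python
import numpy as np

def landscape_kernel(L1, L2, grid):
    """Inner-product kernel between two landscapes on a shared grid.

    L1, L2: arrays of shape (K, len(grid)) holding lambda_k values.
    Sums, over k, the L2 inner products approximated by the trapezoid rule.
    """
    grid = np.asarray(grid, dtype=float)
    dt = np.diff(grid)
    w = np.zeros_like(grid)        # trapezoid quadrature weights
    w[:-1] += dt / 2
    w[1:] += dt / 2
    return float(np.sum((L1 * L2) * w))
```

A weighted variant, as in the paper, would rescale each level k before summing; the inner-product structure is what makes the summary usable with kernel methods.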
Subsampling Methods for Persistent Homology
Persistent homology is a multiscale method for analyzing the shape of sets
and functions from point cloud data arising from an unknown distribution
supported on those sets. When the size of the sample is large, direct
computation of the persistent homology is prohibitive due to the combinatorial
nature of the existing algorithms. We propose to compute the persistent
homology of several subsamples of the data and then combine the resulting
estimates. We study the risk of two estimators and we prove that the
subsampling approach carries stable topological information while achieving a
great reduction in computational complexity.
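The subsample-and-combine strategy can be sketched as follows, with `compute_landscape` standing in for any persistent-homology pipeline that maps a point cloud to a landscape array on a fixed grid (a hypothetical placeholder, not the authors' code):

```python
import numpy as np

def subsample_landscape_mean(X, compute_landscape, n_sub, size, seed=0):
    """Average the landscapes of `n_sub` random subsamples of X.

    X: point cloud (array with one point per row, or a 1-D sample);
    compute_landscape: callable mapping a subsample to a landscape
    array on a fixed grid. Averaging in the landscape vector space
    is what makes the subsample estimates easy to combine.
    """
    rng = np.random.default_rng(seed)
    acc = None
    for _ in range(n_sub):
        idx = rng.choice(len(X), size=size, replace=False)
        L = compute_landscape(X[idx])          # landscape of one subsample
        acc = L if acc is None else acc + L
    return acc / n_sub
```

Each subsample's persistent homology is cheap relative to the full data set, and the stability results cited above bound the topological information lost by averaging.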
Interpretable statistics for complex modelling: quantile and topological learning
As the complexity of our data has increased exponentially in recent decades, so has our
need for interpretable features. This thesis revolves around two paradigms to approach
this quest for insights.
In the first part we focus on parametric models, where the problem of interpretability
can be seen as a “parametrization selection”. We introduce a quantile-centric
parametrization and we show the advantages of our proposal in the context of regression,
where it allows us to bridge the gap between classical generalized linear (mixed)
models and increasingly popular quantile methods.
The second part of the thesis, concerned with topological learning, tackles the
problem from a non-parametric perspective. As topology can be thought of as a way
of characterizing data in terms of their connectivity structure, it allows us to
represent complex and possibly high-dimensional data through a few features, such
as the number of
connected components, loops and voids. We illustrate how the emerging branch of
statistics devoted to recovering topological structures in the data, Topological Data
Analysis, can be exploited both for exploratory and inferential purposes with a special
emphasis on kernels that preserve the topological information in the data.
Finally, we show with an application how these two approaches can borrow strength
from one another in the identification and description of brain activity through fMRI
data from the ABIDE project.