86 research outputs found
Statistical Learning in Wasserstein Space
We seek a generalization of regression and principal component analysis (PCA) in a metric space where data points are distributions metrized by the Wasserstein metric. We recast these analyses as multimarginal optimal transport problems. The particular formulation allows efficient computation, ensures existence of optimal solutions, and admits a probabilistic interpretation over the space of paths (line segments). Application of the theory to the interpolation of empirical distributions, images, power spectra, as well as assessing uncertainty in experimental designs, is envisioned.
Projected Statistical Methods for Distributional Data on the Real Line with the Wasserstein Metric
We present a novel class of projected methods, to perform statistical
analysis on a data set of probability distributions on the real line, with the
2-Wasserstein metric. We focus in particular on Principal Component Analysis
(PCA) and regression. To define these models, we exploit a representation of
the Wasserstein space closely related to its weak Riemannian structure, by
mapping the data to a suitable linear space and using a metric projection
operator to constrain the results in the Wasserstein space. By carefully
choosing the tangent point, we are able to derive fast empirical methods,
exploiting a constrained B-spline approximation. As a byproduct of our
approach, we are also able to derive faster routines for previous work on PCA
for distributions. By means of simulation studies, we compare our approaches to
previously proposed methods, showing that our projected PCA has similar
performance for a fraction of the computational cost and that the projected
regression is extremely flexible even under misspecification. Several
theoretical properties of the models are investigated and asymptotic
consistency is proven. Two real world applications to Covid-19 mortality in the
US and wind speed forecasting are discussed.
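
As a rough illustration of the metric used in the abstract above: on the real line, the 2-Wasserstein distance between two distributions equals the L2 distance between their quantile functions. The sketch below covers only the simplest case, two empirical distributions with the same number of samples, where this reduces to comparing sorted samples. The function name `wasserstein2_1d` is a hypothetical helper chosen for this example and is not taken from the paper.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two empirical distributions on the
    real line, assuming both have the same number of equally weighted samples.

    In one dimension the optimal coupling matches sorted samples, so the
    squared distance is the mean squared difference of the order statistics.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    mean_sq = sum((a - b) ** 2 for a, b in zip(xs_sorted, ys_sorted)) / len(xs)
    return math.sqrt(mean_sq)


# Usage: a sample and a copy of it shifted by a constant.
shifted_distance = wasserstein2_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
print(shifted_distance)
```

Samples of unequal size or with unequal weights would require interpolating the quantile functions on a common grid, which this sketch does not attempt.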
Principal Geodesic Analysis of Merge Trees (and Persistence Diagrams)
This paper presents a computational framework for the Principal Geodesic
Analysis of merge trees (MT-PGA), a novel adaptation of the celebrated
Principal Component Analysis (PCA) framework [87] to the Wasserstein metric
space of merge trees [92]. We formulate MT-PGA computation as a constrained
optimization problem, aiming at adjusting a basis of orthogonal geodesic axes,
while minimizing a fitting energy. We introduce an efficient, iterative
algorithm which exploits shared-memory parallelism, as well as an analytic
expression of the fitting energy gradient, to ensure fast iterations. Our
approach also trivially extends to extremum persistence diagrams. Extensive
experiments on public ensembles demonstrate the efficiency of our approach,
with MT-PGA computations on the order of minutes for the largest examples. We
show the utility of our contributions by extending to merge trees two typical
PCA applications. First, we apply MT-PGA to data reduction and reliably
compress merge trees by concisely representing them by their first coordinates
in the MT-PGA basis. Second, we present a dimensionality reduction framework
exploiting the first two directions of the MT-PGA basis to generate
two-dimensional layouts of the ensemble. We augment these layouts with
persistence correlation views, enabling global and local visual inspections of
the feature variability in the ensemble. In both applications, quantitative
experiments assess the relevance of our framework. Finally, we provide a
lightweight C++ implementation that can be used to reproduce our results.
Efficient Convex PCA with applications to Wasserstein geodesic PCA and ranked data
Convex PCA, which was introduced by Bigot et al., is a dimension reduction
methodology for data with values in a convex subset of a Hilbert space. This
setting arises naturally in many applications, including distributional data in
the Wasserstein space of an interval, and ranked compositional data under the
Aitchison geometry. Our contribution in this paper is threefold. First, we
present several new theoretical results including consistency as well as
continuity and differentiability of the objective function in the finite
dimensional case. Second, we develop a numerical implementation of finite
dimensional convex PCA when the convex set is polyhedral, and show that this
provides a natural approximation of Wasserstein geodesic PCA. Third, we
illustrate our results with two financial applications, namely distributions of
stock returns ranked by size and the capital distribution curve, both of which
are of independent interest in stochastic portfolio theory.
Comment: 40 pages, 9 figures
Wasserstein Regression
The analysis of samples of random objects that do not lie in a vector space
is gaining increasing attention in statistics. An important class of such
object data is univariate probability measures defined on the real line.
Adopting the Wasserstein metric, we develop a class of regression models for
such data, where random distributions serve as predictors and the responses are
either also distributions or scalars. To define this regression model, we
utilize the geometry of tangent bundles of the space of random measures endowed
with the Wasserstein metric for mapping distributions to tangent spaces. The
proposed distribution-to-distribution regression model provides an extension of
multivariate linear regression for Euclidean data and function-to-function
regression for Hilbert space valued data in functional data analysis. In
simulations, it performs better than an alternative transformation approach
where one maps distributions to a Hilbert space through the log quantile
density transformation and then applies traditional functional regression. We
derive asymptotic rates of convergence for the estimator of the regression
operator and for predicted distributions and also study an extension to
autoregressive models for distribution-valued time series. The proposed methods
are illustrated with data on human mortality and distributional time series of
house prices.
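
The tangent-space mapping mentioned in the abstract above can be sketched for distributions on the real line, where everything can be written in quantile coordinates. The example assumes equally sized, equally weighted samples, uses the Wasserstein barycenter as the reference point, and represents each tangent vector as the difference between a sample's quantiles and the reference quantiles. All function names are hypothetical and chosen for illustration; this is not the authors' estimator.

```python
def barycenter_quantiles(samples):
    """Quantiles of the 2-Wasserstein barycenter of equally sized samples.

    On the real line the barycenter's quantile function is the pointwise
    average of the individual quantile functions (here: sorted samples).
    """
    sorted_samples = [sorted(s) for s in samples]
    n = len(sorted_samples[0])
    if any(len(s) != n for s in sorted_samples):
        raise ValueError("this sketch assumes samples of equal size")
    count = len(sorted_samples)
    return [sum(s[k] for s in sorted_samples) / count for k in range(n)]


def log_map(reference_q, sample):
    """Tangent vector at the reference: optimal map minus identity,
    expressed in quantile coordinates."""
    return [q - r for q, r in zip(sorted(sample), reference_q)]


def exp_map(reference_q, tangent):
    """Push the reference quantiles along a tangent vector."""
    return [r + v for r, v in zip(reference_q, tangent)]


def is_valid_quantile(q):
    """A quantile function must be nondecreasing."""
    return all(a <= b for a, b in zip(q, q[1:]))


# Usage: map two samples to the tangent space at their barycenter.
samples = [[0.0, 2.0, 4.0], [6.0, 2.0, 4.0]]
reference = barycenter_quantiles(samples)
tangents = [log_map(reference, s) for s in samples]
print(reference, tangents)
```

Linear operations on tangent vectors can leave the image of the log map: `exp_map` may then return a sequence that is not nondecreasing, which is why `is_valid_quantile` is included as a check.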
Long-time principal geodesic analysis in director-based dynamics of hybrid mechanical systems
In this article, we investigate an extended version of principal geodesic analysis for the unit sphere S^2 and the special orthogonal group SO(3). In contrast to prior work, we address the construction of long-time smooth lifts of possibly non-localized data across branches of the respective logarithm maps. To this end, we pay special attention to certain critical numerical aspects such as singularities and their consequences on the numerical accuracy. Moreover, we apply principal geodesic analysis to investigate the behavior of several mechanical systems that are very rich in dynamics. The examples chosen are computationally modeled by employing a director-based formulation for rigid and flexible mechanical systems. Such a formulation allows us to investigate our algorithms in a direct manner while avoiding the introduction of additional sources of error that are unrelated to principal geodesic analysis. Finally, we test our numerical machinery with the examples and, at the same time, we gain deeper insight into their dynamical behavior.
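
To make the logarithm map and its singularity concrete, the following is a minimal sketch of the standard exponential and logarithm maps on the unit sphere. It shows only the single-branch maps and does not implement the long-time smooth lifts discussed in the abstract above. The function names and the tolerance `1e-12` are assumptions made for this example.

```python
import math


def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def sphere_log(p, q):
    """Logarithm map on the unit sphere: the tangent vector at p that points
    toward q and whose length is the geodesic distance between p and q.

    The map is singular when q is antipodal to p, because the direction of
    the geodesic is then not unique.
    """
    c = max(-1.0, min(1.0, _dot(p, q)))  # clamp against rounding errors
    theta = math.acos(c)
    if theta < 1e-12:
        return [0.0, 0.0, 0.0]
    if math.pi - theta < 1e-12:
        raise ValueError("log map is undefined for antipodal points")
    # Component of q orthogonal to p, rescaled to length theta.
    w = [q_i - c * p_i for p_i, q_i in zip(p, q)]
    norm_w = math.sqrt(_dot(w, w))
    return [theta * w_i / norm_w for w_i in w]


def sphere_exp(p, v):
    """Exponential map on the unit sphere: follow the geodesic from p with
    initial velocity v for unit time."""
    theta = math.sqrt(_dot(v, v))
    if theta < 1e-12:
        return list(p)
    return [math.cos(theta) * p_i + math.sin(theta) * v_i / theta
            for p_i, v_i in zip(p, v)]


# Usage: lift a point to the tangent space at the north pole and map it back.
north = [0.0, 0.0, 1.0]
tangent = sphere_log(north, [1.0, 0.0, 0.0])
print(tangent, sphere_exp(north, tangent))
```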
Regularized Optimal Transport and the Rot Mover's Distance
This paper presents a unified framework for smooth convex regularization of
discrete optimal transport problems. In this context, the regularized optimal
transport turns out to be equivalent to a matrix nearness problem with respect
to Bregman divergences. Our framework thus naturally generalizes a previously
proposed regularization based on the Boltzmann-Shannon entropy related to the
Kullback-Leibler divergence, and solved with the Sinkhorn-Knopp algorithm. We
call the regularized optimal transport distance the rot mover's distance in
reference to the classical earth mover's distance. We develop two generic
schemes that we respectively call the alternate scaling algorithm and the
non-negative alternate scaling algorithm, to compute efficiently the
regularized optimal plans depending on whether the domain of the regularizer
lies within the non-negative orthant or not. These schemes are based on
Dykstra's algorithm with alternate Bregman projections, and further exploit the
Newton-Raphson method when applied to separable divergences. We enhance the
separable case with a sparse extension to deal with high data dimensions. We
also instantiate our proposed framework and discuss the inherent specificities
for well-known regularizers and statistical divergences in the machine learning
and information geometry communities. Finally, we demonstrate the merits of our
methods with experiments using synthetic data to illustrate the effect of
different regularizers and penalties on the solutions, as well as real-world
data for a pattern recognition application to audio scene classification.
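
For orientation, the entropic special case that the abstract above generalizes can be sketched with the Sinkhorn-Knopp iteration, which alternately rescales the rows and columns of a Gibbs kernel until the transport plan matches both marginals. This is a plain-Python sketch of that well-known baseline only, not of the alternate scaling algorithms proposed in the paper. The function name `sinkhorn` and the parameters `reg` and `n_iter` are assumptions made for this example.

```python
import math


def sinkhorn(a, b, cost, reg=0.5, n_iter=500):
    """Entropy-regularized optimal transport via Sinkhorn-Knopp scaling.

    a, b  : source and target weights (each nonnegative, summing to 1)
    cost  : cost matrix as a list of rows, cost[i][j]
    reg   : strength of the entropic regularization
    Returns the regularized transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    # Gibbs kernel associated with the cost matrix.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Usage: move mass between two points with different target weights.
plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]])
print(plan)
```

Small values of `reg` bring the plan closer to the unregularized optimum but make the kernel entries underflow, so practical implementations usually work in the log domain; that refinement is omitted here.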
Statistical learning of random probability measures
The study of random probability measures is a lively research topic that has
attracted interest from different fields in recent years. In this thesis, we consider
random probability measures in the context of Bayesian nonparametrics,
where the law of a random probability measure is used as prior distribution,
and in the context of distributional data analysis, where
the goal is to perform inference given a sample from the law of a random probability measure.
The contributions contained in this thesis can be subdivided according to three
different topics: (i) the use of almost surely discrete repulsive random measures
(i.e., whose support points are well separated) for Bayesian model-based
clustering, (ii) the proposal of new laws for collections of random probability
measures for Bayesian density estimation of partially
exchangeable data subdivided into different groups, and (iii) the study
of principal component analysis and regression models for probability distributions
seen as elements of the 2-Wasserstein space. Specifically, for point
(i) above, we propose an efficient Markov chain Monte Carlo algorithm for
posterior inference, which sidesteps the need for split-merge reversible jump
moves typically associated with poor performance; we propose a model for
clustering high-dimensional data by introducing a novel class of anisotropic
determinantal point processes; and we study the distributional properties of the
repulsive measures, shedding light on important theoretical results that enable
more principled prior elicitation and more efficient posterior simulation
algorithms. For point (ii) above, we consider several models suitable for clustering
homogeneous populations, inducing spatial dependence across groups of
data, extracting the characteristic traits common to all the data-groups, and
propose a novel vector autoregressive model to study the growth
curves of Singaporean kids. Finally, for point (iii), we propose a novel class of
projected statistical methods for distributional data analysis for measures
on the real line and on the unit circle.