Distance Dependent Chinese Restaurant Processes
We develop the distance dependent Chinese restaurant process (CRP), a
flexible class of distributions over partitions that allows for
non-exchangeability. This class can be used to model many kinds of dependencies
between data in infinite clustering models, including dependencies across time
or space. We examine the properties of the distance dependent CRP, discuss its
connections to Bayesian nonparametric mixture models, and derive a Gibbs
sampler for both observed and mixture settings. We study its performance with
three text corpora. We show that relaxing the assumption of exchangeability
with distance dependent CRPs can provide a better fit to sequential data. We
also show its alternative formulation of the traditional CRP leads to a
faster-mixing Gibbs sampling algorithm than the one based on the original
formulation.
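As a rough illustration of the prior described in this abstract, the sketch below draws one partition from a distance dependent CRP. It assumes the usual construction in which each customer links to another customer with probability proportional to a decay function of their distance, or to itself with probability proportional to a concentration parameter, and clusters are the connected components of the resulting links. The names `sample_ddcrp_partition`, `exponential_decay`, `alpha`, and `scale` are illustrative choices, not taken from the abstract, and the Gibbs sampler mentioned above is not shown.

```python
import math
import random


def sample_ddcrp_partition(distances, alpha, decay, rng):
    """Draw customer links and the induced clustering (illustrative sketch).

    distances[i][j] is the distance from customer i to customer j;
    alpha > 0 is the self-link weight; decay maps a distance to a weight >= 0.
    """
    n = len(distances)
    links = []
    for i in range(n):
        # Weight alpha for a self-link, decay(d_ij) for a link to customer j.
        weights = [alpha if j == i else decay(distances[i][j]) for j in range(n)]
        links.append(rng.choices(range(n), weights=weights, k=1)[0])

    # Clusters are connected components of the link graph (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)

    labels = {}
    clusters = [labels.setdefault(find(i), len(labels)) for i in range(n)]
    return links, clusters


def exponential_decay(d, scale=2.0):
    # Assumed decay function; an infinite distance gives zero weight.
    return math.exp(-d / scale) if math.isfinite(d) else 0.0


# Sequential data: customer i may only link to itself or to earlier customers.
n = 10
seq_distances = [[(i - j) if j < i else math.inf for j in range(n)] for i in range(n)]
links, clusters = sample_ddcrp_partition(
    seq_distances, alpha=1.0, decay=exponential_decay, rng=random.Random(0)
)
```

Making every distance to a later customer infinite is what encodes the sequential, non-exchangeable structure here: the first customer can only link to itself, and each later customer links backwards in time or starts a new cluster.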
Bayes and maximum likelihood for -Wasserstein deconvolution of Laplace mixtures
We consider the problem of recovering a distribution function on the real
line from observations additively contaminated with errors following the
standard Laplace distribution. Assuming that the latent distribution is
completely unknown leads to a nonparametric deconvolution problem. We begin by
studying the rates of convergence relative to the -norm and the Hellinger
metric for the direct problem of estimating the sampling density, which is a
mixture of Laplace densities with a possibly unbounded set of locations: the
rate of convergence for the Bayes' density estimator corresponding to a
Dirichlet process prior over the space of all mixing distributions on the real
line matches, up to a logarithmic factor, with the rate
for the maximum likelihood estimator. Then, appealing to an inversion
inequality translating the -norm and the Hellinger distance between
general kernel mixtures, with a kernel density having polynomially decaying
Fourier transform, into any -Wasserstein distance, , between the
corresponding mixing distributions, provided their Laplace transforms are
finite in some neighborhood of zero, we derive the rates of convergence in the
-Wasserstein metric for the Bayes' and maximum likelihood estimators of
the mixing distribution. Merging in the -Wasserstein distance between
Bayes and maximum likelihood follows as a by-product, along with an assessment
on the stochastic order of the discrepancy between the two estimation
procedures.
Compact convex sets of the plane and probability theory
The Gauss-Minkowski correspondence in states the existence of
a homeomorphism between the probability measures on such that
and the compact convex sets (CCS) of the plane
with perimeter 1. In this article, we bring out explicit formulas relating the
border of a CCS to its probability measure. As a consequence, we show that some
natural operations on CCS -- for example, the Minkowski sum -- have natural
translations in terms of probability measure operations, and reciprocally, the
convolution of measures translates into a new notion of convolution of CCS.
Additionally, we give a proof that a polygonal curve associated with a sample
of random variables (satisfying ) converges
to a CCS associated with at speed , a result much similar to
the convergence of the empirical process in statistics. Finally, we employ this
correspondence to present models of smooth random CCS and simulations.
Basic statistics for probabilistic symbolic variables: a novel metric-based approach
In data mining, it is usual to describe a set of individuals using some
summaries (means, standard deviations, histograms, confidence intervals) that
generalize individual descriptions into a typology description. In this case,
data can be described by several values. In this paper, we propose an approach
for computing basic statistics for such data, and, in particular, for data
described by numerical multi-valued variables (interval, histograms, discrete
multi-valued descriptions). We propose to treat all numerical multi-valued
variables as distributional data, i.e. as individuals described by
distributions. To obtain new basic statistics for measuring the variability and
the association between such variables, we extend the classic measure of
inertia, calculated with the Euclidean distance, using the squared Wasserstein
distance defined between probability measures. The distance is a generalization
of the Wasserstein distance, that is a distance between quantile functions of
two distributions. Some properties of such a distance are shown. Among them, we
prove the Huygens theorem of decomposition of the inertia. We illustrate the use of
the Wasserstein distance and of the basic statistics by presenting a k-means-like
clustering algorithm for a set of data described by modal
numerical variables (distributional variables), applied to a real data set. Keywords:
Wasserstein distance, inertia, dependence, distributional data, modal
variables. Comment: 19 pages, 3 figures
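To make the distance in this abstract concrete, the sketch below computes a squared Wasserstein distance as a squared distance between quantile functions. It assumes the simplest setting, not taken from the abstract: two one-dimensional empirical distributions with the same number of equally weighted points, for which the quantile functions are step functions over the sorted values. The function name `squared_wasserstein` is illustrative.

```python
def squared_wasserstein(sample_a, sample_b):
    """Squared L2 Wasserstein distance between two equal-size 1-D samples.

    With equal sizes and uniform weights, the integral of the squared
    difference of the two quantile functions reduces to the mean squared
    difference of the sorted values.
    """
    if len(sample_a) != len(sample_b) or not sample_a:
        raise ValueError("samples must be non-empty and of equal size")
    a, b = sorted(sample_a), sorted(sample_b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Shifting every point by 1 gives a squared distance of 1.
shifted = squared_wasserstein([1, 2, 3], [2, 3, 4])
same = squared_wasserstein([3, 1, 2], [1, 2, 3])
```

Histogram or interval descriptions would need their own quantile functions and a numerical or closed-form integral; this sketch does not cover them.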