480 research outputs found
Operator norm convergence of spectral clustering on level sets
Following Hartigan, a cluster is defined as a connected component of the
t-level set of the underlying density, i.e., the set of points for which the
density is greater than t. A clustering algorithm which combines a density
estimate with spectral clustering techniques is proposed. Our algorithm is
composed of two steps. First, a nonparametric density estimate is used to
extract the data points for which the estimated density takes a value greater
than t. Next, the extracted points are clustered based on the eigenvectors of a
graph Laplacian matrix. Under mild assumptions, we prove the almost sure
convergence in operator norm of the empirical graph Laplacian operator
associated with the algorithm. Furthermore, we describe the typical behavior of the
representation of the dataset in the feature space, which establishes the
strong consistency of our proposed algorithm.
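The two steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth h, the symmetric normalized Laplacian, and the greedy cosine-similarity grouping (used here as a stand-in for k-means) are all choices made for the example, and the function name is hypothetical.

```python
import numpy as np


def level_set_spectral_clustering(X, t, h, k=2):
    """Return (indices of retained points, cluster label of each retained point)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Step 1: Gaussian kernel density estimate at every sample point,
    # then keep the points whose estimated density exceeds the level t.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * h**2))
    density = kernel.sum(axis=1) / (n * (2.0 * np.pi * h**2) ** (d / 2.0))
    keep = np.flatnonzero(density > t)

    # Step 2: symmetric normalized graph Laplacian on the retained points.
    W = kernel[np.ix_(keep, keep)]
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(keep)) - inv_sqrt_deg[:, None] * W * inv_sqrt_deg[None, :]

    # Embed each point using the eigenvectors of the k smallest eigenvalues,
    # and normalize each embedded row to unit length.
    _, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    U = vecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Group embedded points greedily by cosine similarity (assumed simplification).
    labels = np.empty(len(keep), dtype=int)
    reps = []
    for i, row in enumerate(U):
        for c, rep in enumerate(reps):
            if row @ rep > 0.5:
                labels[i] = c
                break
        else:
            reps.append(row)
            labels[i] = len(reps) - 1
    return keep, labels


# Two tight groups of five points and one isolated low-density point.
blob = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([blob, blob + 5.0, [[2.5, 10.0]]])
keep, labels = level_set_spectral_clustering(X, t=0.3, h=0.3)
```

In this toy example the isolated point falls below the density level and is discarded before the spectral step, so only the two dense groups are embedded and labeled.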
On the convergence of maximum variance unfolding
Maximum Variance Unfolding is one of the main methods for (nonlinear)
dimensionality reduction. We study its large sample limit, providing specific
rates of convergence under standard assumptions. We find that it is consistent
when the underlying submanifold is isometric to a convex subset, and we provide
some simple examples where it fails to be consistent.
Nonparametric regression on closed Riemannian manifolds
The nonparametric estimation of the regression function of a real-valued random variable Y on a random object X valued in a closed Riemannian manifold M is considered. A regression estimator which generalizes kernel regression estimators on Euclidean sample spaces is introduced. Under classical assumptions on the kernel and the bandwidth sequence, the asymptotic bias and variance are obtained, and the estimator is shown to converge at the same L2-rate as kernel regression estimators on Euclidean spaces.
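To make the idea of kernel regression with a geodesic distance concrete, here is a small sketch on the unit circle, the simplest closed manifold. It assumes a Nadaraya-Watson-type ratio with an Epanechnikov-style kernel; the estimator studied in the paper may include further geometric correction terms that are not reproduced here, and all function names are hypothetical.

```python
import math


def geodesic_distance(a, b):
    """Arc-length distance between two angles (in radians) on the unit circle."""
    diff = abs(a - b) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)


def kernel_regression(x, xs, ys, h):
    """Kernel-weighted average of ys, with weights based on geodesic distance to x."""
    # Epanechnikov-style weights: positive only within geodesic distance h of x.
    weights = [max(0.0, 1.0 - (geodesic_distance(x, xi) / h) ** 2) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        return float("nan")  # no observation within bandwidth h of x
    return sum(w * y for w, y in zip(weights, ys)) / total


# Two observations on either side of angle 0; the second one wraps around 2*pi.
xs = [0.1, 6.2]
ys = [1.0, 3.0]
estimate = kernel_regression(0.0, xs, ys, h=0.5)
```

Using the geodesic rather than the Euclidean distance on the angles is what lets the observation at 6.2 radians count as a close neighbor of 0.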
Remember the Curse of Dimensionality: The Case of Goodness-of-Fit Testing in Arbitrary Dimension
Despite a substantial literature on nonparametric two-sample goodness-of-fit
testing in arbitrary dimensions spanning decades, that literature makes no mention of
any curse of dimensionality. Only more recently have Ramdas et al. (2015)
discussed this issue in the context of kernel methods by showing that their
performance degrades with the dimension even when the underlying distributions
are isotropic Gaussians. We take a minimax perspective and follow in the
footsteps of Ingster (1987) to derive the minimax rate in arbitrary dimension
when the discrepancy is measured in the L2 metric. That rate is revealed to be
nonparametric and to exhibit a prototypical curse of dimensionality. We further
extend Ingster's work to show that the chi-squared test achieves the minimax
rate. Moreover, we show that the test can be made to work when the
distributions have support of low intrinsic dimension. Finally, inspired by
Ingster (2000), we consider a multiscale version of the chi-square test which
can adapt to unknown smoothness and/or unknown intrinsic dimensionality without
much loss in power.
Comment: This version comes after the publication of the paper in the Journal
of Nonparametric Statistics. The main change is to cite the work of Ramdas et
al. Some very minor typos were also corrected.
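As an illustration of a binned chi-squared two-sample statistic of the kind this abstract refers to, consider the sketch below. It assumes data in the unit cube split into cubes of equal side and a statistic that compares bin frequencies scaled by the pooled frequency; the exact statistic and its calibration in the paper may differ, and the function names are hypothetical.

```python
from collections import Counter


def bin_index(x, n_bins):
    """Index of the cube of side 1/n_bins that contains the point x in [0, 1]^d."""
    return tuple(min(int(c * n_bins), n_bins - 1) for c in x)


def chi_squared_statistic(xs, ys, n_bins):
    """Sum over occupied bins of (freq_x - freq_y)^2 / (freq_x + freq_y)."""
    counts_x = Counter(bin_index(x, n_bins) for x in xs)
    counts_y = Counter(bin_index(y, n_bins) for y in ys)
    stat = 0.0
    for cell in set(counts_x) | set(counts_y):
        freq_x = counts_x[cell] / len(xs)
        freq_y = counts_y[cell] / len(ys)
        stat += (freq_x - freq_y) ** 2 / (freq_x + freq_y)
    return stat


same = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]
stat_same = chi_squared_statistic(same, same, n_bins=2)
stat_apart = chi_squared_statistic([(0.1, 0.1)] * 4, [(0.9, 0.9)] * 4, n_bins=2)
```

A multiscale variant in the spirit of the abstract would evaluate the statistic over several values of n_bins. Calibrating the test, for instance by permutation, is left out of this sketch.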
Inference in phi-families of distributions
This paper is devoted to the study of the parametric family of multivariate distributions obtained by minimizing a convex functional under linear constraints. Under certain assumptions on the convex functional, it is established that this family admits an affine parametrization, and parametric estimation from an i.i.d. random sample is studied. It is also shown that the members of this family are the limit distributions arising in inference based on empirical likelihood. As a consequence, given a probability measure μ0 and an i.i.d. random sample drawn from μ0, nonparametric confidence domains on the generalized moments of μ0 are obtained.
The Normalized Graph Cut and Cheeger Constant: from Discrete to Continuous
Let M be a bounded domain of a Euclidean space with smooth boundary. We
relate the Cheeger constant of M and the conductance of a neighborhood graph
defined on a random sample from M. By restricting the minimization defining the
latter over a particular class of subsets, we obtain consistency (after
normalization) as the sample size increases, and show that any minimizing
sequence of subsets has a subsequence converging to a Cheeger set of M.
Bayesian Methodology for Ocean Color Remote Sensing
66 pages. The inverse ocean color problem, i.e., the retrieval of marine reflectance from top-of-atmosphere (TOA) reflectance, is examined in a Bayesian context. The solution is expressed as a probability distribution that measures the likelihood of encountering specific values of the marine reflectance given the observed TOA reflectance. This conditional distribution, the posterior distribution, allows the construction of reliable multi-dimensional confidence domains of the retrieved marine reflectance. The expectation and covariance of the posterior distribution are computed, which gives for each pixel an estimate of the marine reflectance and a measure of its uncertainty. Situations for which forward model and observation are incompatible are also identified. Prior distributions of the forward model parameters that are suitable for use at the global scale, as well as a noise model, are determined. Partition-based models are defined and implemented for SeaWiFS, to approximate numerically the expectation and covariance. The ill-posed nature of the inverse problem is illustrated, indicating that a large set of ocean and atmospheric states, or pre-images, may correspond to very close values of the satellite signal. Theoretical performance is good globally, i.e., on average over all the geometric and geophysical situations considered, with negligible biases and standard deviation decreasing from 0.004 at 412 nm to 0.001 at 670 nm. Errors are smaller for geometries that avoid Sun glint and minimize air mass and aerosol influence, and for small aerosol optical thickness and maritime aerosols. The estimated uncertainty is consistent with the inversion error. The theoretical concepts and inverse models are applied to actual SeaWiFS imagery, and comparisons are made with estimates from the SeaDAS standard atmospheric correction algorithm and in situ measurements.
The Bayesian and SeaDAS marine reflectance fields exhibit resemblance in patterns of variability, but the Bayesian imagery is less noisy and characterized by different spatial de-correlation scales, with more realistic values in the presence of absorbing aerosols. Experimental errors obtained from match-up data are similar to the theoretical errors determined from simulated data. Regionalization of the inverse models is a natural development to improve retrieval accuracy, for example by including explicit knowledge of the space and time variability of atmospheric variables.
Maximum entropy solution to ill-posed inverse problems with approximately known operator
We consider the linear inverse problem of reconstructing an unknown finite measure μ from a noisy observation of a generalized moment of μ defined as the integral of a continuous and bounded operator Φ with respect to μ. Motivated by various applications, we focus on the case where the operator Φ is unknown; instead, only an approximation Φm to it is available. An approximate maximum entropy solution to the inverse problem is introduced in the form of a minimizer of a convex functional subject to a sequence of convex constraints. Under several assumptions on the convex functional, the convergence of the approximate solution is established.
Sur l'estimation du support d'une densité (On the estimation of the support of a density)
Given an unknown multivariate probability density with compact support and an i.i.d. sample drawn from it, we study the estimator of its support defined as the union of balls of a common radius centered on the observations. To measure the quality of the estimate, we use a general criterion based on the volume of the symmetric difference. Under a few mild assumptions, and using tools from Riemannian geometry, we establish the exact rates of convergence of the support estimator and examine the statistical consequences of these results.
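The support estimator in this abstract, a union of balls centered on the observations, and the symmetric-difference criterion can be sketched in a few lines. The example below is only an illustration: the grid approximation of the volume, the choice of the unit square as the true support, and all names are assumptions made for the example.

```python
import math


def in_union_of_balls(x, sample, r):
    """True if x lies within distance r of at least one observation."""
    return any(math.dist(x, s) <= r for s in sample)


def symmetric_difference_volume(sample, r, in_support, lo, hi, step):
    """Grid approximation (2-D) of the area where estimate and support disagree."""
    n = round((hi - lo) / step)
    volume = 0.0
    for i in range(n):
        for j in range(n):
            # Evaluate both sets at the midpoint of each grid cell.
            x = (lo + (i + 0.5) * step, lo + (j + 0.5) * step)
            if in_union_of_balls(x, sample, r) != in_support(x):
                volume += step * step
    return volume


def unit_square(x):
    """Indicator of the assumed true support [0, 1]^2."""
    return 0.0 <= x[0] <= 1.0 and 0.0 <= x[1] <= 1.0


one_point = [(0.5, 0.5)]
grid_sample = [(0.1 + 0.2 * i, 0.1 + 0.2 * j) for i in range(5) for j in range(5)]
vol_sparse = symmetric_difference_volume(one_point, 0.1, unit_square, -0.5, 1.5, 0.05)
vol_dense = symmetric_difference_volume(grid_sample, 0.15, unit_square, -0.5, 1.5, 0.05)
```

With a single observation the estimate misses almost all of the square, whereas 25 well-spread observations with a slightly larger radius cover it with only a thin excess band outside the boundary.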
Clustering by Estimation of Density Level Sets at a Fixed Probability
In density-based clustering methods, the clusters are defined as the connected components of the upper level sets of the underlying density. In this setting, the practitioner fixes a probability and associates with it a threshold such that the corresponding level set has that probability with respect to the distribution induced by the density. This paper is devoted to the estimation of the threshold, of the level set, and of the number of connected components of this level set. Given a nonparametric density estimate based on an i.i.d. sample drawn from the density, we first propose a computationally simple estimate of the threshold, and we establish a concentration inequality for this estimate. Next, we consider the plug-in level set estimate, and we establish the exact convergence rate of the Lebesgue measure of the symmetric difference between the level set and its estimate. Finally, we propose a computationally simple graph-based estimate of the number of connected components, which is shown to be consistent. Thus, the methodology yields a complete procedure for analyzing the grouping structure of the data as the chosen probability varies.
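The three estimation steps named in this abstract (threshold, level set, number of components) can be sketched on one-dimensional data. The example assumes a Gaussian kernel density estimate, a threshold taken as an empirical quantile of the estimated density at the sample points, and a neighborhood graph with a hypothetical radius r; these are illustrative choices and are not claimed to match the estimators analyzed in the paper.

```python
import math


def kde(x, sample, h):
    """Gaussian kernel density estimate at x from a one-dimensional sample."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm


def threshold_at_probability(sample, p, h):
    """Threshold such that roughly a fraction p of the sample lies at or above it."""
    values = sorted((kde(x, sample, h) for x in sample), reverse=True)
    m = max(1, math.ceil(p * len(sample)))  # number of sample points to retain
    return values[m - 1]


def count_components(points, r):
    """Number of connected components of the graph linking points closer than r."""
    pts = sorted(points)
    if not pts:
        return 0
    # In one dimension, a new component starts at each gap of at least r.
    return 1 + sum(1 for a, b in zip(pts, pts[1:]) if b - a >= r)


sample = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 2.6, 9.0]
h = 0.3
t = threshold_at_probability(sample, p=0.8, h=h)
retained = [x for x in sample if kde(x, sample, h) >= t]  # plug-in level set
n_clusters = count_components(retained, r=1.0)
```

Repeating the last three lines for several values of p gives a picture of how the grouping structure changes with the chosen probability.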