
    Anisotropic oracle inequalities in noisy quantization

    The effect of errors in variables in quantization is investigated. We prove general exact and non-exact oracle inequalities with fast rates for an empirical minimization based on a noisy sample $Z_i=X_i+\epsilon_i$, $i=1,\ldots,n$, where the $X_i$ are i.i.d. with density $f$ and the $\epsilon_i$ are i.i.d. with density $\eta$. These rates depend on the geometry of the density $f$ and the asymptotic behaviour of the characteristic function of $\eta$. This general study can be applied to the problem of $k$-means clustering with noisy data. For this purpose, we introduce a deconvolution $k$-means stochastic minimization which reaches fast rates of convergence under Pollard's standard regularity assumptions. Comment: 30 pages. arXiv admin note: text overlap with arXiv:1205.141

    Fast rates for noisy clustering

    The effect of errors in variables in empirical minimization is investigated. Given a loss $l$ and a set of decision rules $\mathcal{G}$, we prove a general upper bound for an empirical minimization based on a deconvolution kernel and a noisy sample $Z_i=X_i+\epsilon_i$, $i=1,\ldots,n$. We apply this general upper bound to give the rate of convergence for the expected excess risk in noisy clustering. A recent bound from \citet{levrard} proves that this rate is $\mathcal{O}(1/n)$ in the direct case, under Pollard's regularity assumptions. Here the effect of noisy measurements gives a rate of the form $\mathcal{O}(1/n^{\frac{\gamma}{\gamma+2\beta}})$, where $\gamma$ is the Hölder regularity of the density of $X$ whereas $\beta$ is the degree of ill-posedness.
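A quick sanity check on this exponent (our own illustration, not a claim from the paper): letting the ill-posedness vanish, $\beta \to 0$, gives $\mathcal{O}(1/n^{\frac{\gamma}{\gamma+2\beta}}) \to \mathcal{O}(1/n)$, recovering the direct-case rate of \citet{levrard}; conversely, a Lipschitz density ($\gamma = 1$) observed through noise of ill-posedness degree $\beta = 1$ already degrades the rate to $\mathcal{O}(1/n^{1/3})$.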

    Bandwidth selection in kernel empirical risk minimization via the gradient

    In this paper, we deal with the data-driven selection of multidimensional and possibly anisotropic bandwidths in the general framework of kernel empirical risk minimization. We propose a universal selection rule, which leads to optimal adaptive results in a large variety of statistical models, such as nonparametric robust regression and statistical learning with errors in variables. These results are stated in the context of smooth loss functions, where the gradient of the risk appears as a good criterion to measure the performance of our estimators. The selection rule consists of a comparison of gradient empirical risks. It can be viewed as a nontrivial extension of the so-called Goldenshluger-Lepski method to nonlinear estimators. Furthermore, one main advantage of our selection rule is that it does not depend on the Hessian matrix of the risk, which is usually involved in standard adaptive procedures. Comment: Published at http://dx.doi.org/10.1214/15-AOS1318 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
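The comparison principle behind such rules can be illustrated on a much simpler problem than the paper's gradient-based setting. The toy sketch below (our own construction: the penalty constant, the bandwidth list, and the use of plain kernel density estimation instead of gradient empirical risks are all assumptions) selects a bandwidth by a Goldenshluger-Lepski-style comparison: for each candidate, a bias proxy keeps only the part of the gap to finer estimators that a variance-order penalty cannot explain.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, 500)
grid = np.linspace(-4, 4, 200)
bandwidths = [0.05, 0.1, 0.2, 0.4, 0.8]
n = len(X)

def kde(X, grid, h):
    # Gaussian kernel density estimate evaluated on a grid
    z = (grid[:, None] - X[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def penalty(h, c=1.0):
    # variance-order term, ~ sqrt(log n / (n h)); the constant c is ours
    return c * np.sqrt(np.log(n) / (n * h))

est = {h: kde(X, grid, h) for h in bandwidths}

def bias_proxy(h):
    # compare the h-estimator with every finer bandwidth h' <= h, keeping
    # only the part of the sup-norm gap the penalty cannot account for
    gaps = [max(np.max(np.abs(est[h] - est[hp])) - penalty(hp), 0.0)
            for hp in bandwidths if hp <= h]
    return max(gaps)

# selected bandwidth: minimizer of (bias proxy + penalty)
crit = {h: bias_proxy(h) + penalty(h) for h in bandwidths}
h_star = min(crit, key=crit.get)
```

The paper's rule replaces the sup-norm gap between density estimates by a gap between gradient empirical risks, which is what removes the dependence on the Hessian of the risk.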

    The algorithm of noisy k-means

    In this note, we introduce a new algorithm for finite-dimensional clustering with errors in variables. The design of this algorithm is based on recent theoretical advances (see Loustau (2013a,b)) in statistical learning with errors in variables. As in the previously mentioned papers, the algorithm mixes tools from the inverse-problem literature and the machine-learning community. Coarsely, it is based on a two-step procedure: (1) a deconvolution step to deal with noisy inputs, and (2) Newton's iterations, as in the popular k-means algorithm.
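A minimal one-dimensional sketch of this two-step pipeline (our own illustration, not the authors' implementation: we assume Gaussian noise with a known level, a Fourier cut-off kernel, and we replace the Newton iterations by plain Lloyd steps on a density-weighted grid):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated noisy sample Z = X + eps: X is a two-cluster mixture,
# eps is Gaussian noise with known level sigma (our assumption)
n, sigma = 2000, 0.3
X = np.concatenate([rng.normal(-2, 0.4, n // 2), rng.normal(2, 0.4, n // 2)])
Z = X + rng.normal(0, sigma, n)

# step (1): deconvolution kernel density estimate of the density of X,
# using a kernel whose Fourier transform is the indicator of [-1/h, 1/h]
def deconv_density(Z, grid, sigma, h):
    t = np.linspace(-1 / h, 1 / h, 512)
    dt = t[1] - t[0]
    emp_cf = np.exp(1j * np.outer(t, Z)).mean(axis=1)   # empirical cf of Z
    noise_cf = np.exp(-0.5 * (sigma * t) ** 2)          # cf of N(0, sigma^2)
    f_hat = (np.exp(-1j * np.outer(grid, t)) @ (emp_cf / noise_cf)).real
    return np.clip(f_hat * dt / (2 * np.pi), 0.0, None)

grid = np.linspace(-5, 5, 400)
f_hat = deconv_density(Z, grid, sigma, h=0.25)

# step (2): Lloyd (k-means) iterations on the grid, weighted by f_hat
# (the paper uses Newton's iterations; Lloyd steps keep the sketch short)
def weighted_kmeans(grid, w, k, iters=50):
    centers = np.quantile(grid, np.linspace(0.1, 0.9, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(grid[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            wj = w * (labels == j)
            if wj.sum() > 0:
                centers[j] = (wj * grid).sum() / wj.sum()
    return np.sort(centers)

centers = weighted_kmeans(grid, f_hat, k=2)
```

Running k-means directly on the noisy sample Z would bias the centers outwards, since the noise inflates the within-cluster spread; deconvolving first targets the clean density of X.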

    A Quasi-Bayesian Perspective to Online Clustering

    When faced with high-frequency streams of data, clustering raises both theoretical and algorithmic pitfalls. We introduce a new, adaptive online clustering algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e., time-dependent) estimation of the (unknown and changing) number of clusters. We prove that our approach is supported by minimax regret bounds. We also provide an RJMCMC-flavored implementation (called PACBO, see https://cran.r-project.org/web/packages/PACBO/index.html) for which we give a convergence guarantee. Finally, numerical experiments illustrate the potential of our procedure.
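To give a feel for time-dependent cluster counts in a streaming setting, here is a deliberately crude heuristic (entirely our own toy, far simpler than the quasi-Bayesian PACBO procedure; the distance threshold is an assumption): a point far from every current center opens a new cluster, otherwise the nearest center takes a running-mean step.

```python
import numpy as np

def online_cluster(stream, threshold):
    # toy online clustering with a growing number of clusters:
    # each point either opens a new cluster (if farther than `threshold`
    # from all centers) or pulls the nearest center by a running mean
    centers, counts = [], []
    for z in stream:
        if centers:
            d = np.abs(np.array(centers) - z)
            j = int(np.argmin(d))
        if not centers or d[j] > threshold:
            centers.append(float(z))
            counts.append(1)
        else:
            counts[j] += 1
            centers[j] += (z - centers[j]) / counts[j]
    return sorted(centers)

rng = np.random.default_rng(3)
means = rng.choice([-3.0, 0.0, 3.0], size=600)   # three true clusters
stream = rng.normal(means, 0.3)
centers = online_cluster(stream, threshold=1.5)
```

Unlike this heuristic, PACBO maintains a quasi-posterior over partitions and lets RJMCMC moves add or remove clusters, which is what the minimax regret bounds analyze.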

    Noisy classification with boundary assumptions

    We address the problem of classification when data are collected from two samples with measurement errors. This problem turns out to be an inverse problem and requires a specific treatment. In this context, we investigate the minimax rates of convergence using both a margin assumption and a smoothness condition on the boundary of the set associated with the Bayes classifier. We establish lower and upper bounds (based on a deconvolution classifier) on these rates.

    Minimax fast rates for discriminant analysis with errors in variables

    The effect of measurement errors in discriminant analysis is investigated. Given observations $Z=X+\epsilon$, where $\epsilon$ denotes a random noise, the goal is to predict the density of $X$ among two possible candidates $f$ and $g$. We suppose that we have at our disposal two learning samples. The aim is to approach the best possible decision rule $G^\star$, defined as a minimizer of the Bayes risk. In the noise-free case $(\epsilon=0)$, minimax fast rates of convergence are well known under the margin assumption in discriminant analysis (see \cite{mammen}) or in the more general classification framework (see \cite{tsybakov2004,AT}). In this paper we intend to establish similar results in the noisy case, i.e. when dealing with errors in variables. We prove minimax lower bounds for this problem and explain how these rates can be attained, using in particular an Empirical Risk Minimizer (ERM) method based on deconvolution kernel estimators.
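A simplified one-dimensional illustration of the underlying idea (our own sketch, using a plug-in rule rather than the paper's ERM, and assuming Gaussian noise with known level): estimate $f$ and $g$ from the two noisy learning samples by a deconvolution kernel, then classify a point to the class whose estimated density is larger.

```python
import numpy as np

rng = np.random.default_rng(2)

# two noisy learning samples: Z = X + eps, with X ~ f (class 0) or g (class 1),
# and Gaussian noise of known level sigma (our assumption)
n, sigma = 1500, 0.2
Z0 = rng.normal(-1, 0.5, n) + rng.normal(0, sigma, n)
Z1 = rng.normal(+1, 0.5, n) + rng.normal(0, sigma, n)

def deconv_density(Z, grid, sigma, h):
    # deconvolution kernel density estimate with Fourier cut-off at 1/h
    t = np.linspace(-1 / h, 1 / h, 400)
    dt = t[1] - t[0]
    emp_cf = np.exp(1j * np.outer(t, Z)).mean(axis=1)   # empirical cf of Z
    ratio = emp_cf / np.exp(-0.5 * (sigma * t) ** 2)    # divide out noise cf
    return (np.exp(-1j * np.outer(grid, t)) @ ratio).real * dt / (2 * np.pi)

grid = np.linspace(-4, 4, 300)
f_hat = deconv_density(Z0, grid, sigma, h=0.3)
g_hat = deconv_density(Z1, grid, sigma, h=0.3)

# plug-in decision rule: predict class 1 wherever g_hat >= f_hat
decision = g_hat >= f_hat
```

The paper's ERM approach instead minimizes a deconvoluted empirical risk over a class of sets, which is what allows fast rates under the margin assumption; the plug-in rule above only conveys why deconvolution enters the decision rule.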

    Online clustering: the PAC-Bayesian point of view

    In this work we are interested in the construction and implementation of an online clustering method. Faced with massive data streams, clustering is a challenge both theoretically and algorithmically. We propose a new online clustering algorithm based on the PAC-Bayesian approach. In particular, the number of clusters is estimated dynamically (i.e., it may change over time), and we prove sparsity regret bounds. Moreover, an RJMCMC-based algorithm, called Paco, is presented, and its performance on simulated data is discussed. Keywords: sparsity regret bounds, online clustering, Reversible Jump MCMC, PAC-Bayesian theory. Abstract. We address the online clustering problem. When faced with high-frequency streams of data, clustering raises theoretical and algorithmic pitfalls. Working under a sparsity assumption, a new online clustering algorithm is introduced. Our procedure relies on the PAC-Bayesian approach, allowing for a dynamic (i.e., time-dependent) estimation of the number of clusters. Its theoretical merits are supported by sparsity regret bounds, and an RJMCMC-flavored implementation called Paco is proposed along with numerical experiments to assess its potential.

    Temperature extremes of 2022 reduced carbon uptake by forests in Europe

    The year 2022 saw record-breaking temperatures in Europe during both summer and fall. Similar to the recent 2018 drought, close to 30% (3.0 million km2) of the European continent was under severe summer drought. In 2022, the drought was located in central and southeastern Europe, in contrast to the northern-centered 2018 drought. We show, using multiple sets of observations, a reduction of net biospheric carbon uptake in summer (56-62 TgC) over the drought area. Specific sites in France even showed widespread summertime carbon release by forests, in addition to wildfires. A warm autumn with prolonged biospheric carbon uptake partially compensated (32%) for the drought-induced decrease in uptake. The severity of this second drought event in five years suggests that drought-induced reductions in carbon uptake are no longer exceptional, and should be factored into Europe's developing plans for net-zero greenhouse gas emissions that rely on carbon uptake by forests.