16 research outputs found

    Efficient Numerical Integration in Reproducing Kernel Hilbert Spaces via Leverage Scores Sampling

    Full text link
    In this work we consider the problem of numerical integration, i.e., approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We focus on the setting in which the target distribution is only accessible through a set of n i.i.d. observations, and the integrand belongs to a reproducing kernel Hilbert space. We propose an efficient procedure which exploits a small i.i.d. random subset of m < n samples drawn either uniformly or using approximate leverage scores from the initial observations. Our main result is an upper bound on the approximation error of this procedure for both sampling strategies. It yields sufficient conditions on the subsample size to recover the standard (optimal) n^{-1/2} rate while drastically reducing the number of function evaluations, and thus the overall computational cost. Moreover, we obtain rates with respect to the number m of evaluations of the integrand which adapt to its smoothness, and match known optimal rates, for instance for Sobolev spaces. We illustrate our theoretical findings with numerical experiments on real datasets, which highlight the attractive efficiency-accuracy tradeoff of our method compared to existing randomized and greedy quadrature methods. We note that the problem of numerical integration in RKHS amounts to designing a discrete approximation of the kernel mean embedding of the target distribution. As a consequence, direct applications of our results also include the efficient computation of maximum mean discrepancies between distributions and the design of efficient kernel-based tests.
    Comment: 46 pages, 5 figures. Submitted to JML
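    As a purely illustrative sketch of the procedure's flavour, the Python snippet below subsamples m of the n observations uniformly (approximate leverage-score sampling would replace that single line) and solves a small linear system so that the weighted subsample matches the full empirical kernel mean embedding. The Gaussian kernel, regularization constant, and toy integrand are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def quadrature_weights(X, m, reg=1e-8, rng=np.random.default_rng(0)):
    # Uniformly subsample m of the n observations (leverage-score sampling
    # would replace this line) and solve (K_mm + reg I) w = K_mn 1/n, so the
    # weighted subsample matches the full empirical kernel mean embedding.
    n = len(X)
    idx = rng.choice(n, size=m, replace=False)
    K_mm = gaussian_kernel(X[idx], X[idx])
    K_mn = gaussian_kernel(X[idx], X)
    w = np.linalg.solve(K_mm + reg * np.eye(m), K_mn.mean(axis=1))
    return idx, w

# Estimate the integral of f using only m function evaluations.
X = np.random.default_rng(1).normal(size=(5000, 2))   # n i.i.d. observations
f = lambda x: np.sin(x[:, 0]) * np.cos(x[:, 1])       # toy integrand
idx, w = quadrature_weights(X, m=100)
print("quadrature estimate:", w @ f(X[idx]), "| full Monte Carlo:", f(X).mean())
```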

    M²M: A general method to perform various data analysis tasks from a differentially private sketch

    Full text link
    Differential privacy is the standard privacy definition for performing analyses over sensitive data. Yet, its privacy budget bounds the number of tasks an analyst can perform with reasonable accuracy, which makes it challenging to deploy in practice. This can be alleviated by private sketching, where the dataset is compressed into a single noisy sketch vector which can be shared with the analysts and used to perform arbitrarily many analyses. However, the algorithms to perform specific tasks from sketches must be developed on a case-by-case basis, which is a major impediment to their use. In this paper, we introduce the generic moment-to-moment (M²M) method to perform a wide range of data exploration tasks from a single private sketch. Among other things, this method can be used to estimate empirical moments of attributes, the covariance matrix, counting queries (including histograms), and regression models. Our method treats the sketching mechanism as a black-box operation, and can thus be applied to a wide variety of sketches from the literature, widening their range of applications without further engineering or privacy loss, and removing some of the technical barriers to the wider adoption of sketches for data exploration under differential privacy. We validate our method with data exploration tasks on artificial and real-world data, and show that it can be used to reliably estimate statistics and train classification models from private sketches.
    Comment: Published at the 18th International Workshop on Security and Trust Management (STM 2022)
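    The snippet below is a deliberately simplified stand-in for the general idea, not the M²M method itself: it recovers a histogram from one noisy linear sketch by least squares and then answers counting queries from the estimate. The sketch matrix, noise scale, and domain size are all illustrative assumptions (M²M handles more general, possibly nonlinear sketches via moment matching).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a categorical attribute with 20 possible values, held as a histogram.
h = rng.multinomial(10_000, rng.dirichlet(np.ones(20))).astype(float)

# One private linear sketch: random projection plus Gaussian noise. In a real
# deployment the noise scale would be calibrated to the sketch's sensitivity
# and the privacy budget.
Phi = rng.normal(size=(40, 20)) / np.sqrt(40)
s = Phi @ h + rng.normal(scale=5.0, size=40)

# Recover an estimated histogram from the sketch by least squares, then
# answer arbitrarily many counting queries from the estimate.
h_hat, *_ = np.linalg.lstsq(Phi, s, rcond=None)
query = np.zeros(20); query[:5] = 1.0   # "how many items fall in bins 0-4?"
print("true count:", query @ h, "estimated:", query @ h_hat)
```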

    Compressive Learning with Privacy Guarantees

    Get PDF
    This work addresses the problem of learning from large collections of data with privacy guarantees. The compressive learning framework proposes to deal with the large scale of datasets by compressing them into a single vector of generalized random moments, from which the learning task is then performed. We show that a simple perturbation of this mechanism with additive noise is sufficient to satisfy differential privacy, a well-established formalism for defining and quantifying the privacy of a random mechanism. We combine this with a feature subsampling mechanism, which reduces the computational cost without damaging privacy. The framework is applied to the tasks of Gaussian modeling, k-means clustering and principal component analysis (PCA), for which sharp privacy bounds are derived. Empirically, the quality (for subsequent learning) of the compressed representation produced by our mechanism is strongly related to the induced noise level, for which we give analytical expressions.
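    A minimal sketch of the additive-noise idea, assuming a random-Fourier-feature sketch and a crude Laplace-mechanism calibration; the paper derives sharp bounds, whereas the sensitivity used here is only a loose worst-case count.

```python
import numpy as np

def private_sketch(X, d_sketch, epsilon, rng=np.random.default_rng(0)):
    # Random-Fourier-feature sketch: average of bounded generalized moments
    # cos(Omega^T x), sin(Omega^T x) over the dataset.
    n, d = X.shape
    Omega = rng.normal(size=(d, d_sketch))
    Z = np.concatenate([np.cos(X @ Omega), np.sin(X @ Omega)], axis=1)
    sketch = Z.mean(axis=0)
    # Laplace mechanism: replacing one record moves each of the 2*d_sketch
    # coordinates of the mean by at most 2/n, giving a loose L1 sensitivity
    # of 4*d_sketch/n. This is only a crude illustration of the calibration.
    sensitivity = 4 * d_sketch / n
    return sketch + rng.laplace(scale=sensitivity / epsilon, size=2 * d_sketch)

X = np.random.default_rng(1).normal(size=(100_000, 5))
s = private_sketch(X, d_sketch=64, epsilon=1.0)
print(s[:4])  # only the noisy sketch is shared; the raw data never leaves
```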

    Efficient Methods for Compressive Learning with Privacy Guarantees

    No full text
    The topic of this Ph.D. thesis lies on the borderline between signal processing, statistics and computer science. It mainly focuses on compressive learning, a paradigm for large-scale machine learning in which the whole dataset is compressed down to a single vector of randomized generalized moments, called the sketch. An approximate solution of the learning task at hand is then estimated from this sketch, without using the initial data. This framework is by nature suited for learning from distributed collections or data streams, and has already been instantiated with success on several unsupervised learning tasks such as k-means clustering, density fitting using Gaussian mixture models, or principal component analysis. We improve this framework in multiple directions. First, it is shown that perturbing the sketch with additive noise is sufficient to derive (differential) privacy guarantees. Sharp bounds on the noise level required to obtain a given privacy level are provided, and the proposed method is shown empirically to compare favourably with state-of-the-art techniques. Then, the compression scheme is modified to leverage structured random matrices, which reduce the computational cost of the framework and make it possible to learn on high-dimensional data. Lastly, we introduce a new algorithm based on message passing techniques to learn from the sketch for the k-means clustering problem. These contributions open the way for a broader application of the framework.

    Random Projections for Compressive Learning

    Get PDF
    Compressive learning is a framework to drastically compress the volume of large training collections while preserving the information needed to learn. Guided by unsupervised learning examples, we survey the tools involved and the existing theoretical guarantees, both in terms of information preservation and of privacy preservation, and conclude with some open problems.

    Large-Scale High-Dimensional Clustering with Fast Sketching

    Get PDF
    In this paper, we address the problem of high-dimensional k-means clustering in a large-scale setting, i.e. for datasets that comprise a large number of items. Sketching techniques have already been used to deal with this “large-scale” issue, by compressing the whole dataset into a single vector of random nonlinear generalized moments from which the k centroids are then retrieved efficiently. However, this approach usually scales quadratically with the dimension; to cope with high-dimensional datasets, we show how to use fast structured random matrices to compute the sketching operator efficiently. This yields significant speed-ups and memory savings for high-dimensional data, while the clustering results are shown to be much more stable, both on artificial and real datasets.
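    One standard way to build such fast structured projections (used here purely as an illustration; the paper's exact construction may differ) chains random diagonal sign flips with Walsh-Hadamard transforms, replacing an O(d^2) dense matrix-vector product by O(d log d) transforms.

```python
import numpy as np

def fwht(x):
    # Iterative Fast Walsh-Hadamard Transform, O(d log d); len(x) must be a
    # power of 2. Normalized so the transform is orthonormal.
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def structured_projection(x, n_blocks, rng=np.random.default_rng(0)):
    # HD-style structured projection: each block applies a random diagonal
    # sign flip followed by a Hadamard transform, emulating a dense random
    # matrix at O(d log d) cost per block instead of O(d^2).
    return np.concatenate([fwht(rng.choice([-1.0, 1.0], len(x)) * x)
                           for _ in range(n_blocks)])

x = np.random.default_rng(1).normal(size=1024)   # high-dimensional input
y = structured_projection(x, n_blocks=4)         # 4096-dim structured sketch
print(y.shape, np.linalg.norm(y) / np.linalg.norm(x))  # norm grows as sqrt(4)
```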

    Nystr\"om Kernel Mean Embeddings

    Full text link
    Kernel mean embeddings are a powerful tool to represent probability distributions over arbitrary spaces as single points in a Hilbert space. Yet, the cost of computing and storing such embeddings prohibits their direct use in large-scale settings. We propose an efficient approximation procedure based on the Nyström method, which exploits a small random subset of the dataset. Our main result is an upper bound on the approximation error of this procedure. It yields sufficient conditions on the subsample size to obtain the standard n^{-1/2} rate while reducing computational costs. We discuss applications of this result for the approximation of the maximum mean discrepancy and quadrature rules, and illustrate our theoretical findings with numerical experiments.
    Comment: 8 pages
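    To make the MMD application concrete, here is an illustrative Python sketch (the kernel choice, regularization, and landmark count are assumptions) that forms Nyström approximations of two empirical embeddings and uses them to approximate the squared maximum mean discrepancy.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_kme_coeffs(X, m, reg=1e-8, rng=np.random.default_rng(0)):
    # Coefficients alpha such that sum_i alpha_i k(l_i, .) over m random
    # landmarks approximates the empirical kernel mean embedding of X.
    idx = rng.choice(len(X), size=m, replace=False)
    L = X[idx]
    alpha = np.linalg.solve(gaussian_kernel(L, L) + reg * np.eye(m),
                            gaussian_kernel(L, X).mean(axis=1))
    return L, alpha

def approx_mmd2(X, Y, m):
    # Squared MMD between the two Nystrom-approximated embeddings.
    Lx, ax = nystrom_kme_coeffs(X, m, rng=np.random.default_rng(1))
    Ly, ay = nystrom_kme_coeffs(Y, m, rng=np.random.default_rng(2))
    return (ax @ gaussian_kernel(Lx, Lx) @ ax
            + ay @ gaussian_kernel(Ly, Ly) @ ay
            - 2 * ax @ gaussian_kernel(Lx, Ly) @ ay)

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(2000, 2)), rng.normal(loc=0.5, size=(2000, 2))
print("approximate MMD^2:", approx_mmd2(X, Y, m=100))
```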

    Sketched Clustering via Hybrid Approximate Message Passing

    Get PDF
    International audienceIn sketched clustering, a dataset of T samples is first sketched down to a vector of modest size, from which the centroids are subsequently extracted. Its advantages include 1) reduced storage complexity and 2) centroid extraction complexity independent of T. For the sketching methodology recently proposed by Keriven et al., which can be interpreted as a random sampling of the empirical characteristic function, we propose a sketched clustering algorithm based on approximate message passing. Numerical experiments suggest that our approach is more efficient than the state-of-the-art sketched clustering algorithm “CL-OMPR” (in both computational and sample complexity) and more efficient than k-means++ when T is large
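    For intuition, the toy Python sketch below computes such a characteristic-function sketch in a single pass over the data and then decodes centroids with a naive general-purpose optimizer. This is emphatically not the hybrid AMP algorithm of the paper; the frequency distribution, sketch size, and decoder are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Data: T points drawn from 3 clusters in 2-D.
T, k, d = 20_000, 3, 2
true_c = rng.normal(scale=4.0, size=(k, d))
X = true_c[rng.integers(k, size=T)] + rng.normal(size=(T, d))

# Sketch: random samples of the empirical characteristic function,
# z_j = (1/T) sum_t exp(i w_j^T x_t). This is the only pass over the data.
Omega = rng.normal(size=(d, 50))
z = np.exp(1j * X @ Omega).mean(axis=0)

def sketch_cost(c_flat):
    # Mismatch between the data sketch and the sketch of a uniform mixture
    # of Diracs placed at the candidate centroids.
    C = c_flat.reshape(k, d)
    z_model = np.exp(1j * C @ Omega).mean(axis=0)
    return np.sum(np.abs(z_model - z) ** 2)

# Naive decoder: derivative-free minimization from a random start. A real
# decoder (CL-OMPR, or the AMP approach above) is far more robust to
# initialization and scale; restarts may be needed here.
res = minimize(sketch_cost, rng.normal(size=k * d), method="Nelder-Mead")
print("true centroids:\n", true_c, "\nrecovered:\n", res.x.reshape(k, d))
```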
