16 research outputs found

    Efficient Numerical Integration in Reproducing Kernel Hilbert Spaces via Leverage Scores Sampling

    Full text link
    In this work we consider the problem of numerical integration, i.e., approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We focus on the setting in which the target distribution is only accessible through a set of n i.i.d. observations, and the integrand belongs to a reproducing kernel Hilbert space. We propose an efficient procedure which exploits a small i.i.d. random subset of m < n samples drawn either uniformly or using approximate leverage scores from the initial observations. Our main result is an upper bound on the approximation error of this procedure for both sampling strategies. It yields sufficient conditions on the subsample size to recover the standard (optimal) n^{-1/2} rate while drastically reducing the number of function evaluations, and thus the overall computational cost. Moreover, we obtain rates with respect to the number m of evaluations of the integrand which adapt to its smoothness, and match known optimal rates, for instance for Sobolev spaces. We illustrate our theoretical findings with numerical experiments on real datasets, which highlight the attractive efficiency-accuracy tradeoff of our method compared to existing randomized and greedy quadrature methods. We note that the problem of numerical integration in RKHS amounts to designing a discrete approximation of the kernel mean embedding of the target distribution. As a consequence, direct applications of our results also include the efficient computation of maximum mean discrepancies between distributions and the design of efficient kernel-based tests.
    Comment: 46 pages, 5 figures. Submitted to JML
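    As a purely illustrative sketch of the procedure's flavour, the Python snippet below subsamples m of the n observations uniformly (approximate leverage-score sampling would replace that single line) and solves a small linear system so that the weighted subsample matches the full empirical kernel mean embedding. The Gaussian kernel, regularization constant, and toy integrand are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def quadrature_weights(X, m, reg=1e-8, rng=np.random.default_rng(0)):
    # Uniformly subsample m of the n observations (leverage-score sampling
    # would replace this line) and solve (K_mm + reg I) w = K_mn 1/n, so the
    # weighted subsample matches the full empirical kernel mean embedding.
    n = len(X)
    idx = rng.choice(n, size=m, replace=False)
    K_mm = gaussian_kernel(X[idx], X[idx])
    K_mn = gaussian_kernel(X[idx], X)
    w = np.linalg.solve(K_mm + reg * np.eye(m), K_mn.mean(axis=1))
    return idx, w

# Estimate the integral of f using only m function evaluations.
X = np.random.default_rng(1).normal(size=(5000, 2))   # n i.i.d. observations
f = lambda x: np.sin(x[:, 0]) * np.cos(x[:, 1])       # toy integrand
idx, w = quadrature_weights(X, m=100)
print("quadrature estimate:", w @ f(X[idx]), "| full Monte Carlo:", f(X).mean())
```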

    M²M: A general method to perform various data analysis tasks from a differentially private sketch

    Full text link
    Differential privacy is the standard privacy definition for performing analyses over sensitive data. Yet, its privacy budget bounds the number of tasks an analyst can perform with reasonable accuracy, which makes it challenging to deploy in practice. This can be alleviated by private sketching, where the dataset is compressed into a single noisy sketch vector which can be shared with the analysts and used to perform arbitrarily many analyses. However, the algorithms to perform specific tasks from sketches must be developed on a case-by-case basis, which is a major impediment to their use. In this paper, we introduce the generic moment-to-moment (M²M) method to perform a wide range of data exploration tasks from a single private sketch. Among other things, this method can be used to estimate empirical moments of attributes, the covariance matrix, counting queries (including histograms), and regression models. Our method treats the sketching mechanism as a black-box operation, and can thus be applied to a wide variety of sketches from the literature, widening their range of applications without further engineering or privacy loss, and removing some of the technical barriers to the wider adoption of sketches for data exploration under differential privacy. We validate our method with data exploration tasks on artificial and real-world data, and show that it can be used to reliably estimate statistics and train classification models from private sketches.
    Comment: Published at the 18th International Workshop on Security and Trust Management (STM 2022)
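    The snippet below is a deliberately simplified stand-in for the general idea, not the M²M method itself: it recovers a histogram from one noisy linear sketch by least squares and then answers counting queries from the estimate. The sketch matrix, noise scale, and domain size are all illustrative assumptions (M²M handles more general, possibly nonlinear sketches via moment matching).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a categorical attribute with 20 possible values, held as a histogram.
h = rng.multinomial(10_000, rng.dirichlet(np.ones(20))).astype(float)

# One private linear sketch: random projection plus Gaussian noise. In a real
# deployment the noise scale would be calibrated to the sketch's sensitivity
# and the privacy budget.
Phi = rng.normal(size=(40, 20)) / np.sqrt(40)
s = Phi @ h + rng.normal(scale=5.0, size=40)

# Recover an estimated histogram from the sketch by least squares, then
# answer arbitrarily many counting queries from the estimate.
h_hat, *_ = np.linalg.lstsq(Phi, s, rcond=None)
query = np.zeros(20); query[:5] = 1.0   # "how many items fall in bins 0-4?"
print("true count:", query @ h, "estimated:", query @ h_hat)
```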

    Compressive Learning with Privacy Guarantees

    Get PDF
    This work addresses the problem of learning from large collections of data with privacy guarantees. The compressive learning framework proposes to deal with the large scale of datasets by compressing them into a single vector of generalized random moments, from which the learning task is then performed. We show that a simple perturbation of this mechanism with additive noise is sufficient to satisfy differential privacy, a well-established formalism for defining and quantifying the privacy of a random mechanism. We combine this with a feature subsampling mechanism, which reduces the computational cost without damaging privacy. The framework is applied to the tasks of Gaussian modeling, k-means clustering and principal component analysis (PCA), for which sharp privacy bounds are derived. Empirically, the quality (for subsequent learning) of the compressed representation produced by our mechanism is strongly related to the induced noise level, for which we give analytical expressions.
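    A minimal sketch of the additive-noise idea, assuming a random-Fourier-feature sketch and a crude Laplace-mechanism calibration; the paper derives sharp bounds, whereas the sensitivity used here is only a loose worst-case count.

```python
import numpy as np

def private_sketch(X, d_sketch, epsilon, rng=np.random.default_rng(0)):
    # Random-Fourier-feature sketch: average of bounded generalized moments
    # cos(Omega^T x), sin(Omega^T x) over the dataset.
    n, d = X.shape
    Omega = rng.normal(size=(d, d_sketch))
    Z = np.concatenate([np.cos(X @ Omega), np.sin(X @ Omega)], axis=1)
    sketch = Z.mean(axis=0)
    # Laplace mechanism: replacing one record moves each of the 2*d_sketch
    # coordinates of the mean by at most 2/n, giving a loose L1 sensitivity
    # of 4*d_sketch/n. This is only a crude illustration of the calibration.
    sensitivity = 4 * d_sketch / n
    return sketch + rng.laplace(scale=sensitivity / epsilon, size=2 * d_sketch)

X = np.random.default_rng(1).normal(size=(100_000, 5))
s = private_sketch(X, d_sketch=64, epsilon=1.0)
print(s[:4])  # only the noisy sketch is shared; the raw data never leaves
```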

    Efficient Methods for Compressive Learning with Privacy Guarantees

    No full text
    The topic of this Ph.D. thesis lies on the borderline between signal processing, statistics and computer science. It mainly focuses on compressive learning, a paradigm for large-scale machine learning in which the whole dataset is compressed down to a single vector of randomized generalized moments, called the sketch. An approximate solution of the learning task at hand is then estimated from this sketch, without using the initial data. This framework is by nature suited for learning from distributed collections or data streams, and has already been instantiated with success on several unsupervised learning tasks such as k-means clustering, density fitting using Gaussian mixture models, or principal component analysis. We improve this framework in multiple directions. First, it is shown that perturbing the sketch with additive noise is sufficient to derive (differential) privacy guarantees. Sharp bounds on the noise level required to obtain a given privacy level are provided, and the proposed method is shown empirically to compare favourably with state-of-the-art techniques. Then, the compression scheme is modified to leverage structured random matrices, which reduce the computational cost of the framework and make it possible to learn on high-dimensional data. Lastly, we introduce a new algorithm based on message passing techniques to learn from the sketch for the k-means clustering problem. These contributions open the way for a broader application of the framework.

    Random Projections for Compressive Learning

    Get PDF
    Compressive learning is a framework to drastically compress the volume of large training collections while preserving the information needed to learn. Guided by unsupervised learning examples, we survey the tools involved and the existing theoretical guarantees, both in terms of information preservation and of privacy preservation, and conclude with some open problems.

    Large-Scale High-Dimensional Clustering with Fast Sketching

    Get PDF
    In this paper, we address the problem of high-dimensional k-means clustering in a large-scale setting, i.e. for datasets that comprise a large number of items. Sketching techniques have already been used to deal with this “large-scale” issue, by compressing the whole dataset into a single vector of random nonlinear generalized moments from which the k centroids are then retrieved efficiently. However, this approach usually scales quadratically with the dimension; to cope with high-dimensional datasets, we show how to use fast structured random matrices to compute the sketching operator efficiently. This yields significant speed-ups and memory savings for high-dimensional data, while the clustering results are shown to be much more stable, both on artificial and real datasets.
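    One standard way to build such fast structured projections (used here purely as an illustration; the paper's exact construction may differ) chains random diagonal sign flips with Walsh-Hadamard transforms, replacing an O(d^2) dense matrix-vector product by O(d log d) transforms.

```python
import numpy as np

def fwht(x):
    # Iterative Fast Walsh-Hadamard Transform, O(d log d); len(x) must be a
    # power of 2. Normalized so the transform is orthonormal.
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def structured_projection(x, n_blocks, rng=np.random.default_rng(0)):
    # HD-style structured projection: each block applies a random diagonal
    # sign flip followed by a Hadamard transform, emulating a dense random
    # matrix at O(d log d) cost per block instead of O(d^2).
    return np.concatenate([fwht(rng.choice([-1.0, 1.0], len(x)) * x)
                           for _ in range(n_blocks)])

x = np.random.default_rng(1).normal(size=1024)   # high-dimensional input
y = structured_projection(x, n_blocks=4)         # 4096-dim structured sketch
print(y.shape, np.linalg.norm(y) / np.linalg.norm(x))  # norm grows as sqrt(4)
```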

    Nystr\"om Kernel Mean Embeddings

    Full text link
    Kernel mean embeddings are a powerful tool to represent probability distributions over arbitrary spaces as single points in a Hilbert space. Yet, the cost of computing and storing such embeddings prohibits their direct use in large-scale settings. We propose an efficient approximation procedure based on the Nyström method, which exploits a small random subset of the dataset. Our main result is an upper bound on the approximation error of this procedure. It yields sufficient conditions on the subsample size to obtain the standard n^{-1/2} rate while reducing computational costs. We discuss applications of this result for the approximation of the maximum mean discrepancy and quadrature rules, and illustrate our theoretical findings with numerical experiments.
    Comment: 8 pages
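    To make the MMD application concrete, here is an illustrative Python sketch (the kernel choice, regularization, and landmark count are assumptions) that forms Nyström approximations of two empirical embeddings and uses them to approximate the squared maximum mean discrepancy.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_kme_coeffs(X, m, reg=1e-8, rng=np.random.default_rng(0)):
    # Coefficients alpha such that sum_i alpha_i k(l_i, .) over m random
    # landmarks approximates the empirical kernel mean embedding of X.
    idx = rng.choice(len(X), size=m, replace=False)
    L = X[idx]
    alpha = np.linalg.solve(gaussian_kernel(L, L) + reg * np.eye(m),
                            gaussian_kernel(L, X).mean(axis=1))
    return L, alpha

def approx_mmd2(X, Y, m):
    # Squared MMD between the two Nystrom-approximated embeddings.
    Lx, ax = nystrom_kme_coeffs(X, m, rng=np.random.default_rng(1))
    Ly, ay = nystrom_kme_coeffs(Y, m, rng=np.random.default_rng(2))
    return (ax @ gaussian_kernel(Lx, Lx) @ ax
            + ay @ gaussian_kernel(Ly, Ly) @ ay
            - 2 * ax @ gaussian_kernel(Lx, Ly) @ ay)

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(2000, 2)), rng.normal(loc=0.5, size=(2000, 2))
print("approximate MMD^2:", approx_mmd2(X, Y, m=100))
```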

    Sketched Clustering via Hybrid Approximate Message Passing

    Get PDF
    International audienceIn sketched clustering, a dataset of T samples is first sketched down to a vector of modest size, from which the centroids are subsequently extracted. Its advantages include 1) reduced storage complexity and 2) centroid extraction complexity independent of T. For the sketching methodology recently proposed by Keriven et al., which can be interpreted as a random sampling of the empirical characteristic function, we propose a sketched clustering algorithm based on approximate message passing. Numerical experiments suggest that our approach is more efficient than the state-of-the-art sketched clustering algorithm “CL-OMPR” (in both computational and sample complexity) and more efficient than k-means++ when T is large
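    For intuition, the toy Python sketch below computes such a characteristic-function sketch in a single pass over the data and then decodes centroids with a naive general-purpose optimizer. This is emphatically not the hybrid AMP algorithm of the paper; the frequency distribution, sketch size, and decoder are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Data: T points drawn from 3 clusters in 2-D.
T, k, d = 20_000, 3, 2
true_c = rng.normal(scale=4.0, size=(k, d))
X = true_c[rng.integers(k, size=T)] + rng.normal(size=(T, d))

# Sketch: random samples of the empirical characteristic function,
# z_j = (1/T) sum_t exp(i w_j^T x_t). This is the only pass over the data.
Omega = rng.normal(size=(d, 50))
z = np.exp(1j * X @ Omega).mean(axis=0)

def sketch_cost(c_flat):
    # Mismatch between the data sketch and the sketch of a uniform mixture
    # of Diracs placed at the candidate centroids.
    C = c_flat.reshape(k, d)
    z_model = np.exp(1j * C @ Omega).mean(axis=0)
    return np.sum(np.abs(z_model - z) ** 2)

# Naive decoder: derivative-free minimization from a random start. A real
# decoder (CL-OMPR, or the AMP approach above) is far more robust to
# initialization and scale; restarts may be needed here.
res = minimize(sketch_cost, rng.normal(size=k * d), method="Nelder-Mead")
print("true centroids:\n", true_c, "\nrecovered:\n", res.x.reshape(k, d))
```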
