A One-Pass Private Sketch for Most Machine Learning Tasks
Differential privacy (DP) is a compelling privacy definition that explains
the privacy-utility tradeoff via formal, provable guarantees. Inspired by
recent progress toward general-purpose data release algorithms, we propose a
private sketch, or small summary of the dataset, that supports a multitude of
machine learning tasks including regression, classification, density
estimation, near-neighbor search, and more. Our sketch consists of randomized
contingency tables that are indexed with locality-sensitive hashing and
constructed with an efficient one-pass algorithm. We prove competitive error
bounds for DP kernel density estimation. Existing methods for DP kernel density
estimation scale poorly, often slowing exponentially as the dimension grows. In
contrast, our sketch runs quickly on large, high-dimensional
datasets in a single pass. Exhaustive experiments show that our generic sketch
delivers a similar privacy-utility tradeoff when compared to existing DP
methods at a fraction of the computation cost. We expect that our sketch will
enable differential privacy in distributed, large-scale machine learning
settings.
Comment: 10 pages, 4 figures
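The core idea above can be illustrated in a few lines. The following is a minimal, hypothetical sketch of the construction as described (LSH-indexed count tables built in one pass, with Laplace noise for differential privacy); all parameter names and the choice of SimHash are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def build_private_sketch(X, n_tables=5, n_bits=8, epsilon=1.0, seed=0):
    """One-pass DP sketch: SimHash-indexed count tables plus Laplace noise.

    Illustrative only. Each record increments one bucket in each of the
    n_tables tables, so the L1 sensitivity is n_tables; Laplace noise with
    scale n_tables / epsilon then yields epsilon-DP for the whole sketch.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    planes = rng.normal(size=(n_tables, n_bits, d))  # SimHash hyperplanes
    tables = np.zeros((n_tables, 2 ** n_bits))
    powers = 2 ** np.arange(n_bits)
    for x in X:  # single pass over the dataset
        for t in range(n_tables):
            bits = (planes[t] @ x > 0).astype(int)
            tables[t, bits @ powers] += 1
    tables += rng.laplace(scale=n_tables / epsilon, size=tables.shape)
    return planes, tables, powers

def kde_estimate(q, planes, tables, powers, n):
    """Estimate the SimHash collision-kernel density at query q."""
    n_tables = tables.shape[0]
    total = sum(
        tables[t, ((planes[t] @ q > 0).astype(int) @ powers)]
        for t in range(n_tables)
    )
    return total / (n_tables * n)
```

Averaging a noisy bucket count per table gives an estimate of the average collision probability between the query and the dataset, i.e., an LSH collision kernel; this is one way such a sketch can support kernel density estimation from a small summary.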
STORM: Foundations of End-to-End Empirical Risk Minimization on the Edge
Empirical risk minimization is perhaps the most influential idea in
statistical learning, with applications to nearly all scientific and technical
domains in the form of regression and classification models. To analyze massive
streaming datasets in distributed computing environments, practitioners
increasingly prefer to deploy regression models at the edge rather than in the
cloud. By keeping data on edge devices, we minimize the energy, communication,
and data-security risks associated with the model. Although it is equally
advantageous to train models at the edge, a common assumption is that the model
was originally trained in the cloud, since training typically requires
substantial computation and memory. To this end, we propose STORM, an online
sketch for empirical risk minimization. STORM compresses a data stream into a
tiny array of integer counters. This sketch is sufficient to estimate a variety
of surrogate losses over the original dataset. We provide rigorous theoretical
analysis and show that STORM can estimate a carefully chosen surrogate loss for
the least-squares objective. In an exhaustive experimental comparison for
linear regression models on real-world datasets, we find that STORM allows
accurate regression models to be trained.
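To make the counter-array idea concrete, here is a toy RACE-style sketch under stated assumptions: stream elements (x, y) are hashed as augmented points z = [x, y] into integer counters via SimHash, and a query derived from model weights w reads back collision counts. The class name, parameters, and the collision score are illustrative; STORM's actual surrogate-loss construction for least squares differs in detail.

```python
import numpy as np

class RaceSketch:
    """Toy array-of-counters sketch indexed by SimHash buckets.

    Illustrative only: shows how a stream compresses into integer
    counters that can later be queried for collision statistics.
    """

    def __init__(self, d, n_rows=20, n_bits=4, seed=0):
        rng = np.random.default_rng(seed)
        # Hash the augmented point z = [x, y], hence dimension d + 1.
        self.planes = rng.normal(size=(n_rows, n_bits, d + 1))
        self.counts = np.zeros((n_rows, 2 ** n_bits), dtype=np.int64)
        self.powers = 2 ** np.arange(n_bits)
        self.n = 0

    def _bucket(self, row, z):
        bits = (self.planes[row] @ z > 0).astype(int)
        return bits @ self.powers

    def add(self, x, y):
        # One streaming update: a single integer increment per row.
        z = np.append(x, y)
        for r in range(len(self.planes)):
            self.counts[r, self._bucket(r, z)] += 1
        self.n += 1

    def collision_score(self, w):
        # Average collision count between stored points and the query
        # q = [w, -1]: an estimate of the SimHash collision kernel, the
        # kind of quantity STORM-style surrogate losses are built from.
        q = np.append(w, -1.0)
        total = sum(
            self.counts[r, self._bucket(r, q)]
            for r in range(len(self.planes))
        )
        return total / (len(self.planes) * self.n)
```

Note the memory footprint: the stream itself is never stored, only `n_rows * 2**n_bits` integer counters, which is what makes on-device training plausible for edge hardware.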