Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data.
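Count featurization itself is easy to picture: a high-cardinality categorical value is replaced by compact per-label counts, so a model can be trained from small count tables rather than the raw data store. Below is a minimal sketch of that idea; the `CountFeaturizer` class and its methods are illustrative inventions, not Pyramid's actual interface.

```python
import numpy as np
from collections import defaultdict

# Minimal sketch of count featurization (hypothetical helper, not
# Pyramid's API): each categorical value maps to per-label counts
# accumulated so far; models train on these compact vectors instead
# of the raw observations.
class CountFeaturizer:
    def __init__(self, num_labels):
        self.num_labels = num_labels
        # counts[value] -> vector of per-label observation counts
        self.counts = defaultdict(lambda: np.zeros(num_labels))

    def update(self, value, label):
        """Fold one observation into the count tables."""
        self.counts[value][label] += 1

    def featurize(self, value):
        """Map a raw value to (counts, empirical P(label | value))."""
        c = self.counts[value]
        total = c.sum()
        probs = c / total if total > 0 else np.full(self.num_labels, 1.0 / self.num_labels)
        return np.concatenate([c, probs])

# Usage: stream observations, then train only on featurized vectors.
cf = CountFeaturizer(num_labels=2)
cf.update("ad_42", label=1)
cf.update("ad_42", label=0)
print(cf.featurize("ad_42"))  # counts [1, 1], probabilities [0.5, 0.5]
```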
Improved differential privacy for SGD via optimal private linear operators on adaptive streams
CCF-1763786 - National Science Foundation; Apple, Inc. https://arxiv.org/abs/2202.0831
Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning
We introduce new differentially private (DP) mechanisms for gradient-based
machine learning (ML) with multiple passes (epochs) over a dataset,
substantially improving the achievable privacy-utility-computation tradeoffs.
We formalize the problem of DP mechanisms for adaptive streams with multiple
participations and introduce a non-trivial extension of online matrix
factorization DP mechanisms to our setting. This includes establishing the
necessary theory for sensitivity calculations and efficient computation of
optimal matrices. For some applications, such as training runs with many SGD steps, applying
these optimal techniques becomes computationally expensive. We thus design an
efficient Fourier-transform-based mechanism with only a minor utility loss.
Extensive empirical evaluation on both example-level DP for image
classification and user-level DP for language modeling demonstrates substantial
improvements over all previous methods, including the widely used DP-SGD.
Though our primary application is to ML, our main DP results are applicable to
arbitrary linear queries and hence may have much broader applicability.
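For intuition, the general matrix-factorization mechanism for a linear workload A releases B(Cx + z) under a factorization A = BC, with Gaussian noise z calibrated to the sensitivity of C. The sketch below is a simplified single-participation illustration, not the paper's optimized multi-epoch construction; the trivial factorization shown (B = A, C = I) merely recovers DP-SGD-style input noise.

```python
import numpy as np

# Sketch of a matrix-factorization DP mechanism for linear queries
# (illustrative only; the paper's contribution is optimizing B and C
# under multiple participations). To release A @ x privately, factor
# A = B @ C and output B @ (C @ x + z): the noise z is added in C's
# output space and shaped by B.
def mf_mechanism(A, x, B, C, noise_multiplier, rng):
    # Single-participation sensitivity of C: max column norm,
    # assuming each contribution x_i is clipped to norm at most 1.
    sens = np.linalg.norm(C, axis=0).max()
    z = rng.normal(scale=noise_multiplier * sens, size=C.shape[0])
    return B @ (C @ x + z)

rng = np.random.default_rng(0)
A = np.tril(np.ones((4, 4)))         # prefix-sum workload, as in SGD
x = np.array([0.5, -0.2, 0.9, 0.1])  # per-step (clipped) gradients
# Trivial factorization B = A, C = I gives DP-SGD-style noise;
# optimizing B and C instead is what reduces the total error.
print(mf_mechanism(A, x, B=A, C=np.eye(4), noise_multiplier=1.0, rng=rng))
```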
(Amplified) Banded Matrix Factorization: A unified approach to private training
Matrix factorization (MF) mechanisms for differential privacy (DP) have
substantially improved the state-of-the-art in privacy-utility-computation
tradeoffs for ML applications in a variety of scenarios, but in both the
centralized and federated settings there remain instances where either MF
cannot be easily applied, or other algorithms provide better tradeoffs
(typically, as ε becomes small). In this work, we show how MF can
subsume prior state-of-the-art algorithms in both federated and centralized
training settings, across all privacy budgets. The key technique throughout is
the construction of MF mechanisms with banded matrices (lower-triangular
matrices with at most b̂ nonzero bands, including the main diagonal). For
cross-device federated learning (FL), this enables multiple participations with
a relaxed device participation schema compatible with practical FL
infrastructure (as demonstrated by a production deployment). In the centralized
setting, we prove that banded matrices enjoy the same privacy amplification
results as the ubiquitous DP-SGD algorithm, but can provide strictly better
performance in most scenarios -- this lets us always at least match DP-SGD, and
often outperform it.
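The banded structure is simple to state: zero out everything in a lower-triangular matrix except the main diagonal and the b̂ - 1 bands below it. The sketch below uses hypothetical helper names, not the paper's optimized matrices; it shows the masking and the b̂ = 1 degenerate case, which collapses to independent per-step noise as in DP-SGD.

```python
import numpy as np

# Sketch of the "banded" structure (illustration only): keep the main
# diagonal and the b_hat - 1 bands below it of a lower-triangular
# matrix, zeroing the rest.
def band(C, b_hat):
    n = C.shape[0]
    i, j = np.indices((n, n))
    mask = (j <= i) & (i - j < b_hat)  # lower-triangular, <= b_hat bands
    return np.where(mask, C, 0.0)

n = 6
# b_hat = 1 keeps only the diagonal: independent per-step noise,
# i.e. the DP-SGD regime. Larger b_hat correlates noise across up
# to b_hat consecutive steps.
print(band(np.tril(np.ones((n, n))), 1))  # identity: DP-SGD case
print(band(np.tril(np.ones((n, n))), 2))  # two nonzero bands
```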
Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy
We study gradient descent under linearly correlated noise. Our work is
motivated by recent practical methods for optimization with differential
privacy (DP), such as DP-FTRL, which achieve strong performance in settings
where privacy amplification techniques are infeasible (such as in federated
learning). These methods inject privacy noise through a matrix factorization
mechanism, making the noise linearly correlated over iterations. We propose a
simplified setting that distills key facets of these methods and isolates the
impact of linearly correlated noise. We analyze the behavior of gradient
descent in this setting, for both convex and non-convex functions. Our analysis
is demonstrably tighter than prior work and recovers multiple important special
cases exactly (including anticorrelated perturbed gradient descent). We use our
results to develop new, effective matrix factorizations for differentially
private optimization, and highlight the benefits of these factorizations
theoretically and empirically.
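Concretely, such methods can be modeled as gradient descent where the injected noise at step t is a fixed linear combination w = Bz of i.i.d. Gaussians. The sketch below is an illustrative setup under that assumption, not the paper's experiments: B = I gives independent DP-SGD-style noise, while a first-difference B gives the anticorrelated special case w_t = z_t - z_{t-1}.

```python
import numpy as np

# Sketch of gradient descent with linearly correlated noise
# (illustrative setup). The noise injected across all steps is
# w = B @ z for i.i.d. Gaussian z, so a structured B correlates
# the perturbations over iterations.
def noisy_gd(grad_fn, x0, lr, B, rng, steps):
    z = rng.normal(size=(steps, x0.size))  # i.i.d. source noise
    w = B @ z                              # linearly correlated noise
    x = x0.copy()
    for t in range(steps):
        x = x - lr * (grad_fn(x) + w[t])
    return x

steps = 100
rng = np.random.default_rng(0)
quad_grad = lambda x: 2.0 * x  # gradient of the quadratic ||x||^2
# Anticorrelated noise: first-difference matrix, w_t = z_t - z_{t-1}.
B = np.eye(steps) - np.eye(steps, k=-1)
print(noisy_gd(quad_grad, np.ones(3), lr=0.1, B=B, rng=rng, steps=steps))
```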
Prochlo: Strong Privacy for Analytics in the Crowd
The large-scale monitoring of computer users' software activities has become
commonplace, e.g., for application telemetry, error reporting, or demographic
profiling. This paper describes a principled systems architecture---Encode,
Shuffle, Analyze (ESA)---for performing such monitoring with high utility while
also protecting user privacy. The ESA design, and its Prochlo implementation,
are informed by our practical experiences with an existing, large deployment of
privacy-preserving software monitoring.
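At a high level, ESA splits trust across three stages: encoders on user devices strip identifiers from reports, a shuffler batches and randomly permutes reports from many users to break linkability, and an analyzer aggregates only the anonymous shuffled batch. The sketch below is conceptual only; Prochlo's actual implementation adds further protections such as crowd-based thresholding, nested encryption, and hardware-backed shuffling.

```python
import random

# Conceptual sketch of the Encode-Shuffle-Analyze (ESA) pipeline
# (hypothetical function names, not Prochlo's implementation).
def encode(user_id, record):
    # Encoder: strip direct identifiers, keep only the coarse report.
    return {"report": record}

def shuffle(reports, rng):
    # Shuffler: batch reports from many users and permute them,
    # breaking order and timing links to individual users.
    shuffled = list(reports)
    rng.shuffle(shuffled)
    return shuffled

def analyze(reports):
    # Analyzer: sees only an anonymous shuffled batch; aggregates it.
    counts = {}
    for r in reports:
        counts[r["report"]] = counts.get(r["report"], 0) + 1
    return counts

rng = random.Random(0)
raw = [("user%d" % i, "app_crash" if i % 3 else "app_ok") for i in range(9)]
batch = shuffle([encode(u, rec) for u, rec in raw], rng)
print(analyze(batch))  # {'app_crash': 6, 'app_ok': 3}
```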