How to Make Your Approximation Algorithm Private: A Black-Box Differentially-Private Transformation for Tunable Approximation Algorithms of Functions with Low Sensitivity
We develop a framework for efficiently transforming certain approximation
algorithms into differentially-private variants, in a black-box manner. Our
results focus on algorithms A that output a multiplicative (1 ± a)-approximation
to a function f, where 0 <= a < 1 is a parameter
that can be "tuned" to small-enough values while incurring only a polynomial blowup
in the running time/space. We show that such algorithms can be made DP without
sacrificing accuracy, as long as the function f has small global sensitivity.
We achieve these results by applying the smooth sensitivity framework developed
by Nissim, Raskhodnikova, and Smith (STOC 2007).
Our framework naturally applies to transform non-private FPRAS (resp. FPTAS)
algorithms into (ε, δ)-DP (resp. ε-DP) approximation
algorithms. We apply our framework in the context of sublinear-time and
sublinear-space algorithms, while preserving the nature of the algorithm in
meaningful ranges of the parameters. Our results include the first (to the best
of our knowledge) (ε, δ)-edge-DP sublinear-time algorithms for
estimating the number of triangles, the number of connected components, and the
weight of a MST of a graph, as well as a more efficient algorithm (while
sacrificing pure DP in contrast to previous results) for estimating the average
degree of a graph. In the area of streaming algorithms, our results include
(ε, δ)-DP algorithms for estimating L_p-norms, distinct elements,
and weighted MST for both insertion-only and turnstile streams. Our
transformation also provides a private version of the smooth histogram
framework, which is commonly used for converting streaming algorithms into
sliding-window variants, and achieves a multiplicative approximation to many
problems, such as estimating L_p-norms, distinct elements, and the length of
the longest increasing subsequence.
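The core recipe the abstract describes — run a tunable approximation algorithm, then add noise calibrated to the sensitivity of the underlying function — can be illustrated with a minimal sketch. This is not the paper's exact transformation (which uses the smooth sensitivity framework); it is a simplified Laplace-mechanism version in which `approx_alg`, `global_sens`, and the noise calibration are illustrative assumptions:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    # guard against the (measure-zero) endpoint where the log argument is 0
    return -scale * math.copysign(math.log(max(1.0 - 2.0 * abs(u), 1e-300)), u)

def private_approx(approx_alg, data, eps, global_sens, alpha):
    """Hedged sketch: run a tunable (1 +/- alpha)-approximation of f and add
    Laplace noise. The scale crudely accounts for both the global sensitivity
    of f and the extra slack the alpha-approximation introduces on
    neighboring inputs; the real framework calibrates this more carefully."""
    est = approx_alg(data, alpha)
    scale = (global_sens + 2.0 * alpha * abs(est)) / eps
    return est + laplace_noise(scale)
```

The point of the black-box view is that `approx_alg` is only invoked, never inspected: tuning `alpha` small keeps the approximation's contribution to the effective sensitivity negligible.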
Differentially Private Continual Releases of Streaming Frequency Moment Estimations
The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible.
Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental F_p (p ∈ [0, +∞)) frequency moment estimation problem under this setting, and give an ε-DP algorithm that achieves a (1+η)-relative approximation (η ∈ (0,1)) with polylog(Tn) additive error and uses polylog(Tn) · max(1, n^{1-2/p}) space, where T is the length of the stream and n is the size of the universe of elements. Our space is near optimal up to polylogarithmic factors even in the non-private setting.
To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters, and counting low-frequency elements, to simpler counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level-set estimation approach to address the F_p frequency moment estimation problem.
We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past W data items.
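The counting/summing primitive that the reductions above target is classically solved by the binary-tree mechanism for ε-DP continual counting. The sketch below is that standard primitive, not this paper's construction: each dyadic interval gets its own cached Laplace noise, so every released prefix count is a sum of at most log T noisy nodes.

```python
import math
import random

def laplace(scale):
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(max(1.0 - 2.0 * abs(u), 1e-300)), u)

class TreeCounter:
    """Sketch of the classic binary-tree mechanism for eps-DP continual
    counting over a stream of at most T increments. Each item touches one
    node per level, so noise of scale (levels / eps) per node suffices."""
    def __init__(self, T, eps):
        self.levels = max(T, 1).bit_length()
        self.scale = self.levels / eps
        self.t = 0
        self.node_sums = {}   # (level, index) -> true partial sum
        self.node_noise = {}  # (level, index) -> cached Laplace noise

    def update(self, x):
        """Add increment x at the next timestamp; return the private prefix count."""
        self.t += 1
        for lvl in range(self.levels):
            key = (lvl, (self.t - 1) >> lvl)
            self.node_sums[key] = self.node_sums.get(key, 0) + x
        # decompose [1, t] into O(log T) dyadic blocks and sum their noisy values
        total, t = 0.0, self.t
        while t > 0:
            lvl = (t & -t).bit_length() - 1   # largest dyadic block ending at t
            key = (lvl, (t - 1) >> lvl)
            if key not in self.node_noise:
                self.node_noise[key] = laplace(self.scale)
            total += self.node_sums[key] + self.node_noise[key]
            t -= 1 << lvl
        return total
```

Caching the per-node noise is what makes the union of all T outputs private: each node's noisy value is released once and reused, so the whole transcript is a post-processing of O(T) noisy partial sums.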
Frequency Estimation Under Multiparty Differential Privacy: One-shot and Streaming
We study the fundamental problem of frequency estimation under both privacy
and communication constraints, where the data is distributed among parties.
We consider two application scenarios: (1) one-shot, where the data is static
and the aggregator conducts a one-time computation; and (2) streaming, where
each party receives a stream of items over time and the aggregator continuously
monitors the frequencies. We adopt the model of multiparty differential privacy
(MDP), which is more general than local differential privacy (LDP) and
(centralized) differential privacy. Our protocols achieve optimality (up to
logarithmic factors) permissible by the more stringent of the two constraints.
In particular, when specialized to the ε-LDP model, our protocol
matches the optimal error bound while using few bits of communication
and of public randomness relative to the size of the domain.
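For intuition about the LDP special case, the textbook baseline is generalized randomized response (k-RR): each party perturbs its own item before sending it, and the aggregator debiases the counts. This sketch is the standard baseline, not this paper's communication-optimal protocol:

```python
import math
import random

def krr_report(value, domain_size, eps):
    """eps-LDP generalized randomized response: report the true value with
    probability e^eps / (e^eps + k - 1), otherwise a uniform other value."""
    p = math.exp(eps) / (math.exp(eps) + domain_size - 1)
    if random.random() < p:
        return value
    other = random.randrange(domain_size - 1)
    return other if other < value else other + 1  # skip the true value

def krr_estimate(reports, domain_size, eps):
    """Unbiased frequency estimates from k-RR reports."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + domain_size - 1)
    q = (1 - p) / (domain_size - 1)
    counts = [0] * domain_size
    for r in reports:
        counts[r] += 1
    return [(c - n * q) / (p - q) for c in counts]
```

Each report costs log2(k) bits of communication, which is exactly the kind of overhead the MDP protocols in the paper aim to reduce while keeping near-optimal error.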
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data.
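The training-set-minimization idea Pyramid builds on — count featurization — replaces a high-cardinality categorical feature with compact per-class count statistics, so models can be retrained from small count tables instead of the raw store. A minimal sketch of the technique (the class and method names are illustrative, not Pyramid's API):

```python
from collections import defaultdict

class CountFeaturizer:
    """Sketch of count featurization: accumulate per-class occurrence
    counts for each categorical value, then featurize a value as its
    counts plus smoothed per-class rates."""
    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.tables = defaultdict(lambda: [0] * num_classes)

    def observe(self, feature_value, label):
        """Fold one labeled example into the count table."""
        self.tables[feature_value][label] += 1

    def featurize(self, feature_value):
        """Return [count_0, ..., count_{k-1}, rate_0, ..., rate_{k-1}]."""
        counts = self.tables[feature_value]
        total = sum(counts) or 1
        return counts + [c / total for c in counts]
```

Because only the count tables are needed at training time, the raw examples can be evicted to protected storage — the exposure-limiting property the abstract describes. (A production system would also add noise to the tables; that is omitted here.)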
Private Data Stream Analysis for Universal Symmetric Norm Estimation
We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include L_p norms, k-support norms, top-k norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude and releases approximate frequencies for the "heavy" coordinates in important levels and releases approximate level sizes for the "light" coordinates in important levels. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits (1+α)-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.
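The level-set partition the abstract describes can be shown in a toy, non-private form: bucket the coordinates of the frequency vector by magnitude and keep only the level sizes, from which norms like L_p can be recovered up to a per-coordinate factor. This sketch is an illustration of the idea, not the paper's mechanism:

```python
import math

def level_set_summary(freq):
    """Bucket positive coordinates by magnitude: level i holds values in
    [2^i, 2^{i+1}); keep only the number of coordinates per level."""
    levels = {}
    for v in freq:
        if v > 0:
            i = math.floor(math.log2(v))
            levels[i] = levels.get(i, 0) + 1
    return levels

def approx_lp_from_levels(levels, p):
    """Each of the n_i coordinates in level i contributes roughly (2^i)^p,
    so the summary determines the L_p norm up to a factor of 2 per term."""
    return sum(n * (2.0 ** i) ** p for i, n in levels.items()) ** (1.0 / p)
```

Since the summary is norm-agnostic, any number of symmetric norms can be evaluated from it after release — which is why, in the private version, releasing many norm approximations costs no additional privacy.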
New Algorithms for Large Datasets and Distributions
In this dissertation, we make progress on certain algorithmic problems broadly over two computational models: the streaming model for large datasets and the distribution testing model for large probability distributions.
First, we consider the streaming model, where a large sequence of data items arrives one by one. The computer needs to make one pass over this sequence, processing every item quickly, in a limited space. In Chapter 2, motivated by a bioinformatics application, we consider the problem of estimating the number of low-frequency items in a stream, which has received only limited theoretical attention so far. We give an efficient streaming algorithm for this problem and show that its complexity is almost optimal.
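A natural baseline for the low-frequency estimation problem in Chapter 2 is to subsample the item universe and count, within the sample, how many items stay at or below the frequency threshold. The sketch below is this illustrative baseline (it stores per-item state, so it is not the dissertation's space-efficient algorithm):

```python
import random
from collections import Counter

def estimate_rare_items(stream, max_freq, sample_rate, seed=0):
    """Estimate the number of distinct items whose total frequency in the
    stream is at most max_freq, by sampling items (not stream positions)
    at rate sample_rate and scaling up the in-sample count."""
    rng = random.Random(seed)
    keep = {}              # item -> sampled? (decided once per item)
    counts = Counter()     # exact counts for sampled items only
    for x in stream:
        if x not in keep:
            keep[x] = rng.random() < sample_rate
        if keep[x]:
            counts[x] += 1
    rare_in_sample = sum(1 for c in counts.values() if c <= max_freq)
    return rare_in_sample / sample_rate
```

Sampling items rather than stream positions is what keeps the estimator unbiased here: a sampled item's count is exact, so the threshold test never misclassifies it.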
In Chapter 3 we consider a distributed variation of the streaming model, where each item of the data sequence arrives arbitrarily to one among a set of computers, who together need to compute certain functions over the entire stream. In such scenarios combining the data at a computer is infeasible due to large communication overhead. We give the first algorithm for k-median clustering in this model. Moreover, we give new algorithms for frequency moments and clustering functions in the distributed sliding window model, where the computation is limited to the most recent W items, as the items arrive in the stream.
In Chapter 5, we study the identity testing problem: given two distributions P (unknown, accessed only through samples) and Q (known) over a common sample space of exponential
size, we need to distinguish P = Q (output 'yes') from P being far from Q (output 'no'). In general, this problem requires an exponential number of samples. To circumvent this lower bound, the problem was recently studied under certain structural assumptions. In particular, optimally efficient testers were given assuming P and Q are product distributions. For such product distributions, we give the first tolerant testers, which output 'yes' not only when P = Q but also when P is close to Q. Likewise, we study the tolerant closeness testing problem for such product distributions, where Q too is accessed only through samples.
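The tolerant accept/reject structure can be shown on tiny explicit distributions. This toy sketch computes the total variation distance directly and applies the two thresholds; the dissertation's testers instead work from samples of exponentially large product distributions, which is the hard part this toy omits:

```python
def tv_distance(p, q):
    """Total variation distance between two explicit distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tolerant_test(p_est, q, close_eps, far_eps):
    """Toy tolerant identity test: accept when the (estimated) distribution
    is within close_eps of q in TV distance, reject when at least far_eps
    away; the gap in between is where sample-based testers need their
    accuracy guarantees."""
    d = tv_distance(p_est, q)
    if d <= close_eps:
        return "yes"
    if d >= far_eps:
        return "no"
    return "undetermined"
```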
Adviser: Vinodchandran N. Variya