51 research outputs found
Finding heavy hitters from lossy or noisy data
Abstract. Motivated by Dvir et al. and Wigderson and Yehudayoff [3
Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution
The degree distribution is one of the most fundamental graph properties of
interest for real-world graphs. It has been widely observed in numerous domains
that graphs typically have a tailed or scale-free degree distribution. While
the average degree is usually quite small, the variance is quite high and there
are vertices with degrees at all scales. We focus on the problem of
approximating the degree distribution of a large streaming graph, with small
storage. We design an algorithm headtail, whose main novelty is a new estimator
of infrequent degrees using truncated geometric random variables. We give a
mathematical analysis of headtail and show that it has excellent behavior in
practice. We can process streams with millions of edges using less than 1%
storage and obtain extremely accurate approximations at all scales of the
degree distribution.
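The abstract does not spell out the estimator itself. As a generic small-space illustration of the problem setting (a hypothetical sketch, not the actual headtail algorithm, whose novelty is the truncated-geometric estimator), one can track a random sample of vertices in the stream and rescale their observed degrees:

```python
import random
from collections import defaultdict

def approx_degree_distribution(edge_stream, sample_prob=0.01, seed=0):
    """Estimate a degree histogram from an edge stream by tracking only a
    random sample of vertices.  A naive baseline: vertices sampled
    mid-stream undercount their earlier edges, a bias that a careful
    estimator (like headtail's) must correct."""
    rng = random.Random(seed)
    sampled = {}  # vertex -> degree observed since the vertex was sampled
    for u, v in edge_stream:
        for x in (u, v):
            if x in sampled:
                sampled[x] += 1
            elif rng.random() < sample_prob:
                sampled[x] = 1
    hist = defaultdict(int)
    for d in sampled.values():
        hist[d] += 1
    # scale counts up by the inverse sampling rate
    return {d: c / sample_prob for d, c in hist.items()}
```

With `sample_prob=1.0` every vertex is tracked and the histogram is exact; the interesting regime is a sampling rate around 1%, matching the storage budget quoted above.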
We also introduce a new notion of Relative Hausdorff distance between tailed
histograms. Existing notions of distances between distributions are not
suitable, since they ignore infrequent degrees in the tail. The Relative
Hausdorff distance measures deviations at all scales, and is a more suitable
distance for comparing degree distributions. By tracking this new measure, we
are able to give strong empirical evidence of the convergence of headtail.
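The Relative Hausdorff idea can be made concrete. Assuming F and G are complementary cumulative degree histograms (degree → number of vertices of at least that degree), and taking the definition used in the follow-up literature on RH distance (the exact form here is our assumption; the paper tracks this measure in a streaming fashion rather than by bisection):

```python
def rh_feasible(F, G, delta):
    """One-sided Relative Hausdorff check: every degree d in F must have a
    witness d' with |d - d'| <= delta*d and |F[d] - G[d']| <= delta*F[d].
    Relative slack in BOTH axes is what lets the tail matter."""
    for d, fd in F.items():
        lo, hi = d * (1 - delta), d * (1 + delta)
        if not any(lo <= dp <= hi and abs(fd - G[dp]) <= delta * fd for dp in G):
            return False
    return True

def rh_distance(F, G, tol=1e-3):
    """Symmetric RH distance via bisection on delta."""
    lo, hi = 0.0, 1.0
    while not (rh_feasible(F, G, hi) and rh_feasible(G, F, hi)):
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rh_feasible(F, G, mid) and rh_feasible(G, F, mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Because both tolerances are relative, a deviation at a degree-1000 vertex counts the same as one at a degree-2 vertex, which is exactly the property absolute distances between distributions lack.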
A Polynomial Time Algorithm for Lossy Population Recovery
We give a polynomial time algorithm for the lossy population recovery
problem. In this problem, the goal is to approximately learn an unknown
distribution on binary strings of length $n$ from lossy samples: for some
parameter $\mu$, each coordinate of the sample is preserved with probability
$\mu$ and otherwise is replaced by a `?'. The running time and number of
samples needed for our algorithm are polynomial in $n$ and $1/\varepsilon$ for
each fixed $\mu > 0$. This improves on the algorithm of Wigderson and
Yehudayoff that runs in quasi-polynomial time for any $\mu > 0$ and the
polynomial time algorithm of Dvir et al., which was shown to work for
$\mu \gtrsim 0.30$ by Batman et al. In fact, our algorithm also works in the
more general framework of Batman et al., in which there is no a priori bound on
the size of the support of the distribution. The algorithm we analyze is
implicit in previous work; our main contribution is to analyze it by showing
(via linear programming duality and connections to complex analysis) that a
certain matrix associated with the problem has a robust local inverse even
though its condition number is exponentially small. A corollary of our result
is the first polynomial time algorithm for learning DNFs in the restriction
access model of Dvir et al.
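The lossy sampling model is easy to state in code; a minimal sketch of the erasure channel (helper name is ours):

```python
import random

def lossy_sample(x, mu, rng=random):
    """Pass a binary string through the lossy channel: each coordinate
    survives independently with probability mu and is otherwise replaced
    by '?'."""
    return ''.join(b if rng.random() < mu else '?' for b in x)
```

From such samples, single-coordinate marginals are easy to estimate unbiasedly (divide observed counts by mu); recovering the full distribution over strings is the hard inverse problem the paper solves.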
Noisy population recovery in polynomial time
In the noisy population recovery problem of Dvir et al., the goal is to learn
an unknown distribution on binary strings of length $n$ from noisy samples.
For some parameter $\mu > 0$, a noisy sample is generated by flipping each
coordinate of a sample from the distribution independently with probability
$(1-\mu)/2$. We assume an upper bound $k$ on the size of the support of the
distribution, and the goal is to estimate the probability of any string to
within some given error $\varepsilon$. It is known that the algorithmic
complexity and sample complexity of this problem are polynomially related to
each other.
We show that for $\mu > 0$, the sample complexity (and hence the algorithmic
complexity) is bounded by a polynomial in $k$, $n$ and $1/\varepsilon$,
improving upon the previous best result of $k^{O(\log\log k)}$ due to Lovett
and Zhang.
Our proof combines ideas from Lovett and Zhang with a \emph{noise attenuated}
version of M\"{o}bius inversion. In turn, the latter crucially uses the
construction of a \emph{robust local inverse} due to Moitra and Saks.
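The noisy channel differs from the lossy one above in that coordinates are flipped rather than erased. Assuming the flip probability $(1-\mu)/2$ standard in this line of work (so $\mu = 1$ leaves samples intact), a sketch:

```python
import random

def noisy_sample(x, mu, rng=random):
    """Flip each coordinate of a binary string independently with
    probability (1 - mu)/2; mu parametrizes the channel so that mu = 1
    means no noise."""
    p = (1.0 - mu) / 2.0
    return ''.join(('1' if b == '0' else '0') if rng.random() < p else b
                   for b in x)
```
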
Private Data Stream Analysis for Universal Symmetric Norm Estimation
We study how to release summary statistics on a data stream subject to the
constraint of differential privacy. In particular, we focus on releasing the
family of symmetric norms, which are invariant under sign-flips and
coordinate-wise permutations on an input data stream and include $L_p$ norms,
$k$-support norms, top-$k$ norms, and the box norm as special cases. Although
it may be possible to design and analyze a separate mechanism for each
symmetric norm, we propose a general parametrizable framework that
differentially privately releases a number of sufficient statistics from which
the approximation of all symmetric norms can be simultaneously computed. Our
framework partitions the coordinates of the underlying frequency vector into
different levels based on their magnitude and releases approximate frequencies
for the "heavy" coordinates in important levels and releases approximate level
sizes for the "light" coordinates in important levels. Surprisingly, our
mechanism allows for the release of an arbitrary number of symmetric norm
approximations without any overhead or additional loss in privacy. Moreover,
our mechanism permits a $(1+\varepsilon)$-approximation to each of the
symmetric norms and can be implemented using sublinear space in the streaming
model for many regimes of the accuracy and privacy parameters.
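As a rough, non-private sketch of the level-set idea described above (helper names are ours; the actual mechanism adds calibrated noise and runs in sublinear space), one can bucket coordinates into geometric magnitude levels and recover norms from the level sizes alone:

```python
import math
from collections import Counter

def level_decomposition(freq):
    """Partition the nonzero coordinates of a frequency vector into
    geometric levels by magnitude: level i holds coordinates with
    |f| in [2**i, 2**(i+1)).  Returns level -> number of coordinates."""
    levels = Counter()
    for f in freq:
        if f != 0:
            levels[int(math.floor(math.log2(abs(f))))] += 1
    return dict(levels)

def norm_from_levels(levels, p=2):
    """Approximate the L_p norm from level sizes alone, treating every
    coordinate in level i as having magnitude 2**i (exact when all
    frequencies are powers of two, a 2-approximation per level otherwise)."""
    return sum(n * (2.0 ** i) ** p for i, n in levels.items()) ** (1.0 / p)
```

The point of the framework is that the same released level statistics serve every symmetric norm at once, which is why an arbitrary number of norm approximations costs no extra privacy.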
Local Differentially Private Heavy Hitter Detection in Data Streams with Bounded Memory
Top-$k$ frequent item detection is a fundamental task in data stream mining.
Many promising solutions have been proposed to improve memory efficiency while
still maintaining high accuracy for detecting the Top-$k$ items. Beyond the
memory-efficiency concern, however, users could suffer privacy loss by
participating in the task without proper protection, since their contributed
local data streams may continually leak sensitive individual information. Most
existing works focus on either the memory-efficiency problem or the privacy
concerns, but seldom both jointly, and so cannot achieve a satisfactory
tradeoff between memory efficiency, privacy protection, and detection accuracy.
In this paper, we present a novel framework HG-LDP to achieve accurate
Top-$k$ item detection at bounded memory expense, while providing rigorous
local differential privacy (LDP) protection. Specifically, we identify two key
challenges naturally arising in the task, which reveal that directly applying
existing LDP techniques will lead to an inferior ``accuracy-privacy-memory
efficiency'' tradeoff. Therefore, we instantiate three advanced schemes under
the framework by designing novel LDP randomization methods, which address the
hurdles caused by the large item domain and the limited memory. We conduct
comprehensive experiments on both synthetic and real-world datasets to show
that the proposed schemes achieve a superior ``accuracy-privacy-memory
efficiency'' tradeoff, saving memory over baseline methods when the item
domain is large. Our code is open-sourced.
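HG-LDP's tailored randomizers are not specified in the abstract. For orientation, here is generalized randomized response (GRR), a standard LDP frequency primitive of the kind the abstract says is insufficient on its own for large domains (helper names are ours):

```python
import math
import random
from collections import Counter

def grr_perturb(item, domain, eps, rng=random):
    """Generalized randomized response: report the true item with
    probability e^eps / (e^eps + d - 1), otherwise a uniformly random
    other item.  Satisfies eps-LDP, but its error grows with the domain
    size d, which is the large-domain hurdle noted above."""
    d = len(domain)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    if rng.random() < p:
        return item
    return rng.choice([x for x in domain if x != item])

def grr_estimate(reports, domain, eps):
    """Debias the observed report counts into frequency estimates."""
    d, n = len(domain), len(reports)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = (1 - p) / (d - 1)
    obs = Counter(reports)
    return {x: (obs[x] - n * q) / (p - q) for x in domain}
```
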
SoK: Chasing Accuracy and Privacy, and Catching Both in Differentially Private Histogram Publication
Histograms and synthetic data are of key importance in data analysis. However, researchers have shown that even aggregated data such as histograms, containing no obvious sensitive attributes, can result in privacy leakage. To enable data analysis, a strong notion of privacy is required to avoid risking unintended privacy violations. Such a strong notion of privacy is differential privacy, a statistical notion of privacy that makes privacy leakage quantifiable. The caveat regarding differential privacy is that while it has strong guarantees for privacy, privacy comes at a cost of accuracy. Despite this trade-off being a central and important issue in the adoption of differential privacy, there exists a gap in the literature regarding an understanding of the trade-off and how to address it appropriately. Through a systematic literature review (SLR), we investigate the state of the art in accuracy-improving differentially private algorithms for histogram and synthetic data publishing. Our contribution is two-fold: 1) we identify trends and connections in the contributions to the field of differential privacy for histograms and synthetic data, and 2) we provide an understanding of the privacy/accuracy trade-off challenge by crystallizing different dimensions of accuracy improvement. Accordingly, we position and visualize the ideas in relation to each other and to external work, and deconstruct each algorithm to examine its building blocks separately with the aim of pinpointing which dimension of accuracy improvement each technique/approach targets. Hence, this systematization of knowledge (SoK) provides an understanding of the dimensions in which, and how, accuracy improvement can be pursued without sacrificing privacy.
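As a concrete baseline for the privacy/accuracy trade-off the survey studies, the textbook Laplace mechanism for histogram release (a sketch assuming add/remove adjacency, so one individual changes a single bin by at most 1 and the L1 sensitivity is 1; the surveyed algorithms refine this baseline for better accuracy):

```python
import random

def dp_histogram(counts, eps, rng=random):
    """Release a histogram under eps-differential privacy by adding
    Laplace(1/eps) noise to every bin.  The noise is sampled as the
    difference of two Exponential(eps) variates, which is distributed
    Laplace(0, 1/eps)."""
    noisy = {}
    for bin_label, c in counts.items():
        noise = rng.expovariate(eps) - rng.expovariate(eps)
        noisy[bin_label] = c + noise
    return noisy
```

Smaller eps means stronger privacy and noisier bins, which is the trade-off the SoK's dimensions of accuracy improvement all attack from different directions.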