Data Analytics with Differential Privacy
Differential privacy is the state-of-the-art definition for privacy,
guaranteeing that any analysis performed on a sensitive dataset leaks no
information about the individuals whose data are contained therein. In this
thesis, we develop differentially private algorithms to analyze distributed and
streaming data. In the distributed model, we consider the particular problem of
learning -- in a distributed fashion -- a global model of the data, that can
subsequently be used for arbitrary analyses. We build upon PrivBayes, a
differentially private method that approximates the high-dimensional
distribution of a centralized dataset as a product of low-order distributions,
utilizing a Bayesian Network model. We examine three novel approaches to
learning a global Bayesian Network from distributed data, while offering the
differential privacy guarantee to all local datasets. Our work includes a
detailed theoretical analysis of the distributed, differentially private
entropy estimator which we use in one of our algorithms, as well as a detailed
experimental evaluation, using both synthetic and real-world data. In the
streaming model, we focus on the problem of estimating the density of a stream
of users, which expresses the fraction of all users that actually appear in the
stream. We offer one of the strongest privacy guarantees for the streaming
model, user-level pan-privacy, which ensures that the privacy of any user is
protected, even against an adversary that observes the internal state of the
algorithm. We provide a detailed analysis of an existing, sampling-based
algorithm for the problem and propose two novel modifications that
significantly improve it, both theoretically and experimentally, by optimally
using all the allocated "privacy budget."
Comment: Diploma Thesis, School of Electrical and Computer Engineering,
Technical University of Crete, Chania, Greece, 201
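The density statistic described above can be illustrated with a short sketch. This is not the thesis's algorithm: it is a minimal randomized-bit estimator, assuming users are integers in `range(universe_size)` and `epsilon <= 2`, and it omits both the user sampling step and any noise added to the final output. The function names are hypothetical.

```python
import random

def exact_density(stream, universe_size):
    """Fraction of the universe's users that appear at least once in the stream."""
    return len(set(stream)) / universe_size

def noisy_density(stream, universe_size, epsilon, rng):
    """Randomized-bit density estimate (illustrative sketch only).

    Each user's bit starts as a fair coin and is redrawn with bias
    1/2 + epsilon/4 whenever that user appears, so the stored state
    never records presence exactly.
    """
    bits = [rng.random() < 0.5 for _ in range(universe_size)]
    p_seen = 0.5 + epsilon / 4.0
    for user in stream:
        bits[user] = rng.random() < p_seen
    frac_ones = sum(bits) / universe_size
    # E[frac_ones] = 1/2 + density * epsilon / 4, so invert that relation.
    return 4.0 * (frac_ones - 0.5) / epsilon
```

The estimate is unbiased under these assumptions, but its variance grows as `epsilon` shrinks, which is the accuracy cost of hiding each user's presence in the internal state.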
Gozar: NAT-friendly Peer Sampling with One-Hop Distributed NAT Traversal
Gossip-based peer sampling protocols have been widely used as a building block for many large-scale distributed applications. However, Network Address Translation gateways (NATs) cause most existing gossiping protocols to break down, as nodes cannot establish direct connections to nodes behind NATs (private nodes). In addition, most of the existing NAT traversal algorithms for establishing connectivity to private nodes rely on third-party servers running at well-known, public IP addresses. In this paper, we present Gozar, a gossip-based peer sampling service that: (i) provides uniform random samples in the presence of NATs, and (ii) enables direct connectivity to sampled nodes using a fully distributed NAT traversal service, where connection messages require only a single hop to connect to private nodes. We show in simulation that Gozar preserves the randomness properties of a gossip-based peer sampling service. We show the robustness of Gozar when a large fraction of nodes reside behind NATs and also in catastrophic failure scenarios. For example, if 80% of nodes are behind NATs, and 80% of the nodes fail, more than 92% of the remaining nodes stay connected. In addition, we compare Gozar with existing NAT-friendly gossip-based peer sampling services, Nylon and ARRG. We show that Gozar is the only system that supports one-hop NAT traversal, and its overhead is roughly half of Nylon’s.
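For readers unfamiliar with gossip-based peer sampling, the following is a minimal sketch of a generic view-exchange round. It is not Gozar's protocol: it ignores NATs entirely and assumes every node can reach every other node directly. The names `merge`, `gossip_round`, and `view_size` are hypothetical.

```python
import random

def merge(view, incoming, owner, view_size, rng):
    """Union the old view with received entries, drop self, truncate at random."""
    merged = (set(view) | set(incoming)) - {owner}
    if len(merged) > view_size:
        merged = set(rng.sample(sorted(merged), view_size))
    return merged

def gossip_round(views, view_size, rng):
    """One round: every node swaps part of its view with a peer from its view."""
    k = view_size // 2
    for node in list(views):
        if not views[node]:
            continue
        peer = rng.choice(sorted(views[node]))
        # Each side sends a random sample of its view plus its own address.
        sent = rng.sample(sorted(views[node]), min(k, len(views[node]))) + [node]
        received = rng.sample(sorted(views[peer]), min(k, len(views[peer]))) + [peer]
        views[node] = merge(views[node], received, node, view_size, rng)
        views[peer] = merge(views[peer], sent, peer, view_size, rng)
```

Starting from a ring in which each node knows only its successor, repeated rounds spread addresses through the network while each view stays bounded by `view_size`.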
Towards Tight Bounds for the Streaming Set Cover Problem
We consider the classic Set Cover problem in the data stream model. For $n$
elements and $m$ sets ($m \geq n$) we give an $O(1/\delta)$-pass algorithm with
strongly sub-linear $\tilde{O}(mn^{\delta})$ space and a logarithmic
approximation factor. This yields a significant improvement over the earlier
algorithm of Demaine et al. [DIMV14], which uses an exponentially larger number
of passes. We complement this result by showing that the tradeoff between the
number of passes and space exhibited by our algorithm is tight, at least when
the approximation factor is equal to $1$. Specifically, we show that any
algorithm that computes set cover exactly using $(\frac{1}{2\delta} - 1)$ passes
must use $\tilde{\Omega}(mn^{\delta})$ space in the regime of $m = O(n)$.
Furthermore, we consider the problem in the geometric setting where the
elements are points in $\mathbb{R}^2$ and sets are either discs, axis-parallel
rectangles, or fat triangles in the plane, and show that our algorithm (with a
slight modification) uses the optimal $\tilde{O}(n)$ space to find a
logarithmic approximation in $O(1/\delta)$ passes.
Finally, we show that any randomized one-pass algorithm that distinguishes
between covers of size 2 and 3 must use a linear (i.e., $\Omega(mn)$) amount of
space. This is the first result showing that a randomized, approximate
algorithm cannot achieve a space bound that is sublinear in the input size.
This indicates that using multiple passes might be necessary in order to
achieve sub-linear space bounds for this problem while guaranteeing small
approximation factors.
Comment: A preliminary version of this paper is to appear in PODS 201
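As background for the logarithmic approximation factor mentioned above, here is a sketch of the classic offline greedy algorithm for Set Cover. It is not the paper's streaming algorithm: it assumes all sets fit in memory, and a naive streaming version would need one pass over the sets per chosen set. The function name is hypothetical.

```python
def greedy_set_cover(universe, sets):
    """Return indices of sets chosen greedily by number of newly covered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the most still-uncovered elements.
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

cover = greedy_set_cover(
    {1, 2, 3, 4, 5, 6},
    [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6}],
)
print(cover)  # [0, 3]
```

The pass-versus-space tradeoff studied in the abstract arises because this greedy loop, run directly on a stream, would take as many passes as the size of the cover it returns.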
Private Incremental Regression
Data is continuously generated by modern data sources, and a recent challenge
in machine learning has been to develop techniques that perform well in an
incremental (streaming) setting. In this paper, we investigate the problem of
private machine learning, where, as is common in practice, the data is not given
all at once, but rather arrives incrementally over time.
We introduce the problems of private incremental ERM and private incremental
regression where the general goal is to always maintain a good empirical risk
minimizer for the history observed under differential privacy. Our first
contribution is a generic transformation of private batch ERM mechanisms into
private incremental ERM mechanisms, based on a simple idea of invoking the
private batch ERM procedure at some regular time intervals. We take this
construction as a baseline for comparison. We then provide two mechanisms for
the private incremental regression problem. Our first mechanism is based on
privately constructing a noisy incremental gradient function, which is then
used in a modified projected gradient procedure at every timestep. This
mechanism has an excess empirical risk of $\approx \sqrt{d}$, where $d$ is the
dimensionality of the data. While from the results of [Bassily et al. 2014]
this bound is tight in the worst-case, we show that certain geometric
properties of the input and constraint set can be used to derive significantly
better results for certain interesting regression problems.
Comment: To appear in PODS 201
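The baseline transformation described above (re-invoking a batch procedure at regular intervals) can be sketched as follows. This is only an illustration of the control flow: `batch_solver` is a hypothetical placeholder for a private batch ERM mechanism, the example solver below is a plain non-private mean, and the sketch does not account for how the privacy budget composes across invocations.

```python
def incremental_from_batch(stream, batch_solver, interval):
    """Re-run a batch solver on the full history every `interval` timesteps.

    Returns the model in effect after each arrival; None until the first run.
    """
    history = []
    current = None
    outputs = []
    for t, point in enumerate(stream, start=1):
        history.append(point)
        if t % interval == 0:
            current = batch_solver(history)
        outputs.append(current)
    return outputs

def mean_solver(history):
    """Non-private stand-in: the least-squares constant fit to the history."""
    return sum(history) / len(history)

print(incremental_from_batch([1, 2, 3, 4], mean_solver, interval=2))
# [None, 1.5, 1.5, 2.5]
```

Between invocations the output is stale, which is one reason the abstract treats this construction as a baseline for comparison.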
Fast Private Data Release Algorithms for Sparse Queries
We revisit the problem of accurately answering large classes of statistical
queries while preserving differential privacy. Previous approaches to this
problem have either been very general but have not had run-time polynomial in
the size of the database, have applied only to very limited classes of queries,
or have relaxed the notion of worst-case error guarantees. In this paper we
consider the large class of sparse queries, which take non-zero values on only
polynomially many universe elements. We give efficient query release algorithms
for this class, in both the interactive and the non-interactive setting. Our
algorithms also achieve better accuracy bounds than previous general techniques
do when applied to sparse queries: our bounds are independent of the universe
size. In fact, even the runtime of our interactive mechanism is independent of
the universe size, and so can be implemented in the "infinite universe" model
in which no finite universe need be specified by the data curator.
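To make the notion of a sparse query concrete, the sketch below evaluates a linear query that is non-zero on only a few universe elements. It is a non-private illustration, not one of the paper's release algorithms: the query is assumed to be given as a dictionary from universe elements to values, and the function name is hypothetical.

```python
from collections import Counter

def answer_sparse_query(database, query_support):
    """Average value of a sparse query over the database rows.

    `query_support` maps each universe element with a non-zero query value
    to that value; every other element implicitly has value 0.
    """
    counts = Counter(database)
    # Only the query's support is touched, never the full universe.
    total = sum(counts[x] * value for x, value in query_support.items())
    return total / len(database)

print(answer_sparse_query(["a", "b", "a", "c"], {"a": 1.0}))  # 0.5
```

Because the evaluation iterates over the query's support alone, its cost does not depend on the universe size, which is why no finite universe has to be enumerated.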