Blowfish Privacy: Tuning Privacy-Utility Trade-offs using Policies
Privacy definitions provide ways of trading off the privacy of individuals
in a statistical database for the utility of downstream analysis of the data.
In this paper, we present Blowfish, a class of privacy definitions inspired by
the Pufferfish framework, that provides a rich interface for this trade-off. In
particular, we allow data publishers to extend differential privacy using a
policy, which specifies (a) secrets, or information that must be kept secret,
and (b) constraints that may be known about the data. While the secret
specification allows increased utility by lessening protection for certain
individual properties, the constraint specification provides added protection
against an adversary who knows correlations in the data (arising from
constraints). We formalize policies and present novel algorithms that can
handle general specifications of sensitive information and certain count
constraints. We show that there are reasonable policies under which our privacy
mechanisms for k-means clustering, histograms and range queries introduce
significantly less noise than their differentially private counterparts. We
quantify the privacy-utility trade-offs for various policies analytically and
empirically on real datasets. Comment: Full version of the paper at SIGMOD'14, Snowbird, Utah, USA
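As a rough illustration of the kind of noise reduction a policy can buy, the sketch below contrasts a standard Laplace-mechanism histogram release with one using a smaller, policy-dependent sensitivity. The policy sensitivity of 0.5 and all names are illustrative, not the paper's actual mechanism.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Release a noisy answer; noise scale = sensitivity / epsilon."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

# Standard differential privacy gives a histogram query sensitivity 1.
# Under a hypothetical Blowfish-style policy that declares only some
# properties secret, the effective sensitivity -- and so the noise -- can
# shrink; the 0.5 below is purely illustrative.
hist = np.array([120.0, 340.0, 95.0, 210.0])
eps = 1.0
noisy_standard = [laplace_mechanism(c, sensitivity=1.0, epsilon=eps) for c in hist]
noisy_policy = [laplace_mechanism(c, sensitivity=0.5, epsilon=eps) for c in hist]
```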
Low-Rank Mechanism: Optimizing Batch Queries under Differential Privacy
Differential privacy is a promising privacy-preserving paradigm for
statistical query processing over sensitive data. It works by injecting random
noise into each query result, such that it is provably hard for the adversary
to infer the presence or absence of any individual record from the published
noisy results. The main objective in differentially private query processing is
to maximize the accuracy of the query results, while satisfying the privacy
guarantees. Previous work, notably the matrix mechanism, has suggested that
processing a batch of correlated queries as a whole can potentially achieve
considerable accuracy gains, compared to answering them individually. However,
as we point out in this paper, the matrix mechanism is mainly of theoretical
interest; in particular, several inherent problems in its design limit its
accuracy in practice, which almost never exceeds that of naive methods. In
fact, we are not aware of any existing solution that can effectively optimize a
query batch under differential privacy. Motivated by this, we propose the
Low-Rank Mechanism (LRM), the first practical differentially private technique
for answering batch queries with high accuracy, based on a low rank
approximation of the workload matrix. We prove that the accuracy provided by
LRM is close to the theoretical lower bound for any mechanism to answer a batch
of queries under differential privacy. Extensive experiments using real data
demonstrate that LRM consistently outperforms state-of-the-art query processing
solutions under differential privacy, by large margins. Comment: VLDB201
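A minimal sketch of the low-rank idea, using a truncated SVD as a stand-in for LRM's optimized factorization (the paper solves for the factors B and L directly; the workload and rank here are illustrative): noise is added to the few projected counts L·x rather than to every query answer, and all answers are reconstructed through B.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative workload of 15 short range queries over an 8-bin histogram.
n_bins = 8
W = np.array([[1.0 if lo <= j <= hi else 0.0 for j in range(n_bins)]
              for lo in range(n_bins) for hi in range(lo, lo + 2) if hi < n_bins])

# Low-rank factorization W ~= B @ L via truncated SVD (a stand-in for
# LRM's optimized factorization).
r = 4
U, s, Vt = np.linalg.svd(W, full_matrices=False)
B = U[:, :r] * s[:r]
L = Vt[:r, :]

x = rng.integers(0, 100, size=n_bins).astype(float)
epsilon = 1.0
sensitivity = np.abs(L).sum(axis=0).max()   # max L1 column norm of L
noisy_Lx = L @ x + rng.laplace(0.0, sensitivity / epsilon, size=r)
answers = B @ noisy_Lx                      # reconstruct all 15 answers
```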
TRDB—The Tandem Repeats Database
Tandem repeats in DNA have been under intensive study for many years, first as a consequence of their usefulness as genomic markers and DNA fingerprints, and more recently as their role in human disease and regulatory processes has become apparent. The Tandem Repeats Database (TRDB) is a public repository of information on tandem repeats in genomic DNA. It contains a variety of tools for repeat analysis, including the Tandem Repeats Finder program, query and filtering capabilities, repeat clustering, polymorphism prediction, PCR primer selection, data visualization, and data download in a variety of formats. In addition, TRDB serves as a centralized research workbench: it provides user storage space and permits collaborators to privately share their data and analyses. TRDB is available at
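For intuition, a naive scan for perfect tandem repeats might look like the sketch below. The actual Tandem Repeats Finder program is far more sophisticated (it tolerates mismatches and indels and uses statistical criteria), so this is only a toy stand-in.

```python
def find_tandem_repeats(seq, min_period=2, max_period=6, min_copies=3):
    """Naive scan for perfect tandem repeats: for each candidate period p,
    count exact consecutive copies of the p-mer starting at each position.
    Returns (start_index, repeat_unit, copy_count) tuples."""
    hits = []
    n = len(seq)
    for p in range(min_period, max_period + 1):
        i = 0
        while i + p <= n:
            unit = seq[i:i + p]
            copies = 1
            while seq[i + copies * p: i + (copies + 1) * p] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, unit, copies))
                i += copies * p  # skip past the repeat we just reported
            else:
                i += 1
    return hits
```

For example, `find_tandem_repeats("AAACAGCAGCAGCAGTTT")` reports the CAG repeat at position 3 with four copies.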
PRIVACY PRESERVING DATA MINING FOR NUMERICAL MATRICES, SOCIAL NETWORKS, AND BIG DATA
Motivated by increasing public awareness of the possible abuse of confidential information, widely considered a significant hindrance to the development of e-society and of the medical and financial markets, a privacy-preserving data mining framework is presented so that data owners can carefully process data in order to preserve confidential information while guaranteeing information functionality within an acceptable boundary.
First, among many privacy-preserving methodologies, a class of data perturbation methods, a popular group of techniques for balancing data utility and information privacy, adds a noise signal drawn from a statistical distribution to an original numerical matrix. Through analysis in the eigenspace of the perturbed data, a popular data perturbation scheme is shown to be potentially vulnerable even when only very little information leaks from the privacy-preserving database. This vulnerability to very small leaks is proved theoretically and illustrated experimentally.
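The vulnerability can be illustrated with a short sketch: additive Gaussian perturbation of a low-rank matrix, followed by a spectral (eigenspace) filtering attack that strips most of the noise. Dimensions, rank, and noise level are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative low-rank "private" data: 200 records, 10 attributes, rank 2.
n, d, k = 200, 10, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))

# Publish Y = X + N, additive i.i.d. Gaussian perturbation.
sigma = 0.5
Y = X + rng.normal(0.0, sigma, size=(n, d))

# Eigenspace analysis: the signal lives in a few principal directions, so
# projecting Y onto its top-k singular subspace filters out most of the
# noise and recovers a close estimate of the original matrix.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_raw = np.linalg.norm(Y - X) / np.linalg.norm(X)
err_filtered = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
assert err_filtered < err_raw  # filtering strips most of the added noise
```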
Second, in addition to numerical matrices, social networks play a critical role in the modern e-society. Security and privacy in social networks have received much attention following recent security scandals involving popular social network service providers. The need to protect confidential information from disclosure therefore motivates the development of multiple privacy-preserving techniques for social networks.
Affinities (or weights) attached to edges are private, and their disclosure can compromise personal security. To protect the privacy of social networks, several algorithms are proposed, including Gaussian perturbation, a greedy algorithm, and a probabilistic random-walk algorithm. These can quickly modify the original data at large scale to satisfy different privacy requirements.
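A minimal sketch of the Gaussian-perturbation variant, assuming edge affinities stored in a dict; the dissertation's actual algorithms additionally control utility (e.g. preserving graph structure), so this only conveys the basic operation.

```python
import random

def perturb_edge_weights(edges, sigma, seed=None):
    """Add Gaussian noise to private edge affinities, clamping at zero so
    weights stay valid. A minimal sketch of the Gaussian-perturbation idea."""
    rnd = random.Random(seed)
    return {e: max(0.0, w + rnd.gauss(0.0, sigma)) for e, w in edges.items()}

graph = {("alice", "bob"): 0.9, ("bob", "carol"): 0.2, ("alice", "carol"): 0.5}
noisy = perturb_edge_weights(graph, sigma=0.1, seed=7)
```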
Third, the era of big data is arriving in both industry and academia, as the quantity of collected data grows exponentially. Three issues in privacy-preserving big data are studied: obtaining high confidence about the accuracy of specific differentially private queries; speedily and accurately updating a private summary of a binary stream with I/O-awareness; and launching mutually private information retrieval for big data. All three issues are handled with two core tools: differential privacy and the Chernoff bound.
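For reference, the multiplicative Chernoff bound used as one of the two backbones can be stated and evaluated directly; the stream length and parameters below are illustrative.

```python
import math

def chernoff_upper_tail(n, p, delta):
    """Multiplicative Chernoff bound:
    P[X >= (1 + delta) * n * p] <= exp(-n * p * delta**2 / 3)
    for X a sum of n independent Bernoulli(p) variables, 0 < delta <= 1."""
    return math.exp(-n * p * delta ** 2 / 3.0)

# E.g.: how likely is the count of 1-bits in a binary stream of length
# 10_000 with p = 0.5 to exceed its mean by 10%?
bound = chernoff_upper_tail(10_000, 0.5, 0.10)  # ~5.8e-8
```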
Continuous Release of Data Streams under both Centralized and Local Differential Privacy
In this paper, we study the problem of publishing a stream of real-valued
data satisfying differential privacy (DP). One major challenge is that the
maximal possible value can be quite large; thus it is necessary to estimate a
threshold above which values are truncated, reducing the amount of noise that
must be added to the data. The estimation must itself be carried out privately,
based on the data. We develop such a method that uses the Exponential
Mechanism with a quality function that approximates well the utility goal while
maintaining a low sensitivity. Given the threshold, we then propose a novel
online hierarchical method and several post-processing techniques.
Building on these ideas, we formalize the steps into a framework for private
publishing of stream data. Our framework consists of three components: a
threshold optimizer that privately estimates the threshold, a perturber that
adds calibrated noises to the stream, and a smoother that improves the result
using post-processing. Within our framework, we design an algorithm satisfying
the more stringent setting of DP called local DP (LDP). To our knowledge, this
is the first LDP algorithm for publishing streaming data. Using four real-world
datasets, we demonstrate that our mechanism outperforms the state of the art
by 6-10 orders of magnitude in terms of utility (measured by the mean
squared error of answering a random range query).
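The three-component pipeline can be caricatured as follows; the private threshold selection via the Exponential Mechanism is omitted (the threshold is passed in), and the smoother is reduced to simple clamping, so this is only a structural sketch.

```python
import math
import random

def laplace_noise(scale, rnd):
    """One Laplace(0, scale) draw via inverse-CDF sampling."""
    u = rnd.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def release_stream(stream, epsilon, threshold, seed=None):
    """Structural sketch of the framework: truncate each value at the
    (here, pre-chosen) threshold, perturb with Laplace noise calibrated
    to that threshold, then smooth by clamping to the valid range."""
    rnd = random.Random(seed)
    out = []
    for v in stream:
        truncated = min(v, threshold)                                # truncation
        noisy = truncated + laplace_noise(threshold / epsilon, rnd)  # perturber
        out.append(max(0.0, noisy))                                  # smoother
    return out

released = release_stream([1.0, 5.0, 100.0], epsilon=1.0, threshold=10.0, seed=0)
```

Truncation caps the per-value sensitivity at the threshold, which is exactly why a well-chosen threshold trades a little truncation bias for much less noise.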
Novel Approaches to Preserving Utility in Privacy Enhancing Technologies
Significant amounts of individual information are being collected and analyzed today through a wide variety of applications across different industries. While pursuing better utility by discovering
knowledge from the data, an individual's privacy may be compromised during an analysis: corporate networks monitor their online behavior, advertising companies collect and share their private
information, and cybercriminals cause financial damage through security breaches. To this end,
the data typically goes under certain anonymization techniques, e.g., CryptoPAn [Computer Networks’04], which replaces real IP addresses with prefix-preserving pseudonyms, or Differentially
Private (DP) [ICALP’06] techniques which modify the answer to a query by adding a zero-mean
noise distributed according to, e.g., a Laplace distribution. Unfortunately, most such techniques
either are vulnerable to adversaries with prior knowledge, e.g., some network flows in the data, or
require heavy data sanitization or perturbation, both of which may result in a significant loss of data
utility. Therefore, the fundamental trade-off between privacy and utility (i.e., analysis accuracy) has
attracted significant attention in various settings [ICALP’06, ACM CCS’14]. In line with this track
of research, in this dissertation we aim to build utility-maximized and privacy-preserving tools for
Internet communications. Such tools can be employed not only by dissidents and whistleblowers,
but also by ordinary Internet users on a daily basis. To this end, we combine the development of
practical systems with rigorous theoretical analysis, and incorporate techniques from various disciplines such as computer networking, cryptography, and statistical analysis. During the research,
we propose three frameworks in well-known settings, outlined as follows.
First, the Multi-view Approach preserves both privacy and utility in network trace
anonymization. Second, the R2DP Approach is a novel technique for differentially private
mechanism design with maximized utility. Third, the DPOD Approach is a novel framework for privacy-preserving anomaly detection in the outsourcing setting.
CoVault: A Secure Analytics Platform
Analytics on personal data, such as individuals' mobility, financial, and
health data can be of significant benefit to society. Such data is already
collected by smartphones, apps and services today, but liberal societies have
so far refrained from making it available for large-scale analytics. Arguably,
this is due at least in part to the lack of an analytics platform that can
secure data through transparent, technical means (ideally with decentralized
trust), enforce source policies, handle millions of distinct data sources, and
run queries on billions of records with acceptable query latencies. To bridge
this gap, we present an analytics platform called CoVault which combines secure
multi-party computation (MPC) with trusted execution environment (TEE)-based
delegation of trust to be able to execute approved queries on encrypted data
contributed by individuals within a datacenter to achieve the above properties.
We show that CoVault scales well despite the high cost of MPC. For example,
CoVault can process data relevant to epidemic analytics for a country of 80M
people (about 11.85B data records/day) on a continuous basis using a core pair
for every 20,000 people. Compared to a state-of-the-art MPC-based platform,
CoVault processes queries between 7 and over 100 times faster, and scales
to many sources and to big data. Comment: 13 pages, 6 figures
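For intuition about the MPC side, additive secret sharing, a basic primitive underlying MPC engines like the one CoVault builds on, can be sketched as follows; CoVault's actual protocols are far richer, and the modulus and names here are illustrative.

```python
import random

MODULUS = 2**61 - 1  # a large prime modulus (illustrative)

def share(secret, n_parties, rnd=random):
    """Split `secret` into additive shares: any n-1 shares look uniformly
    random; only the sum of all n reveals the secret."""
    shares = [rnd.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS

# Linear queries (e.g. counts and sums for epidemic analytics) can be
# computed share-wise: parties add their shares locally, and reconstruction
# of the share-wise sums yields the aggregate without exposing inputs.
a, b = 12345, 67890
sa, sb = share(a, 3), share(b, 3)
s_sum = [(x + y) % MODULUS for x, y in zip(sa, sb)]
assert reconstruct(s_sum) == a + b
```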
Vertical Federated Learning
Vertical Federated Learning (VFL) is a federated learning setting where
multiple parties with different features about the same set of users jointly
train machine learning models without exposing their raw data or model
parameters. Motivated by the rapid growth in VFL research and real-world
applications, we provide a comprehensive review of the concept and algorithms
of VFL, as well as current advances and challenges in various aspects,
including effectiveness, efficiency, and privacy. We provide an exhaustive
categorization for VFL settings and privacy-preserving protocols and
comprehensively analyze the privacy attacks and defense strategies for each
protocol. In the end, we propose a unified framework, termed VFLow, which
considers the VFL problem under communication, computation, privacy, and
effectiveness constraints. Finally, we review the most recent advances in
industrial applications, highlighting open challenges and future directions for
VFL.
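The vertical-partitioning setting itself is easy to sketch: parties hold disjoint feature columns for the same users and exchange only intermediate scores. Real VFL protocols protect these intermediates cryptographically or with perturbation; everything below, including the dimensions, is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two parties hold disjoint feature columns for the same five users.
n_users = 5
X_a = rng.normal(size=(n_users, 3))  # party A's 3 features
X_b = rng.normal(size=(n_users, 2))  # party B's 2 features
w_a = rng.normal(size=3)             # party A's local model slice
w_b = rng.normal(size=2)             # party B's local model slice

# Each party computes a partial score locally; only partial scores are
# exchanged, never raw features or full model parameters.
logits = X_a @ w_a + X_b @ w_b
preds = (logits > 0).astype(int)
```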