Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads, in order to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. In doing so, Pyramid introduces both the idea and a
proof of concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrate Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches the
accuracy of state-of-the-art models while training on less than 1% of the raw data.
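For intuition, here is a minimal sketch of plain count featurization, the training set minimization technique Pyramid builds on: each categorical value is replaced by label co-occurrence statistics gathered from past data, so models can train on compact count-derived features rather than on raw records. The function and data below are hypothetical, and Pyramid's actual system layers further protections on top of this basic idea.

```python
from collections import defaultdict

def count_featurize(rows, labels):
    # Learn per-column, per-value label counts from historical data.
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for row, y in zip(rows, labels):
        for col, val in enumerate(row):
            counts[col][val][y] += 1

    def transform(row):
        # Replace each categorical value with the empirical probability
        # of the positive label given that value (a count-derived feature).
        feats = []
        for col, val in enumerate(row):
            c = counts[col][val]
            total = sum(c.values()) or 1  # avoid division by zero for unseen values
            feats.append(c[1] / total)
        return feats

    return transform

# Hypothetical usage: two categorical columns, binary click labels.
rows = [("ad_42", "mobile"), ("ad_42", "desktop"), ("ad_7", "mobile")]
labels = [1, 0, 1]
transform = count_featurize(rows, labels)
print(transform(("ad_42", "mobile")))  # -> [0.5, 1.0]
```

Because the model only ever sees these aggregate counts, the raw records they were derived from can be moved to cold, highly protected storage, which is the selectivity property the abstract describes.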
Enabling Privacy-Preserving GWAS in Heterogeneous Human Populations
The projected increase of genotyping in the clinic and the rise of large
genomic databases have led to the possibility of using patient medical data to
perform genome-wide association studies (GWAS) on a larger scale and at a lower
cost than ever before. Due to privacy concerns, however, access to this data is
limited to a few trusted individuals, greatly reducing its impact on biomedical
research. Privacy-preserving methods have been suggested as a way of allowing
more people access to this precious data while protecting patients. In
particular, there has been growing interest in applying the concept of
differential privacy to GWAS results. Unfortunately, previous approaches for
performing differentially private GWAS are based on rather simple statistics
that have some major limitations. Most notably, they do not correct for
population stratification, a major issue when dealing with the genetically
diverse populations present in modern GWAS. To address this concern, we
introduce a novel computational framework for performing GWAS that tailors
ideas from differential privacy to protect private phenotype information, while
at the same time correcting for population stratification. This framework
allows us to produce privacy-preserving GWAS results based on two of the most
commonly used GWAS statistics: EIGENSTRAT and linear mixed model (LMM) based
statistics. We test our differentially private statistics, PrivSTRAT and
PrivLMM, on both simulated and real GWAS datasets and find that they are able
to protect privacy while returning meaningful GWAS results.
Comment: To be presented at RECOMB 201
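For a concrete sense of the mechanism family involved, the sketch below applies the standard Laplace mechanism to a single per-SNP association score. This is a generic illustration, not PrivSTRAT or PrivLMM themselves (those tailor the noise to stratification-corrected statistics); the statistic value, sensitivity bound, and privacy budget are assumed for the example.

```python
import numpy as np

def laplace_release(stat_value, sensitivity, epsilon, rng=None):
    # Laplace mechanism: adding noise with scale sensitivity/epsilon
    # yields an epsilon-differentially-private release of the statistic.
    rng = rng or np.random.default_rng()
    return stat_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical usage: privatize one SNP's association score.
chi_sq = 8.3         # hypothetical EIGENSTRAT-style test statistic
sensitivity = 1.0    # assumed bound on any one patient's influence
epsilon = 0.5        # privacy budget for this release
print(laplace_release(chi_sq, sensitivity, epsilon))
```

The hard part in the GWAS setting, which the paper addresses, is bounding the sensitivity of statistics that already incorporate a stratification correction; the constant used above is purely illustrative.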
Measuring Membership Privacy on Aggregate Location Time-Series
While location data is extremely valuable for various applications,
disclosing it poses serious threats to individuals' privacy. To limit such
concerns, organizations often provide analysts with aggregate time-series that
indicate, e.g., how many people are in a location at a time interval, rather
than raw individual traces. In this paper, we perform a measurement study to
understand Membership Inference Attacks (MIAs) on aggregate location
time-series, where an adversary tries to infer whether a specific user
contributed to the aggregates.
We find that the volume of contributed data, as well as the regularity and
particularity of users' mobility patterns, play a crucial role in the attack's
success. We experiment with a wide range of defenses based on generalization,
hiding, and perturbation, and evaluate their ability to thwart the attack
vis-a-vis the utility loss they introduce for various mobility analytics tasks.
Our results show that some defenses fail across the board, while others work
for specific tasks on aggregate location time-series. For instance, suppressing
small counts works for ranking hotspots; data generalization for forecasting
traffic, hotspot discovery, and map inference; and sampling for location
labeling and anomaly detection when the dataset is sparse. Differentially
private techniques provide reasonable accuracy only in
very specific settings, e.g., discovering hotspots and forecasting their
traffic, and more so when using weaker privacy notions like crowd-blending
privacy. Overall, our measurements show that no single generic defense can
preserve the utility of the analytics for arbitrary applications, and they
provide useful insights regarding the disclosure of sanitized aggregate
location time-series.
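To make the membership inference game concrete, the toy distinguisher below asks whether an observed aggregate time-series is better explained with or without a candidate user's trace included. Everything here is hypothetical and greatly simplified; attacks in this line of work typically train classifiers over many sampled aggregates with and without the target.

```python
import numpy as np

def membership_score(aggregate, others, target):
    # Compare how well the observed aggregate is explained by the
    # non-target users alone versus with the target's trace added.
    d_in = np.linalg.norm(aggregate - (others + target))
    d_out = np.linalg.norm(aggregate - others)
    return d_out - d_in  # positive => infer the target contributed

# Hypothetical usage: visit counts at one location over four time slots.
others = np.array([10, 12, 9, 11])   # sum of non-target users' visits
target = np.array([1, 0, 1, 1])      # candidate user's trace
observed = others + target           # aggregate released by the provider
print(membership_score(observed, others, target) > 0)  # True
```

The abstract's finding that regular, distinctive mobility patterns make the attack easier maps directly onto this picture: the more structure the target's trace adds to the aggregate, the larger the gap between the two explanations.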