PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn
Preserving privacy of users is a key requirement of web-scale analytics and
reporting applications, and has witnessed a renewed focus in light of recent
data breaches and new regulations such as GDPR. We focus on the problem of
computing robust, reliable analytics in a privacy-preserving manner, while
satisfying product requirements. We present PriPeARL, a framework for
privacy-preserving analytics and reporting, inspired by differential privacy.
We describe the overall design and architecture, and the key modeling
components, focusing on the unique challenges associated with privacy,
coverage, utility, and consistency. We perform an experimental study in the
context of ads analytics and reporting at LinkedIn, thereby demonstrating the
tradeoffs between privacy and utility needs, and the applicability of
privacy-preserving mechanisms to real-world data. We also highlight the lessons
learned from the production deployment of our system at LinkedIn.
Comment: Conference information: ACM International Conference on Information
and Knowledge Management (CIKM 2018)
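The abstract above gives no implementation details, so the following is only a minimal sketch of the general idea it names: a count perturbed with Laplace noise in the style of differential privacy. Seeding the noise from the query parameters is an assumption, shown as one possible way to make repeated identical queries return the same answer (the "consistency" concern). The names `noisy_count`, `query_key`, and `secret` are hypothetical.

```python
import hashlib
import random


def noisy_count(true_count: int, query_key: str, epsilon: float,
                secret: str = "hypothetical-secret") -> int:
    """Return a Laplace-perturbed count that is stable for a given query_key.

    Assumptions (not taken from the abstract): the query has sensitivity 1,
    and the noise is derived deterministically from a secret plus the query
    parameters so that repeating the same query yields the same answer.
    """
    # Derive a reproducible integer seed from the secret and the query.
    digest = hashlib.sha256(f"{secret}|{query_key}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))

    # The difference of two i.i.d. Exponential(rate=epsilon) draws is
    # Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)

    # Round and clamp so the reported count is a non-negative integer.
    return max(0, round(true_count + noise))


first = noisy_count(100, "campaign=42|metric=clicks", epsilon=1.0)
second = noisy_count(100, "campaign=42|metric=clicks", epsilon=1.0)
```

Here `first` and `second` are equal because the seed depends only on the secret and the query key. A smaller `epsilon` widens the noise and so trades utility for privacy.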
Privacy-Friendly Mobility Analytics using Aggregate Location Data
Location data can be extremely useful to study commuting patterns and
disruptions, as well as to predict real-time traffic volumes. At the same time,
however, the fine-grained collection of user locations raises serious privacy
concerns, as this can reveal sensitive information about the users, such as
lifestyle, political and religious inclinations, or even identities. In this
paper, we study the feasibility of crowd-sourced mobility analytics over
aggregate location information: users periodically report their location, using
a privacy-preserving aggregation protocol, so that the server can only recover
aggregates -- i.e., how many, but not which, users are in a region at a given
time. We experiment with real-world mobility datasets obtained from the
Transport For London authority and the San Francisco Cabs network, and present
a novel methodology based on time series modeling that is geared to forecast
traffic volumes in regions of interest and to detect mobility anomalies in
them. In the presence of anomalies, we also make enhanced traffic volume
predictions by feeding our model with additional information from correlated
regions. Finally, we present and evaluate a mobile app prototype, called
Mobility Data Donors (MDD), in terms of computation, communication, and energy
overhead, demonstrating the real-world deployability of our techniques.
Comment: Published at ACM SIGSPATIAL 201
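The abstract does not specify the aggregation protocol, so the sketch below illustrates the stated goal ("how many, but not which") with a generic pairwise-masking scheme. This is an assumption, not the paper's method. Each pair of users shares a random mask that one adds and the other subtracts, so individual reports look random while the masks cancel in the sum. For brevity, all users are simulated in one process. The names `mask_reports`, `aggregate`, and `MODULUS` are hypothetical.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value


def mask_reports(reports, rng):
    """Blind each user's per-region vector with pairwise-cancelling masks.

    reports: list of equal-length lists, one list per user, one entry per
    region (1 if the user is in that region, else 0).
    """
    masked = [list(r) for r in reports]
    n_users = len(reports)
    n_regions = len(reports[0])
    for i in range(n_users):
        for j in range(i + 1, n_users):
            for k in range(n_regions):
                m = rng.randrange(MODULUS)
                masked[i][k] = (masked[i][k] + m) % MODULUS  # user i adds
                masked[j][k] = (masked[j][k] - m) % MODULUS  # user j subtracts
    return masked


def aggregate(masked):
    """Server side: sum the masked vectors; the masks cancel out."""
    return [sum(column) % MODULUS for column in zip(*masked)]


reports = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
masked = mask_reports(reports, random.Random(0))
totals = aggregate(masked)
```

With these three reports, `totals` is `[2, 1, 0]`. In a real deployment each pair of devices would derive its mask from a shared key rather than a common generator, and the scheme would need to handle users who drop out.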
Privacy and accountability for location-based aggregate statistics
A significant and growing class of location-based mobile applications aggregate position data from individual devices at a server and compute aggregate statistics over these position streams. Because these devices can be linked to the movement of individuals, there is significant danger that the aggregate computation will violate the location privacy of individuals. This paper develops and evaluates PrivStats, a system for computing aggregate statistics over location data that simultaneously achieves two properties: first, provable guarantees on location privacy even in the face of any side information about users known to the server, and second, privacy-preserving accountability (i.e., protection against abusive clients uploading large amounts of spurious data). PrivStats achieves these properties using a new protocol for uploading and aggregating data anonymously as well as an efficient zero-knowledge proof of knowledge protocol we developed from scratch for accountability. We implemented our system on Nexus One smartphones and commodity servers. Our experimental results demonstrate that PrivStats is a practical system: computing a common aggregate (e.g., count) over the data of 10,000 clients takes less than 0.46 s at the server and the protocol has modest latency (0.6 s) to upload data from a Nexus phone. We also validated our protocols on real driver traces from the CarTel project.
National Science Foundation (U.S.) (grant 0931550); National Science Foundation (U.S.) (grant 0716273)
Learning Character Strings via Mastermind Queries, with a Case Study Involving mtDNA
We study the degree to which a character string leaks details about
itself any time it engages in comparison protocols with strings provided by a
querier, Bob, even if those protocols are cryptographically guaranteed to
produce no additional information other than the scores that assess the degree
to which the string matches strings offered by Bob. We show that such scenarios
allow Bob to play variants of the game of Mastermind with the string so as to
learn its complete identity. We show that there are a number of efficient
implementations for Bob to employ in these Mastermind attacks, depending on
knowledge he has about the structure of the string, which show how quickly he
can determine it. Indeed, we show that Bob can discover the string using a
number of rounds of test comparisons that is much smaller than the length of
the string, under reasonable assumptions regarding the types of scores that are
returned by the cryptographic protocols and whether he can use knowledge about
the distribution that the string comes from. We also provide the results of a
case study we performed
on a database of mitochondrial DNA, showing the vulnerability of existing
real-world DNA data to the Mastermind attack.
Comment: Full version of related paper appearing in IEEE Symposium on Security
and Privacy 2009, "The Mastermind Attack on Genomic Data." This version
corrects the proofs of what are now Theorems 2 and 4
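To make the leak concrete, here is a deliberately naive sketch of learning a hidden string from match scores alone. It assumes the score is the number of positions where the guess equals the hidden string, and it probes one position at a time. That costs up to 1 + n(|alphabet| - 1) queries, far more than the sublinear attacks the abstract describes, so it is a baseline illustration rather than the paper's algorithm. The names `make_oracle` and `recover` are hypothetical.

```python
def make_oracle(secret):
    """Return a scoring function that reveals only the number of matches."""
    def score(guess):
        return sum(a == b for a, b in zip(secret, guess))
    return score


def recover(score, length, alphabet="ACGT"):
    """Recover the hidden string using only calls to score()."""
    base_char = alphabet[0]
    baseline = [base_char] * length
    base_score = score("".join(baseline))
    queries = 1
    recovered = []
    for i in range(length):
        found = base_char
        for c in alphabet[1:]:
            probe = baseline.copy()
            probe[i] = c
            queries += 1
            s = score("".join(probe))
            if s > base_score:      # position i holds c
                found = c
                break
            if s < base_score:      # position i held base_char all along
                break
        recovered.append(found)
    return "".join(recovered), queries


oracle = make_oracle("GATTACA")
guess, used = recover(oracle, length=7)
```

After this runs, `guess` equals the hidden string even though the oracle never disclosed any character directly.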
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data
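As a rough illustration of count featurization in general (not of Pyramid's implementation, which the abstract does not detail), the sketch below replaces a categorical feature value with the label counts observed for it. A model can then train on a small set of rows featurized through the count table instead of the full raw history. Binary labels and the names `build_count_table` and `featurize` are assumptions made for the example.

```python
from collections import defaultdict


def build_count_table(rows):
    """Aggregate historical (feature_value, label) pairs into label counts.

    rows: iterable of (value, label) pairs with label in {0, 1}.
    Returns a mapping value -> [count of label 0, count of label 1].
    """
    table = defaultdict(lambda: [0, 0])
    for value, label in rows:
        table[value][label] += 1
    return table


def featurize(value, table):
    """Map a raw categorical value to [negatives, positives, positive rate]."""
    neg, pos = table.get(value, (0, 0))  # unseen values get zero counts
    total = neg + pos
    rate = pos / total if total else 0.0
    return [neg, pos, rate]


history = [("ad1", 1), ("ad1", 0), ("ad1", 1), ("ad2", 0)]
table = build_count_table(history)
seen = featurize("ad1", table)
unseen = featurize("ad9", table)
```

Only the compact count table and a small window of recent rows need to stay in accessible storage. The raw history behind the counts could be moved to a more protected store.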
- …