112 research outputs found
Re-Identification Attacks – A Systematic Literature Review
The publication of increasing amounts of anonymised open source data has resulted in a worryingly rising number of successful re-identification attacks. This has a number of privacy and security implications on both an individual and a corporate level. This paper uses a Systematic Literature Review to investigate the depth and extent of this problem as reported in peer-reviewed literature. Using a detailed protocol, seven research portals were explored and 10,873 database entries were searched, from which a subset of 220 papers were selected for further review. From this total, 55 papers were judged to be within scope and were included in the final review. The main review findings are that 72.7% of all successful re-identification attacks have taken place since 2009. Most attacks use multiple datasets. The majority of them have taken place on global datasets such as social networking data, and have been conducted by US-based researchers. Furthermore, the number of datasets used can itself be used as an attribute. Because privacy breaches have security, policy and legal implications (e.g. data protection, Safe Harbor etc.), the work highlights the need for new and improved anonymisation techniques or, indeed, a fresh approach to open source publishing.
Mining Frequent Graph Patterns with Differential Privacy
Discovering frequent graph patterns in a graph database offers valuable information in a variety of applications. However, if the graph dataset contains sensitive data of individuals, such as mobile phone-call graphs and web-click graphs, releasing discovered frequent patterns may present a threat to the privacy of individuals. Differential privacy has recently emerged as the de facto standard for private data analysis due to its provable privacy guarantee. In this paper we propose the first differentially private algorithm for mining frequent graph patterns. We first show that previous techniques on differentially private discovery of frequent itemsets cannot be applied to mining frequent graph patterns, owing to the inherent complexity of handling structural information in graphs. We then address this challenge by proposing a Markov Chain Monte Carlo (MCMC) sampling based algorithm. Unlike previous work on frequent itemset mining, our techniques do not rely on the output of a non-private mining algorithm. Instead, we observe that both frequent graph pattern mining and the guarantee of differential privacy can be unified into an MCMC sampling framework. In addition, we establish the privacy and utility guarantees of our algorithm and propose an efficient neighboring pattern counting technique. Experimental results show that the proposed algorithm is able to output frequent patterns with good precision.
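The idea of unifying pattern selection and the privacy guarantee inside one sampler can be illustrated with a toy Metropolis-Hastings chain whose stationary distribution is that of the exponential mechanism over candidate patterns. This is a minimal sketch only: the function name, uniform proposal, and sensitivity parameter are illustrative assumptions, and the paper's actual algorithm operates over the graph-pattern space with a dedicated neighboring-pattern counting technique.

```python
import math
import random

def mcmc_sample_pattern(candidates, support, epsilon, steps=1000,
                        sensitivity=1.0, rng=None):
    """Metropolis-Hastings sketch of the exponential mechanism: draw a
    pattern with probability proportional to
    exp(epsilon * support / (2 * sensitivity)).
    Illustrative toy, not the paper's graph-mining algorithm."""
    rng = rng or random.Random(0)
    current = rng.choice(candidates)
    for _ in range(steps):
        proposal = rng.choice(candidates)  # symmetric uniform proposal
        # log acceptance ratio from the unnormalised utility weights
        log_a = epsilon * (support[proposal] - support[current]) / (2 * sensitivity)
        if math.log(rng.random()) < log_a:
            current = proposal
    return current
```

With a large privacy budget the chain concentrates on high-support patterns; with a small budget the output distribution flattens, which is where the privacy/utility trade-off shows up.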
GLOVE: towards privacy-preserving publishing of record-level-truthful mobile phone trajectories
Datasets of mobile phone trajectories collected by network operators offer an unprecedented opportunity to discover new knowledge from the activity of populations of millions of subscribers. However, publishing such trajectories also raises significant privacy concerns, as they contain personal data in the form of individual movement patterns. Privacy risks induce network operators to enforce restrictive confidentiality agreements on the rare occasions when they grant access to collected trajectories, whereas a less constrained circulation of these data would fuel research and enable reproducibility in many disciplines. In this work, we contribute a building block toward the design of privacy-preserving datasets of mobile phone trajectories that are truthful at the record level. We present GLOVE, an algorithm that implements k-anonymity, hence solving the crucial unicity problem that affects this type of data, while ensuring that the anonymized trajectories correspond to real-life users. GLOVE builds on original insights about the root causes behind the undesirable unicity of mobile phone trajectories, and leverages generalization and suppression to remove them. Proof-of-concept validations with large-scale real-world datasets demonstrate that the approach adopted by GLOVE preserves a substantial level of accuracy in the data, higher than that granted by previous methodologies. This work was supported by the Atracción de Talento Investigador program of the Comunidad de Madrid under Grant No. 2019-T1/TIC-16037 NetSense.
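Generalization-based k-anonymity for trajectories can be sketched as greedily grouping at least k users and coarsening each sample to the group's bounding interval, so every published record is shared by k users and no trajectory is unique. The following toy treats trajectories as fixed-length sequences of one-dimensional positions; GLOVE itself works on richer spatiotemporal samples and drives generalization and suppression from a unicity analysis, so everything here is an illustrative assumption.

```python
def generalise(group):
    """Replace each time slot's exact positions by the group's bounding interval."""
    length = min(len(t) for t in group)
    return [(min(t[i] for t in group), max(t[i] for t in group))
            for i in range(length)]

def k_anonymise(trajectories, k):
    """Greedy generalisation-based k-anonymity for fixed-length 1-D
    trajectories: every published record is an interval sequence shared by
    at least k users. Toy sketch, not the GLOVE algorithm."""
    remaining = list(trajectories)
    published = []
    while len(remaining) >= k:
        group = [remaining.pop(0)]
        while len(group) < k:
            # pick the trajectory whose merge widens the intervals the least
            best = min(remaining, key=lambda t: sum(
                max(x[i] for x in group + [t]) - min(x[i] for x in group + [t])
                for i in range(len(group[0]))))
            remaining.remove(best)
            group.append(best)
        published.append((generalise(group), len(group)))
    # leftovers that cannot form a k-group are suppressed
    return published
```

The merge-cost heuristic (total interval width) stands in for the accuracy loss that a real scheme would try to minimize; wider intervals mean less useful published data.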
CASTLEGUARD: anonymised data streams with guaranteed differential privacy
Data streams are commonly used by data controllers to outsource the processing of real-time data to third-party data processors. Data protection legislation and best practice in data management support the view that data controllers are responsible for providing a guarantee of privacy for user data contained within published data streams. Continuously Anonymising STreaming data via adaptive cLustEring (CASTLE) is an established method for anonymising data streams with a guarantee of k-anonymity. However, k-anonymity has been shown to be a weak privacy guarantee that has vulnerabilities in practical applications. In this paper we propose Continuously Anonymising STreaming data via adaptive cLustEring with GUARanteed Differential privacy (CASTLEGUARD), a data stream anonymisation algorithm that provides a reliable guarantee of k-anonymity, l-diversity and differential privacy to data subjects. We analyse CASTLEGUARD to show that, through safe k-anonymisation and β-sampling, the proposed approach satisfies differentially private k-anonymity. Further, we demonstrate the efficacy of the approach in the context of machine learning, presenting experimental analysis to demonstrate that it can be used to protect the individual privacy of users whilst maintaining the utility of a data stream.
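The role of β-sampling can be illustrated in a few lines: each incoming tuple enters the anonymiser independently with probability β, and records are only ever released inside generalised groups of at least k. This is a deliberately simplified sketch under assumed names and a fixed-batch grouping; CASTLEGUARD's actual pipeline uses adaptive clustering, delay constraints and l-diversity on top of the sampling step.

```python
import random

def beta_sampled_stream(stream, beta, rng=None):
    """Bernoulli (beta-)sampling of a data stream: each record enters the
    anonymiser independently with probability beta. The randomness of this
    step is what provides the privacy amplification alluded to in the
    abstract. Illustrative sketch only."""
    rng = rng or random.Random(0)
    for record in stream:
        if rng.random() < beta:
            yield record

def anonymise_stream(stream, k, beta, rng=None):
    """Beta-sample, then release numeric records only in generalised
    batches of size k (each record replaced by the batch's bounding
    interval). A toy stand-in for CASTLE-style adaptive clustering."""
    buffer = []
    for record in beta_sampled_stream(stream, beta, rng):
        buffer.append(record)
        if len(buffer) == k:
            lo, hi = min(buffer), max(buffer)
            yield [(lo, hi)] * k  # one generalised value per buffered record
            buffer = []
```

Records left in the buffer when the stream ends are simply dropped here; a production anonymiser would instead bound how long any tuple may wait before release.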
Distribution-Agnostic Database De-Anonymization Under Synchronization Errors
There has recently been increased scientific interest in the de-anonymization of users in anonymized databases containing user-level microdata via multifarious matching strategies utilizing publicly available correlated data. Existing literature has either emphasized practical aspects, where the underlying data distribution is not required but theoretical guarantees are limited or absent, or theoretical aspects under the assumption of complete availability of the underlying distributions. In this work, we take a step towards reconciling these two lines of work by providing theoretical guarantees for the de-anonymization of random correlated databases without prior knowledge of the data distribution. Motivated by time-indexed microdata, we consider database de-anonymization under both synchronization errors (column repetitions) and obfuscation (noise). By modifying the previously used replica detection algorithm to accommodate the unknown underlying distribution, proposing a new seeded deletion detection algorithm, and employing statistical and information-theoretic tools, we derive sufficient conditions on the database growth rate for successful matching. Our findings demonstrate that a double-logarithmic seed size relative to row size ensures successful deletion detection. More importantly, we show that the derived sufficient conditions are the same as in the distribution-aware setting, negating any asymptotic loss of performance due to unknown underlying distributions.
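The flavour of distribution-agnostic matching can be conveyed with a toy: pair each anonymised row with the public row whose empirical value histogram is closest in L1 distance, using no knowledge of the generating distribution. This is only a stand-in for the matching step; the paper's algorithms additionally handle column repetitions and noise with seeded detection, and come with the guarantees described above.

```python
def match_rows(anon_db, public_db):
    """Distribution-agnostic row matching sketch: nearest neighbour over
    per-row empirical histograms (L1 distance). Toy illustration, not the
    paper's algorithm."""
    def hist(row):
        h = {}
        for v in row:
            h[v] = h.get(v, 0) + 1 / len(row)
        return h

    def l1(h1, h2):
        keys = set(h1) | set(h2)
        return sum(abs(h1.get(x, 0) - h2.get(x, 0)) for x in keys)

    matches = {}
    for i, row in enumerate(anon_db):
        h = hist(row)
        matches[i] = min(range(len(public_db)),
                         key=lambda j: l1(h, hist(public_db[j])))
    return matches
```

Because histograms are empirical, a repeated column (a synchronization error) only perturbs the per-row distribution slightly, which is why histogram-based matching can still succeed on such data.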
Fast Differentially Private Matrix Factorization
Differentially private collaborative filtering is a challenging task, both in terms of accuracy and speed. We present a simple algorithm that is provably differentially private while offering good performance, using a novel connection of differential privacy to Bayesian posterior sampling via Stochastic Gradient Langevin Dynamics. Due to its simplicity, the algorithm lends itself to efficient implementation. By careful systems design and by exploiting the power-law behavior of the data to maximize CPU cache bandwidth, we are able to generate 1024-dimensional models at a rate of 8.5 million recommendations per second on a single PC.
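The core update behind the posterior-sampling connection can be sketched directly: an SGD step on a sampled rating, perturbed with Gaussian noise whose variance matches the step size, so the iterates approximately sample from a Bayesian posterior over the factor matrices. All parameter choices below are illustrative assumptions, not the paper's tuned system, and the toy ignores the minibatching and cache-aware layout the abstract credits for its speed.

```python
import math
import random

def sgld_mf(ratings, n_users, n_items, d=4, eta=0.05, reg=0.01,
            steps=20000, temperature=1.0, rng=None):
    """Matrix factorisation trained with Stochastic Gradient Langevin
    Dynamics: each stochastic gradient step is perturbed with Gaussian noise
    of variance eta * temperature, so iterates approximately sample from the
    posterior. Setting temperature=0 recovers plain SGD. Toy sketch."""
    rng = rng or random.Random(0)
    U = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(n_items)]
    noise_std = math.sqrt(eta * temperature)
    for _ in range(steps):
        u, i, r = ratings[rng.randrange(len(ratings))]
        err = sum(U[u][f] * V[i][f] for f in range(d)) - r
        for f in range(d):
            # gradients of the squared error plus L2 regularisation
            gu = err * V[i][f] + reg * U[u][f]
            gv = err * U[u][f] + reg * V[i][f]
            # Langevin update: half-step gradient descent plus injected noise
            U[u][f] += -0.5 * eta * gu + rng.gauss(0, noise_std)
            V[i][f] += -0.5 * eta * gv + rng.gauss(0, noise_std)
    return U, V
```

The injected noise is what makes the trajectory a posterior sampler rather than an optimizer; the privacy analysis in the paper rests on that sampling view, not on the noiseless SGD limit used in the test below.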
- …