Quantifying Privacy Loss of Human Mobility Graph Topology
Human mobility is often represented as a mobility network, or graph, with nodes representing places of significance which an individual visits, such as their home, work, and places of social amenity, and edge weights corresponding to probability estimates of movements between these places. Previous research has shown that individuals can be identified by a small number of geolocated nodes in their mobility network, rendering mobility trace anonymization a hard task. In this paper we build on prior work and demonstrate that even when all location and timestamp information is removed from nodes, the graph topology of an individual mobility network is itself often uniquely identifying. Further, we observe that a mobility network is often unique even when only a small number of the most popular nodes and edges are considered. We evaluate our approach using a large dataset of cell-tower location traces from 1,500 smartphone handsets with a mean duration of 430 days. We process the data to derive the top-N places visited by the device in each trace, and find that 93% of traces have a unique top-10 mobility network, and all traces are unique when considering top-15 mobility networks. Since mobility patterns, and therefore mobility networks, vary over time for an individual, we use graph kernel distance functions to determine whether two mobility networks, taken at different points in time, represent the same individual. We then show that our distance metrics, while imperfect predictors, perform significantly better than a random strategy, and that our approach therefore represents a significant loss in privacy.
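To make the uniqueness check described above concrete, here is a minimal Python sketch that builds top-N mobility networks from visit sequences and counts how many are topologically unique. It uses networkx's Weisfeiler-Lehman graph hash as a convenient stand-in for the paper's graph kernel machinery (an assumption; the paper uses graph kernel distance functions), and the `traces` structure is hypothetical.

```python
# Sketch: check how many top-N mobility networks are topologically unique.
# Assumes `traces` maps a device id to an ordered list of visited place ids;
# the Weisfeiler-Lehman hash stands in for the paper's graph kernels.
from collections import Counter
import networkx as nx

def top_n_mobility_network(visits, n=10):
    """Build a directed graph over the n most-visited places in a trace."""
    top = {p for p, _ in Counter(visits).most_common(n)}
    g = nx.DiGraph()
    g.add_nodes_from(top)
    for a, b in zip(visits, visits[1:]):
        if a in top and b in top and a != b:
            g.add_edge(a, b)
    return g

def uniqueness_rate(traces, n=10):
    """Fraction of traces whose unlabeled top-n topology is unique."""
    hashes = [nx.weisfeiler_lehman_graph_hash(top_n_mobility_network(v, n))
              for v in traces.values()]
    counts = Counter(hashes)
    return sum(counts[h] == 1 for h in hashes) / len(hashes)
```

Because the hash ignores node labels, two traces collide only if their movement structures match, which mirrors the paper's finding that topology alone is often identifying.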
Protecting privacy of semantic trajectory
The growing ubiquity of GPS-enabled devices in everyday life has made large-scale collection of trajectories feasible, providing ever-growing opportunities for human movement analysis. However, publishing this vulnerable data is accompanied by increasing concerns about individuals' geoprivacy. This thesis has two objectives: (1) propose a privacy protection framework for semantic trajectories and (2) develop a Python toolbox in the ArcGIS Pro environment that enables non-expert users to anonymize trajectory data. The former aims to prevent re-identification of users by an adversary who knows their important locations or any random spatiotemporal points, by swapping their important locations to new locations with the same semantics and unlinking the users from their trajectories. This is accomplished by converting GPS points into sequences of visited meaningful locations and moves, and by integrating several anonymization techniques. The second component of this thesis implements privacy protection so that even users without deep knowledge of anonymization or coding skills can anonymize their data, by offering an all-in-one toolbox. By proposing and implementing this framework and toolbox, we hope that trajectory privacy will be better protected in research.
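A minimal sketch of the location-swapping idea described above follows. The `stops` and `candidates` structures are hypothetical illustrations, not the thesis's actual toolbox API: each stop carries a semantic tag, and important stops are replaced with a different real place sharing the same semantics.

```python
# Sketch: swap each important stop to a different location with the same
# semantic tag (e.g., another "restaurant"), preserving trajectory semantics.
# `stops` and `candidates` are hypothetical structures; assumes each tag has
# at least two candidate places to choose from.
import random

def swap_important_stops(stops, candidates, important_tags):
    """stops: list of (lat, lon, tag); candidates: tag -> list of (lat, lon)."""
    anonymized = []
    for lat, lon, tag in stops:
        if tag in important_tags:
            # pick a different real place sharing the same semantics
            options = [c for c in candidates[tag] if c != (lat, lon)]
            lat, lon = random.choice(options)
        anonymized.append((lat, lon, tag))
    return anonymized
```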
Anonymization of Event Logs for Network Security Monitoring
A managed security service provider (MSSP) must collect security event logs from their customers' networks for monitoring and cybersecurity protection. These logs need to be processed by the MSSP before being displayed to the security operation center (SOC) analysts. Employees generate event logs during their working hours at the customers' sites. One challenge is that the collected event logs contain personally identifiable information (PII), visible in clear text to the SOC analysts or to any user with access to the SIEM platform.
We explore how pseudonymization can be applied to security event logs to help protect individuals' identities from the SOC analysts while preserving data utility when possible. We compare the impact of using different pseudonymization functions on sensitive information or PII. Non-deterministic methods provide a higher level of privacy but reduce the utility of the data.
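The privacy-utility trade-off between deterministic and non-deterministic pseudonymization can be illustrated with a short sketch. This is a generic example, not the thesis's benchmarked functions; key management is deliberately simplified.

```python
# Sketch contrasting deterministic and non-deterministic pseudonymization of a
# PII field (e.g., a username). Deterministic HMAC keeps linkability across
# log lines; random pseudonyms do not. Key handling is simplified on purpose.
import hmac, hashlib, secrets

SECRET_KEY = secrets.token_bytes(32)  # in practice, managed by the MSSP

def pseudonymize_deterministic(value: str) -> str:
    """Same input -> same pseudonym, so analysts can still correlate events."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def pseudonymize_random(value: str) -> str:
    """Fresh pseudonym per occurrence: more privacy, no cross-event linkage."""
    return secrets.token_hex(8)

user = "alice@customer.example"
assert pseudonymize_deterministic(user) == pseudonymize_deterministic(user)
assert pseudonymize_random(user) != pseudonymize_random(user)  # almost surely
```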
Our contribution in this thesis is threefold. First, we study available architectures with different threat models, including their strengths and weaknesses. Second, we study pseudonymization functions and their application to PII fields; we benchmark them individually, as well as in our experimental platform. Last, we gather valuable feedback and lessons from SOC analysts based on their experience.
Existing works [43, 44, 48, 39] are generally restricted to the anonymization of IP traces, which is only one part of the SOC analysts' investigation of PCAP files. In one of the closest works [47], the authors provide useful, practical anonymization methods for IP addresses, ports, and raw logs.
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method
Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing these data is recommended to comply with Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering step. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.
Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.
Results: The computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling-out and linkability attacks. Distributional and inferential similarity with the original data were high for all datasets.
Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. The formal and empirical tools specifically developed for this demonstration are a valuable contribution to the field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods.
Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
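To illustrate the two ingredients named in this abstract, here is a generic sketch of sequential CART-based synthesis followed by a distance-based filter. It is not the Open-CESP framework itself; it assumes a purely numeric pandas DataFrame `real`, and the `min_dist` threshold is an arbitrary placeholder.

```python
# Sketch: sequential CART synthesis plus a distance-based disclosure filter.
# Generic illustration only; assumes a purely numeric DataFrame `real`.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from scipy.spatial.distance import cdist

def synthesize(real: pd.DataFrame, seed=0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cols = list(real.columns)
    # bootstrap the first column, then synthesize each next column from a CART
    synth = pd.DataFrame({cols[0]: rng.choice(real[cols[0]], size=len(real))})
    for i, col in enumerate(cols[1:], start=1):
        tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=seed)
        tree.fit(real[cols[:i]], real[col])
        # sample real values from the leaf each synthetic row falls into
        leaves = tree.apply(synth[cols[:i]])
        real_leaves = tree.apply(real[cols[:i]])
        synth[col] = [rng.choice(real[col].to_numpy()[real_leaves == leaf])
                      for leaf in leaves]
    return synth

def filter_too_close(real, synth, min_dist=0.5):
    """Drop synthetic rows closer than min_dist to any real row."""
    d = cdist(synth.to_numpy(), real.to_numpy()).min(axis=1)
    return synth[d >= min_dist]
```

The filter step is what guards against releasing near-copies of real records, which is the disclosure risk the distance-based filtering in the framework targets.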
When the signal is in the noise: Exploiting Diffix's Sticky Noise
Anonymized data is highly valuable to both businesses and researchers. A large body of research has, however, shown the strong limits of the de-identification release-and-forget model, where data is anonymized and then shared. This has led to the development of privacy-preserving query-based systems. Based on the idea of "sticky noise", Diffix has recently been proposed as a novel query-based mechanism that, on its own, satisfies the EU Article 29 Working Party's definition of anonymization. According to its authors, Diffix adds less noise to answers than solutions based on differential privacy while allowing an unlimited number of queries.
This paper presents a new class of noise-exploitation attacks, which exploit the noise added by the system to infer private information about individuals in the dataset. Our first attack, a differential attack, uses samples extracted from Diffix in a likelihood ratio test to discriminate between two probability distributions. We show that using this attack against a synthetic best-case dataset allows us to infer private information with 89.4% accuracy using only 5 attributes. Our second attack, a cloning attack, uses dummy conditions whose effect on the query output depends strongly on the value of the private attribute. Using this attack on four real-world datasets, we show that we can infer private attributes of at least 93% of the users in the dataset with accuracy between 93.3% and 97.1%, issuing a median of 304 queries per user. We show how to optimize this attack, targeting 55.4% of the users and achieving 91.7% accuracy using a maximum of only 32 queries per user.
Our attacks demonstrate that adding data-dependent noise, as done by Diffix, is not sufficient to prevent inference of private attributes. We furthermore argue that Diffix alone fails to satisfy the Art. 29 WP's definition of anonymization. [...]
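The likelihood-ratio idea behind the differential attack can be reduced to a textbook hypothesis test: decide which of two noise distributions produced the observed query answers. The sketch below uses plain Gaussian noise as a simplification; Diffix's actual noise layers are more structured, and the means and sample count here are illustrative.

```python
# Sketch: likelihood-ratio test between two hypotheses about noisy answers.
# Gaussian noise is a simplification of Diffix's sticky-noise layers.
import numpy as np
from scipy.stats import norm

def llr(samples, mu0, mu1, sigma=1.0):
    """Log-likelihood ratio of H1 (mean mu1) vs H0 (mean mu0)."""
    return (norm.logpdf(samples, mu1, sigma).sum()
            - norm.logpdf(samples, mu0, sigma).sum())

rng = np.random.default_rng(0)
# e.g., H0: target not counted in the bucket (mean 10); H1: counted (mean 11)
observed = rng.normal(11.0, 1.0, size=5)  # 5 noisy answers, one per attribute
guess_target_counted = llr(observed, mu0=10.0, mu1=11.0) > 0
print(guess_target_counted)
```

With more independent noisy samples, the two hypotheses separate further, which is why accuracy in the paper grows with the number of attributes queried.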
Data Summarizations for Scalable, Robust and Privacy-Aware Learning in High Dimensions
The advent of large-scale datasets has offered unprecedented amounts of information for building statistically powerful machines, but, at the same time, also introduced a remarkable computational challenge: how can we efficiently process massive data? This thesis presents a suite of data reduction methods that make learning algorithms scale on large datasets by extracting a succinct, model-specific representation that summarizes the full data collection: a coreset. Our frameworks support datasets of arbitrary dimensionality by design, and can be used for general-purpose Bayesian inference under real-world constraints, including privacy preservation and robustness to outliers, encompassing diverse uncertainty-aware data analysis tasks such as density estimation, classification and regression.
We motivate the necessity for novel data reduction techniques in the first place by developing a reidentification attack on coarsened representations of private behavioural data. Analysing longitudinal records of human mobility, we detect privacy-revealing structural patterns that remain preserved in reduced graph representations of individuals' information of manageable size. These unique patterns enable mounting linkage attacks via structural similarity computations on longitudinal mobility traces, revealing an overlooked, yet real, privacy threat.
We then propose a scalable variational inference scheme for approximating posteriors on large datasets via learnable weighted pseudodata, termed pseudocoresets. We show that the use of pseudodata enables overcoming the constraints on minimum summary size for a given approximation quality that data dimensionality imposes on all existing Bayesian coreset constructions. Moreover, it allows us to develop a pseudocoreset-based summarization scheme that satisfies the standard framework of differential privacy by construction; in this way, we can release reduced-size privacy-preserving representations of sensitive datasets that are amenable to arbitrary post-processing.
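A toy illustration of the coreset idea underlying this line of work: a small weighted subset whose weighted log-likelihood approximates the full-data log-likelihood. Uniform subsampling with fixed weights stands in here for the optimized (pseudo)coreset constructions the thesis develops; the model and sizes are arbitrary.

```python
# Toy illustration: a weighted subset approximating the full-data
# log-likelihood. Uniform subsampling stands in for optimized coresets.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=100_000)

m = 500                                    # coreset size << len(data)
idx = rng.choice(len(data), size=m, replace=False)
weights = np.full(m, len(data) / m)        # each point stands for N/m points

theta = 2.1                                # any candidate parameter value
full_ll = norm.logpdf(data, theta, 1.0).sum()
coreset_ll = (weights * norm.logpdf(data[idx], theta, 1.0)).sum()
print(full_ll, coreset_ll)                 # close in relative terms
```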
Subsequently, we consider summarizations for large-scale Bayesian inference in scenarios where observed datapoints depart from the statistical assumptions of our model. Using robust divergences, we develop a method for constructing coresets resilient to model misspecification. Crucially, this method is able to automatically discard outliers from the generated data summaries. We thus deliver robustified, scalable representations for inference that are suitable for applications involving contaminated and unreliable data sources.
We demonstrate the performance of the proposed summarization techniques on multiple parametric statistical models and on diverse simulated and real-world datasets, from music genre features to hospital readmission records, considering a wide range of data dimensionalities.
Nokia Bell Labs; Lundgren Fund; Darwin College, University of Cambridge; Department of Computer Science & Technology, University of Cambridge