Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization
This work investigates the effectiveness of different pseudonymization
techniques, ranging from rule-based substitutions to using pre-trained Large
Language Models (LLMs), on a variety of datasets and models used for two widely
used NLP tasks: text classification and summarization. Our work provides
crucial insights into the gaps between original and anonymized data (focusing
on the pseudonymization technique) and model quality and fosters future
research into higher-quality anonymization techniques to better balance the
trade-offs between data protection and utility preservation. We make our code,
pseudonymized datasets, and downstream models publicly available. (Comment: 10 pages. Accepted for the TrustNLP workshop at ACL202)
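The rule-based end of the pseudonymization spectrum studied in this work can be sketched as regex-driven substitution of PII spans with stable placeholder tokens (the patterns, labels, and example text below are illustrative assumptions, not the paper's actual pipeline, which also uses NER models and LLMs):

```python
import re
import itertools

# Hypothetical rule set: each (label, pattern) pair detects one PII type.
RULES = [
    ("PERSON", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")),   # naive full-name pattern
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]

def pseudonymize(text):
    """Replace matched PII with per-label numbered placeholders.

    The same surface string always maps to the same placeholder, so the
    mapping can be kept aside to make the process reversible.
    """
    mapping = {}
    counters = {label: itertools.count(1) for label, _ in RULES}
    for label, pattern in RULES:
        def repl(m, label=label):
            surface = m.group(0)
            if surface not in mapping:
                mapping[surface] = f"[{label}_{next(counters[label])}]"
            return mapping[surface]
        text = pattern.sub(repl, text)
    return text, mapping

out, mapping = pseudonymize("please forward this to John Smith at john@acme.test or 555-123-4567.")
print(out)  # -> "please forward this to [PERSON_1] at [EMAIL_1] or [PHONE_1]."
```

Keeping the surface-to-placeholder mapping separate is what distinguishes pseudonymization from irreversible anonymization: the downstream classifier or summarizer sees only placeholders, while a trusted party can still reverse them.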
When the signal is in the noise: Exploiting Diffix's Sticky Noise
Anonymized data is highly valuable to both businesses and researchers. A
large body of research has however shown the strong limits of the
de-identification release-and-forget model, where data is anonymized and
shared. This has led to the development of privacy-preserving query-based
systems. Based on the idea of "sticky noise", Diffix has recently been proposed
as a novel query-based mechanism that, on its own, satisfies the EU Article 29
Working Party's definition of anonymization. According to its authors, Diffix adds less
noise to answers than solutions based on differential privacy while allowing
for an unlimited number of queries.
This paper presents a new class of noise-exploitation attacks, exploiting the
noise added by the system to infer private information about individuals in the
dataset. Our first differential attack uses samples extracted from Diffix in a
likelihood ratio test to discriminate between two probability distributions. We
show that using this attack against a synthetic best-case dataset allows us to
infer private information with 89.4% accuracy using only 5 attributes. Our
second cloning attack uses dummy conditions whose effect on the query output
depends strongly on the value of the private attribute. Using
this attack on four real-world datasets, we show that we can infer private
attributes of at least 93% of the users in the dataset with accuracy between
93.3% and 97.1%, issuing a median of 304 queries per user. We show how to
optimize this attack, targeting 55.4% of the users and achieving 91.7%
accuracy, using a maximum of only 32 queries per user.
Our attacks demonstrate that adding data-dependent noise, as done by Diffix,
is not sufficient to prevent inference of private attributes. We furthermore
argue that Diffix alone fails to satisfy Art. 29 WP's definition of
anonymization. [...]
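The differential attack can be illustrated with a toy simulation (all parameters here are hypothetical and the noise model is a plain Gaussian, not Diffix's actual layered sticky noise): repeated noisy answers to the same query are fed to a likelihood-ratio test that decides which of two candidate true counts generated them, revealing the target's private attribute.

```python
import random

# Toy model: the true count of a query differs by 1 depending on the
# target's private attribute, and the system returns it with zero-mean
# Gaussian noise added (illustrative parameters, not Diffix's).
NOISE_SD = 2.0

def noisy_count(true_count, rng):
    return true_count + rng.gauss(0, NOISE_SD)

def log_likelihood(samples, mean):
    # Gaussian log-likelihood up to an additive constant.
    return sum(-((x - mean) ** 2) / (2 * NOISE_SD ** 2) for x in samples)

def infer_attribute(samples, count_if_0, count_if_1):
    """Likelihood-ratio test: pick the hypothesis that better explains the samples."""
    return int(log_likelihood(samples, count_if_1) > log_likelihood(samples, count_if_0))

rng = random.Random(0)
trials = 1000
correct = 0
for _ in range(trials):
    secret = rng.randint(0, 1)  # the private attribute we attack
    samples = [noisy_count(100 + secret, rng) for _ in range(25)]  # repeated queries
    correct += infer_attribute(samples, 100, 101) == secret
print(f"inference accuracy: {correct / trials:.1%}")
```

Even though each individual answer is noisy, averaging 25 samples shrinks the effective noise well below the one-count gap between the hypotheses, which is the core of the noise-exploitation argument.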
Protecting privacy of semantic trajectory
The growing ubiquity of GPS-enabled devices in everyday life has made large-scale collection of trajectories feasible, providing ever-growing opportunities for human movement analysis. However, publishing these vulnerable data raises increasing concerns about individuals' geoprivacy. This thesis has two objectives: (1) propose a privacy protection framework for semantic trajectories and (2) develop a Python toolbox in the ArcGIS Pro environment that enables non-expert users to anonymize trajectory data. The former aims to prevent users' re-identification by adversaries who know their important locations or any random spatiotemporal points, by swapping their important locations to new locations with the same semantics and unlinking the users from their trajectories. This is accomplished by converting GPS points into sequences of visited meaningful locations and moves and integrating several anonymization techniques. The second component of this thesis implements privacy protection such that even users without deep knowledge of anonymization or coding skills can anonymize their data, by offering an all-in-one toolbox. By proposing and implementing this framework and toolbox, we hope that trajectory privacy will be better protected in research
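The core swapping idea can be sketched in a few lines (the place pools, coordinates, and function names below are made-up illustrations, not the thesis's actual framework): each sensitive stop is replaced by a different place of the same semantic category, so the semantic sequence survives while the coordinates no longer identify the user.

```python
import random

# Hypothetical pools of candidate places per semantic category.
PLACES = {
    "home": [(40.71, -74.00), (40.73, -73.99), (40.75, -73.98)],
    "cafe": [(40.72, -74.01), (40.74, -73.97)],
    "work": [(40.76, -73.99), (40.70, -74.02)],
}

def swap_trajectory(stops, rng):
    """stops: list of (semantic_label, (lat, lon)); returns an anonymized copy.

    Each stop is swapped to a *different* location with the same semantics,
    so a pattern like home -> cafe -> work is preserved but unlinked from
    the user's real coordinates.
    """
    out = []
    for label, coord in stops:
        candidates = [p for p in PLACES[label] if p != coord]
        out.append((label, rng.choice(candidates)))
    return out

rng = random.Random(1)
traj = [("home", (40.71, -74.00)), ("cafe", (40.72, -74.01)), ("work", (40.76, -73.99))]
print(swap_trajectory(traj, rng))
```

A real implementation would draw candidates from a geographic database of same-type POIs and also perturb timestamps; this sketch only shows the semantics-preserving substitution step.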
A holistic multi-purpose life logging framework
The paradigm of life-logging promises complementary assistance to the human memory by proposing an electronic memory. Life-logs are tools or systems which automatically record users' life events in digital format. In a technical sense, they are pervasive tools or systems which continuously sense and capture contextual information from the user's environment. A dataset is created from the collected
information, and some records of this dataset are worth preserving in the long term so that others, in future generations, can access them. Additionally, some parts are worth sharing with society, e.g. through social networks. Sharing this information benefits both users and society in many ways, such as augmenting users' social interaction, enabling group behavior studies, etc. However, in terms of individual privacy, life-log information is very sensitive, and privacy and security should be taken into account during the design of such a system.
Currently, life-logs are designed for specific purposes, such as memory augmentation, and are configured to work only with a predefined set of sensors; they are not flexible enough to accept new ones. Sensors are the core component of life-logs, and increasing the number of sensors makes more data available for acquisition. Moreover, a composition of multiple sensors' data provides better qualitative and quantitative information about users' status and their environment (context). Sensor openness thus benefits both users and communities by providing appropriate capabilities for multidisciplinary studies. For instance, users can configure sensors to monitor their health status for a specific period, after which they can change the system to use it for memory augmentation.
In this dissertation I propose a life-log framework which is open to the extension and configuration of its sensors. Openness and extensibility, which make the framework holistic and multi-purpose, are supported by a sensor classification and a flexible model for storing life-log information. The framework enables users to share their life-log information and supports the features required for life logging: digital forgetting, facilitated information retrieval (through annotation), long-term digital preservation, security, and privacy.
Anonymization of Event Logs for Network Security Monitoring
A managed security service provider (MSSP) must collect security event logs from
their customers' networks for monitoring and cybersecurity protection. These logs
need to be processed by the MSSP before being displayed to the security operation
center (SOC) analysts. Employees generate event logs during their working hours
at the customers' sites. One challenge is that the collected event logs contain
personally identifiable information (PII), visible in clear text to the SOC
analysts or any user with access to the SIEM platform.
We explore how pseudonymization can be applied to security event logs to help
protect individuals' identities from the SOC analysts while preserving data utility
when possible. We compare the impact of using different pseudonymization functions
on sensitive information or PII. Non-deterministic methods provide a higher level of
privacy but reduce the utility of the data.
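The deterministic/non-deterministic trade-off can be shown with two minimal pseudonymization functions for an IP field (the key, token sizes, and function names are assumptions for this sketch, not the thesis's benchmarked implementations):

```python
import hmac
import hashlib
import secrets

# Hypothetical key; in practice it would be held by the MSSP, never by
# the SOC analysts who see the pseudonymized logs.
KEY = b"demo-secret-key"

def pseudo_deterministic(ip: str) -> str:
    """Keyed hash (HMAC-SHA256, truncated): same input -> same token,
    so analysts can still correlate events from one host without seeing
    the real address."""
    return hmac.new(KEY, ip.encode(), hashlib.sha256).hexdigest()[:16]

def pseudo_random(ip: str) -> str:
    """Fresh random token per event: maximal privacy, but correlation
    across events is lost, which is the utility cost described above."""
    return secrets.token_hex(8)

ip = "192.0.2.17"
print(pseudo_deterministic(ip) == pseudo_deterministic(ip))  # True: linkable
print(pseudo_random(ip) == pseudo_random(ip))                # False: not linkable
```

Deterministic tokens remain vulnerable to frequency analysis and dictionary attacks on small input spaces such as IPv4, which is one reason non-deterministic functions score higher on privacy in such comparisons.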
Our contribution in this thesis is threefold. First, we study available architectures
with different threat models, including their strengths and weaknesses. Second, we
study pseudonymization functions and their application to PII fields; we benchmark
them individually, as well as in our experimental platform. Last, we obtain valuable
feedback and lessons from SOC analysts based on their experience.
Existing works [43, 44, 48, 39] are generally restricted to the anonymization of
IP traces, which is only one part of the SOC analysts' investigation of PCAP
files. In one of the closest works [47], the authors provide useful, practical
anonymization methods for IP addresses, ports, and raw logs.
Novel reversible text data de-identification techniques based on native data structures
Technological development in today's digital world has resulted in the collection and storage of large amounts of personal data. These data enable both direct services and non-direct activities, known as secondary use. The secondary use of data can improve decision-making, service experiences, and healthcare systems. However, the widespread reuse of personal data raises significant privacy and policy issues, especially for health-related information; these data may contain sensitive attributes, leading to privacy breaches if compromised. Legal systems establish laws to protect the privacy of personal data disclosed for secondary use. A well-known example is the General Data Protection Regulation (GDPR), which outlines a specific set of rules for sharing and storing personal data to protect individual privacy. The GDPR explicitly points to data de-identification, especially pseudonymization, as one measure that can help meet the requirements for the processing of personal data.
The literature on privacy preservation approaches has largely been developed in the field of data anonymization, where personal data are irreversibly removed or obfuscated and there is no means by which to recover an individual's identity if needed. By contrast, pseudonymization is a promising technique to protect privacy while enabling the recovery of de-identified data. Significantly, many existing approaches for pseudonymization were developed long before the GDPR requirements were established, and so they may fail to satisfy its provisions. Therefore, it is worthwhile to offer technical solutions to preserve privacy while supporting the legitimate use of data.
This thesis proposes a novel de-identification system for unstructured textual data, known as ARTPHIL, that generates de-identified data in compliance with the GDPR requirement for strong pseudonymization. The system was evaluated using the 2014 i2b2 testing data. It achieved a recall of 96.93% in detecting and encrypting personal health information, as specified under guidelines provided by the Health Insurance Portability and Accountability Act (HIPAA). The system uses a novel, lightweight cryptography algorithm, E-ART, to encrypt personal data cost-effectively and without compromising security. The main novelty of the E-ART algorithm is its use of the reflection property of a balanced binary tree data structure as a substitution method, instead of complex, multiple iterations. The performance and security of the proposed algorithm were compared to two symmetric encryption algorithms: the Advanced Encryption Standard (AES) and the Data Encryption Standard (DES). The security analysis showed comparable results, but the performance analysis indicated that E-ART had the shortest ciphertext and running time with comparable memory usage, which indicates the feasibility of using ARTPHIL for delay-sensitive or data-intensive applications.
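The reflection idea can be illustrated with a deliberately simplified toy (this is NOT the E-ART algorithm, only the underlying observation): mirroring a balanced binary search tree reverses its in-order traversal, so each symbol can be substituted by the symbol at its mirror-image position in the sorted alphabet, giving a self-inverse, fully reversible mapping.

```python
import string

# Mirror of a balanced BST over the alphabet = reversed in-order
# traversal, i.e. symbol at index i maps to index (n - 1 - i).
ALPHABET = string.ascii_lowercase
MIRROR = {c: ALPHABET[len(ALPHABET) - 1 - i] for i, c in enumerate(ALPHABET)}

def reflect(text: str) -> str:
    """Involution: applying it twice returns the original text.
    Characters outside the alphabet pass through unchanged."""
    return "".join(MIRROR.get(c, c) for c in text)

cipher = reflect("patient name")
print(cipher)           # -> "kzgrvmg mznv"
print(reflect(cipher))  # -> "patient name" (reversible)
```

A single static mirror like this is trivially breakable; the thesis's actual algorithm builds on the tree-reflection property with keyed, security-hardened constructions, which this sketch does not attempt to reproduce.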
Quantifying Privacy Loss of Human Mobility Graph Topology
Human mobility is often represented as a mobility network, or graph, with nodes representing places of significance which an individual visits, such as their home, work, places of social amenity, etc., and edge weights corresponding to probability estimates of movements between these places. Previous research has shown that individuals can be identified by a small number of geolocated nodes in their mobility network, rendering mobility trace anonymization a hard task. In this paper we build on prior work and demonstrate that even when all location and timestamp information is removed from nodes, the graph topology of an individual mobility network itself is often uniquely identifying. Further, we observe that a mobility network is often unique even when only a small number of the most popular nodes and edges are considered. We evaluate our approach using a large dataset of cell-tower location traces from 1,500 smartphone handsets with a mean duration of 430 days. We process the data to derive the top-N places visited by the device in the trace, and find that 93% of traces have a unique top-10 mobility network, and all traces are unique when considering top-15 mobility networks. Since mobility patterns, and therefore mobility networks for an individual, vary over time, we use graph kernel distance functions to determine whether two mobility networks, taken at different points in time, represent the same individual. We then show that our distance metrics, while imperfect predictors, perform significantly better than a random strategy, and therefore our approach represents a significant loss in privacy.
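The top-N mobility-network construction can be sketched as follows (the traces, function name, and frequency-rank relabeling are illustrative assumptions; the paper's actual canonicalization and graph-kernel comparison are more involved):

```python
from collections import Counter

def top_n_network(trace, n):
    """Build a label-free form of the top-n transition graph.

    Keep only transitions among the n most-visited places, strip the
    location labels, and relabel nodes by visit-frequency rank so two
    networks compare equal when their weighted topologies match.
    (Rank relabeling is a simplification; ties would need a proper
    canonical form.)
    """
    ranked = Counter(trace).most_common(n)
    rank = {place: i for i, (place, _) in enumerate(ranked)}  # 0 = most visited
    edges = Counter(
        (rank[a], rank[b])
        for a, b in zip(trace, trace[1:])
        if a in rank and b in rank
    )
    return frozenset(edges.items())

# Made-up traces: no coordinates or timestamps survive in the output,
# yet the anonymous topologies can still differ between individuals.
alice = ["home", "work", "home", "gym", "home", "work", "home"]
bob = ["flat", "office", "cafe", "flat", "office", "flat", "cafe"]
print(top_n_network(alice, 3) != top_n_network(bob, 3))  # True: topologies differ
```

This is exactly the privacy point of the paper: even after all labels are stripped, the weighted transition structure alone can act as a fingerprint.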