Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization
This work investigates the effectiveness of different pseudonymization
techniques, ranging from rule-based substitutions to using pre-trained Large
Language Models (LLMs), on a variety of datasets and models used for two widely
used NLP tasks: text classification and summarization. Our work provides
crucial insights into the gaps between original and anonymized data (focusing
on the pseudonymization technique) and model quality and fosters future
research into higher-quality anonymization techniques to better balance the
trade-offs between data protection and utility preservation. We make our code,
pseudonymized datasets, and downstream models publicly available. (Comment: 10 pages. Accepted for the TrustNLP workshop at ACL202)
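The rule-based end of the pseudonymization spectrum studied in this work can be sketched as regex-driven substitution of PII spans with stable placeholder tokens (the patterns, labels, and example text below are illustrative assumptions, not the paper's actual pipeline, which also uses NER models and LLMs):

```python
import re
import itertools

# Hypothetical rule set: each (label, pattern) pair detects one PII type.
RULES = [
    ("PERSON", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")),   # naive full-name pattern
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]

def pseudonymize(text):
    """Replace matched PII with per-label numbered placeholders.

    The same surface string always maps to the same placeholder, so the
    mapping can be kept aside to make the process reversible.
    """
    mapping = {}
    counters = {label: itertools.count(1) for label, _ in RULES}
    for label, pattern in RULES:
        def repl(m, label=label):
            surface = m.group(0)
            if surface not in mapping:
                mapping[surface] = f"[{label}_{next(counters[label])}]"
            return mapping[surface]
        text = pattern.sub(repl, text)
    return text, mapping

out, mapping = pseudonymize("please forward this to John Smith at john@acme.test or 555-123-4567.")
print(out)  # -> "please forward this to [PERSON_1] at [EMAIL_1] or [PHONE_1]."
```

Keeping the surface-to-placeholder mapping separate is what distinguishes pseudonymization from irreversible anonymization: the downstream classifier or summarizer sees only placeholders, while a trusted party can still reverse them.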
When the signal is in the noise: Exploiting Diffix's Sticky Noise
Anonymized data is highly valuable to both businesses and researchers. A
large body of research has however shown the strong limits of the
de-identification release-and-forget model, where data is anonymized and
shared. This has led to the development of privacy-preserving query-based
systems. Based on the idea of "sticky noise", Diffix has recently been proposed
as a novel query-based mechanism that, on its own, satisfies the EU Article 29
Working Party's definition of anonymization. According to its authors, Diffix adds less
noise to answers than solutions based on differential privacy while allowing
for an unlimited number of queries.
This paper presents a new class of noise-exploitation attacks, exploiting the
noise added by the system to infer private information about individuals in the
dataset. Our first differential attack uses samples extracted from Diffix in a
likelihood ratio test to discriminate between two probability distributions. We
show that using this attack against a synthetic best-case dataset allows us to
infer private information with 89.4% accuracy using only 5 attributes. Our
second cloning attack uses dummy conditions whose effect on the query output
depends strongly on the value of the private attribute. Using
this attack on four real-world datasets, we show that we can infer private
attributes of at least 93% of the users in the dataset with accuracy between
93.3% and 97.1%, issuing a median of 304 queries per user. We show how to
optimize this attack, targeting 55.4% of the users and achieving 91.7%
accuracy, using a maximum of only 32 queries per user.
Our attacks demonstrate that adding data-dependent noise, as done by Diffix,
is not sufficient to prevent inference of private attributes. We furthermore
argue that Diffix alone fails to satisfy Art. 29 WP's definition of
anonymization. [...]
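The differential attack can be illustrated with a toy simulation (all parameters here are hypothetical and the noise model is a plain Gaussian, not Diffix's actual layered sticky noise): repeated noisy answers to the same query are fed to a likelihood-ratio test that decides which of two candidate true counts generated them, revealing the target's private attribute.

```python
import random

# Toy model: the true count of a query differs by 1 depending on the
# target's private attribute, and the system returns it with zero-mean
# Gaussian noise added (illustrative parameters, not Diffix's).
NOISE_SD = 2.0

def noisy_count(true_count, rng):
    return true_count + rng.gauss(0, NOISE_SD)

def log_likelihood(samples, mean):
    # Gaussian log-likelihood up to an additive constant.
    return sum(-((x - mean) ** 2) / (2 * NOISE_SD ** 2) for x in samples)

def infer_attribute(samples, count_if_0, count_if_1):
    """Likelihood-ratio test: pick the hypothesis that better explains the samples."""
    return int(log_likelihood(samples, count_if_1) > log_likelihood(samples, count_if_0))

rng = random.Random(0)
trials = 1000
correct = 0
for _ in range(trials):
    secret = rng.randint(0, 1)  # the private attribute we attack
    samples = [noisy_count(100 + secret, rng) for _ in range(25)]  # repeated queries
    correct += infer_attribute(samples, 100, 101) == secret
print(f"inference accuracy: {correct / trials:.1%}")
```

Even though each individual answer is noisy, averaging 25 samples shrinks the effective noise well below the one-count gap between the hypotheses, which is the core of the noise-exploitation argument.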
Protecting privacy of semantic trajectory
The growing ubiquity of GPS-enabled devices in everyday life has made large-scale collection of trajectories feasible, providing ever-growing opportunities for human movement analysis. However, publishing these vulnerable data raises increasing concerns about individuals' geoprivacy. This thesis has two objectives: (1) propose a privacy protection framework for semantic trajectories and (2) develop a Python toolbox in the ArcGIS Pro environment that enables non-expert users to anonymize trajectory data. The former aims to prevent users' re-identification by adversaries who know their important locations or any random spatiotemporal points, by swapping their important locations to new locations with the same semantics and unlinking the users from their trajectories. This is accomplished by converting GPS points into sequences of visited meaningful locations and moves and integrating several anonymization techniques. The second component of this thesis implements privacy protection such that even users without deep knowledge of anonymization or coding skills can anonymize their data, by offering an all-in-one toolbox. By proposing and implementing this framework and toolbox, we hope that trajectory privacy will be better protected in research
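The core swapping idea can be sketched in a few lines (the place pools, coordinates, and function names below are made-up illustrations, not the thesis's actual framework): each sensitive stop is replaced by a different place of the same semantic category, so the semantic sequence survives while the coordinates no longer identify the user.

```python
import random

# Hypothetical pools of candidate places per semantic category.
PLACES = {
    "home": [(40.71, -74.00), (40.73, -73.99), (40.75, -73.98)],
    "cafe": [(40.72, -74.01), (40.74, -73.97)],
    "work": [(40.76, -73.99), (40.70, -74.02)],
}

def swap_trajectory(stops, rng):
    """stops: list of (semantic_label, (lat, lon)); returns an anonymized copy.

    Each stop is swapped to a *different* location with the same semantics,
    so a pattern like home -> cafe -> work is preserved but unlinked from
    the user's real coordinates.
    """
    out = []
    for label, coord in stops:
        candidates = [p for p in PLACES[label] if p != coord]
        out.append((label, rng.choice(candidates)))
    return out

rng = random.Random(1)
traj = [("home", (40.71, -74.00)), ("cafe", (40.72, -74.01)), ("work", (40.76, -73.99))]
print(swap_trajectory(traj, rng))
```

A real implementation would draw candidates from a geographic database of same-type POIs and also perturb timestamps; this sketch only shows the semantics-preserving substitution step.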
A holistic multi-purpose life logging framework
The paradigm of life-logging promises complementary assistance to the human memory by proposing an electronic memory. Life-logs are tools or systems which automatically record users' life events in digital format. In a technical sense, they are pervasive tools or systems which continuously sense and capture contextual information from the user's environment. A dataset is created from the collected
information, and some records of this dataset are worth preserving in the long term so that others, in future generations, can access them. Additionally, some parts are worth sharing with society, e.g. through social networks. Sharing this information benefits both users and society in many ways, such as augmenting users' social interaction, enabling group behavior studies, etc. However, in terms of individual privacy, life-log information is very sensitive, and privacy and security should be taken into account during the design of such a system.
Currently, life-logs are designed for specific purposes, such as memory augmentation, and are configured to work only with a predefined set of sensors; they are not flexible enough to accept new ones. Sensors are the core component of life-logs, and increasing the number of sensors makes more data available for acquisition. Moreover, a composition of multiple sensors' data provides better qualitative and quantitative information about users' status and their environment (context). Sensor openness thus benefits both users and communities by providing appropriate capabilities for multidisciplinary studies. For instance, users can configure sensors to monitor their health status for a specific period, after which they can change the system to use it for memory augmentation.
In this dissertation I propose a life-log framework which is open to the extension and configuration of its sensors. Openness and extensibility, which make the framework holistic and multi-purpose, are supported by a sensor classification and a flexible model for storing life-log information. The framework enables users to share their life-log information and supports the features required for life logging: digital forgetting, facilitated information retrieval (through annotation), long-term digital preservation, security, and privacy.
Anonymization of Event Logs for Network Security Monitoring
A managed security service provider (MSSP) must collect security event logs from
their customers' networks for monitoring and cybersecurity protection. These logs
need to be processed by the MSSP before being displayed to the security operation
center (SOC) analysts. Employees generate event logs during their working hours
at the customers' sites. One challenge is that the collected event logs contain
personally identifiable information (PII), visible in clear text to the SOC
analysts or any user with access to the SIEM platform.
We explore how pseudonymization can be applied to security event logs to help
protect individuals' identities from the SOC analysts while preserving data utility
when possible. We compare the impact of using different pseudonymization functions
on sensitive information or PII. Non-deterministic methods provide a higher level of
privacy but reduce the utility of the data.
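The deterministic/non-deterministic trade-off can be shown with two minimal pseudonymization functions for an IP field (the key, token sizes, and function names are assumptions for this sketch, not the thesis's benchmarked implementations):

```python
import hmac
import hashlib
import secrets

# Hypothetical key; in practice it would be held by the MSSP, never by
# the SOC analysts who see the pseudonymized logs.
KEY = b"demo-secret-key"

def pseudo_deterministic(ip: str) -> str:
    """Keyed hash (HMAC-SHA256, truncated): same input -> same token,
    so analysts can still correlate events from one host without seeing
    the real address."""
    return hmac.new(KEY, ip.encode(), hashlib.sha256).hexdigest()[:16]

def pseudo_random(ip: str) -> str:
    """Fresh random token per event: maximal privacy, but correlation
    across events is lost, which is the utility cost described above."""
    return secrets.token_hex(8)

ip = "192.0.2.17"
print(pseudo_deterministic(ip) == pseudo_deterministic(ip))  # True: linkable
print(pseudo_random(ip) == pseudo_random(ip))                # False: not linkable
```

Deterministic tokens remain vulnerable to frequency analysis and dictionary attacks on small input spaces such as IPv4, which is one reason non-deterministic functions score higher on privacy in such comparisons.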
Our contribution in this thesis is threefold. First, we study available architectures
with different threat models, including their strengths and weaknesses. Second, we
study pseudonymization functions and their application to PII fields; we benchmark
them individually, as well as in our experimental platform. Last, we obtain valuable
feedback and lessons from SOC analysts based on their experience.
Existing works [43, 44, 48, 39] are generally restricted to the anonymization of
IP traces, which is only one part of the SOC analysts' investigation of PCAP
files. In one of the closest works [47], the authors provide useful, practical
anonymization methods for IP addresses, ports, and raw logs.
Novel reversible text data de-identification techniques based on native data structures
Technological development in today's digital world has resulted in the collection and storage of large amounts of personal data. These data enable both direct services and non-direct activities, known as secondary use. The secondary use of data can improve decision-making, service experiences, and healthcare systems. However, the widespread reuse of personal data raises significant privacy and policy issues, especially for health-related information; these data may contain sensitive attributes, leading to privacy breaches if compromised. Legal systems establish laws to protect the privacy of personal data disclosed for secondary use. A well-known example is the General Data Protection Regulation (GDPR), which outlines a specific set of rules for sharing and storing personal data to protect individual privacy. The GDPR explicitly points to data de-identification, especially pseudonymization, as one measure that can help meet the requirements for the processing of personal data.
The literature on privacy preservation approaches has largely been developed in the field of data anonymization, where personal data are irreversibly removed or obfuscated and there is no means by which to recover an individual's identity if needed. By contrast, pseudonymization is a promising technique to protect privacy while enabling the recovery of de-identified data. Significantly, many existing approaches for pseudonymization were developed long before the GDPR requirements were established, and so they may fail to satisfy its provisions. Therefore, it is worthwhile to offer technical solutions to preserve privacy while supporting the legitimate use of data.
This thesis proposes a novel de-identification system for unstructured textual data, known as ARTPHIL, that generates de-identified data in compliance with the GDPR requirement for strong pseudonymization. The system was evaluated using the 2014 i2b2 testing data. It achieved a recall of 96.93% in detecting and encrypting personal health information, as specified under guidelines provided by the Health Insurance Portability and Accountability Act (HIPAA). The system uses a novel, lightweight cryptography algorithm, E-ART, to encrypt personal data cost-effectively and without compromising security. The main novelty of the E-ART algorithm is its use of the reflection property of a balanced binary tree data structure as a substitution method, instead of complex, multiple iterations. The performance and security of the proposed algorithm were compared to two symmetric encryption algorithms: the Advanced Encryption Standard (AES) and the Data Encryption Standard (DES). The security analysis showed comparable results, but the performance analysis indicated that E-ART had the shortest ciphertext and running time with comparable memory usage, which indicates the feasibility of using ARTPHIL for delay-sensitive or data-intensive applications.
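The reflection idea can be illustrated with a deliberately simplified toy (this is NOT the E-ART algorithm, only the underlying observation): mirroring a balanced binary search tree reverses its in-order traversal, so each symbol can be substituted by the symbol at its mirror-image position in the sorted alphabet, giving a self-inverse, fully reversible mapping.

```python
import string

# Mirror of a balanced BST over the alphabet = reversed in-order
# traversal, i.e. symbol at index i maps to index (n - 1 - i).
ALPHABET = string.ascii_lowercase
MIRROR = {c: ALPHABET[len(ALPHABET) - 1 - i] for i, c in enumerate(ALPHABET)}

def reflect(text: str) -> str:
    """Involution: applying it twice returns the original text.
    Characters outside the alphabet pass through unchanged."""
    return "".join(MIRROR.get(c, c) for c in text)

cipher = reflect("patient name")
print(cipher)           # -> "kzgrvmg mznv"
print(reflect(cipher))  # -> "patient name" (reversible)
```

A single static mirror like this is trivially breakable; the thesis's actual algorithm builds on the tree-reflection property with keyed, security-hardened constructions, which this sketch does not attempt to reproduce.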
Quantifying Privacy Loss of Human Mobility Graph Topology
Human mobility is often represented as a mobility network, or graph, with nodes representing places of significance which an individual visits, such as their home, work, places of social amenity, etc., and edge weights corresponding to probability estimates of movements between these places. Previous research has shown that individuals can be identified by a small number of geolocated nodes in their mobility network, rendering mobility trace anonymization a hard task. In this paper we build on prior work and demonstrate that even when all location and timestamp information is removed from nodes, the graph topology of an individual mobility network itself is often uniquely identifying. Further, we observe that a mobility network is often unique even when only a small number of the most popular nodes and edges are considered. We evaluate our approach using a large dataset of cell-tower location traces from 1,500 smartphone handsets with a mean duration of 430 days. We process the data to derive the top-N places visited by the device in the trace, and find that 93% of traces have a unique top-10 mobility network, and all traces are unique when considering top-15 mobility networks. Since mobility patterns, and therefore mobility networks for an individual, vary over time, we use graph kernel distance functions to determine whether two mobility networks, taken at different points in time, represent the same individual. We then show that our distance metrics, while imperfect predictors, perform significantly better than a random strategy, and therefore our approach represents a significant loss in privacy.
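The top-N mobility-network construction can be sketched as follows (the traces, function name, and frequency-rank relabeling are illustrative assumptions; the paper's actual canonicalization and graph-kernel comparison are more involved):

```python
from collections import Counter

def top_n_network(trace, n):
    """Build a label-free form of the top-n transition graph.

    Keep only transitions among the n most-visited places, strip the
    location labels, and relabel nodes by visit-frequency rank so two
    networks compare equal when their weighted topologies match.
    (Rank relabeling is a simplification; ties would need a proper
    canonical form.)
    """
    ranked = Counter(trace).most_common(n)
    rank = {place: i for i, (place, _) in enumerate(ranked)}  # 0 = most visited
    edges = Counter(
        (rank[a], rank[b])
        for a, b in zip(trace, trace[1:])
        if a in rank and b in rank
    )
    return frozenset(edges.items())

# Made-up traces: no coordinates or timestamps survive in the output,
# yet the anonymous topologies can still differ between individuals.
alice = ["home", "work", "home", "gym", "home", "work", "home"]
bob = ["flat", "office", "cafe", "flat", "office", "flat", "cafe"]
print(top_n_network(alice, 3) != top_n_network(bob, 3))  # True: topologies differ
```

This is exactly the privacy point of the paper: even after all labels are stripped, the weighted transition structure alone can act as a fingerprint.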