Private Graph Data Release: A Survey
The application of graph analytics to various domains has yielded tremendous
societal and economic benefits in recent years. However, the increasingly
widespread adoption of graph analytics comes with a commensurate increase in
the need to protect private information in graph databases, especially in light
of the many privacy breaches in real-world graph data that were supposed to
protect sensitive information. This paper provides a comprehensive survey of
private graph data release algorithms that seek to achieve a fine balance
between privacy and utility, with a specific focus on provably private
mechanisms. Many of these mechanisms are natural extensions of the
Differential Privacy framework to graph data, but we also investigate more
general privacy formulations, such as Pufferfish Privacy, that can address the
limitations of Differential Privacy. A wide-ranging survey of the applications
of private graph data release mechanisms to social networks, finance, supply
chains, health, and energy is also provided. This survey and the taxonomy
it provides should benefit practitioners and researchers alike in the
increasingly important area of private graph data release and analysis.
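The basic building block behind many of the surveyed mechanisms can be sketched concretely. Below is a minimal, illustrative example of answering an edge-count query under edge-level differential privacy with the Laplace mechanism; the function names are our own and this is a sketch of the general technique, not code from the survey:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Sample from a Laplace(0, scale) distribution via the inverse CDF.
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_edge_count(edges: set, epsilon: float) -> float:
    # Under edge-level differential privacy, adding or removing a single
    # edge changes the count by exactly 1, so the L1 sensitivity is 1.
    sensitivity = 1.0
    return len(edges) + laplace_noise(sensitivity / epsilon)
```

Queries with higher sensitivity (e.g. degree distributions or subgraph counts) require proportionally more noise, which is where much of the surveyed work on utility preservation comes in.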
Publishing data from electronic health records while preserving privacy: a survey of algorithms
The dissemination of Electronic Health Records (EHRs) can be highly beneficial for a range of medical studies, spanning from clinical trials to epidemic control studies, but it must be performed in a way that preserves patients’ privacy. This is not straightforward, because the disseminated data need to be protected against several privacy threats, while remaining useful for subsequent analysis tasks. In this work, we present a survey of algorithms that have been proposed for publishing structured patient data in a privacy-preserving way. We review more than 45 algorithms, derive insights on their operation, and highlight their advantages and disadvantages. We also provide a discussion of some promising directions for future research in this area.
Data anonymization: algorithms, techniques and tools
In recent years, the volume of online information has grown exponentially.
Each individual's personal data are continuously used by governments, companies,
and individuals to produce statistical data. These can then be used in marketing
campaigns, in forecasting future trends, in supporting scientific and medical
research, and in many other ways.
The main problem with using these data is that they may contain sensitive
information, as well as information that allows an individual to be identified,
which can cause serious personal harm such as identity theft or financial
extortion, depending on the data disclosed.
Data anonymization addresses this problem. Its purpose is to alter the data so
as to hide sensitive or identifying information, making the data less precise.
One of the greatest difficulties in data anonymization is that the utility of
the data must be preserved while protecting individuals' privacy; this requires
careful attention to the techniques and algorithms used and to how many times
they are applied.
This work studies the most common anonymization techniques, such as
generalization, suppression, anatomization, permutation, and perturbation, as
well as some of the best-known anonymization models, such as k-anonymity and
l-diversity.
To evaluate and apply these techniques and algorithms, the open-source tools
ARX Data Anonymization Tool, UTD Anonymization Toolbox, and Amnesia were used.
Each of these tools was also evaluated using the OSSpal methodology.
The OSSpal methodology evaluates open-source tools against a set of categories,
helping users and organizations find the best ones. In the context of this
thesis, the categories used were functionality, functional software
characteristics, support and services, documentation, software technology
attributes, community and adoption, and the development process.
The experimental work in this thesis consisted of evaluating the three
anonymization tools on two real datasets. The UTD Anonymization Toolbox was
used with only one of the datasets, the smaller one, because this tool requires
the dataset's elements to be entered manually into a file, which can introduce
errors.
The evaluation shows that the ARX Data Anonymization Tool presents the data in
the simplest way and offers the best visualization for the user. Amnesia is
easy to use because it shows the user every step needed to anonymize a dataset,
although it displays some errors; the UTD Anonymization Toolbox was the hardest
to use, both because it lacks a graphical interface and because the data must
be entered manually.
The experimental evaluation leads to the conclusion that the ARX Data
Anonymization Tool is the best tool for data anonymization, followed by
Amnesia and, lastly, the UTD Anonymization Toolbox.
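As an illustration of the kind of guarantee tools like ARX enforce, here is a minimal sketch of a k-anonymity check together with a toy generalization step. The field names and the age-band width are assumptions made for the example, not taken from the thesis:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    # A table is k-anonymous if every combination of quasi-identifier
    # values is shared by at least k rows.
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

def generalize_age(row, width=10):
    # Coarsen an exact age into a band, e.g. 34 -> "30-39".
    lo = (row["age"] // width) * width
    out = dict(row)
    out["age"] = f"{lo}-{lo + width - 1}"
    return out
```

For example, three records with distinct exact ages fail a 2-anonymity check on ("age", "zip"), but after generalizing the ages into one band they satisfy 3-anonymity. Real tools search over many such generalization hierarchies to minimize information loss.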
Processing of Erroneous and Unsafe Data
Statistical offices have to overcome many problems before they can publish reliable data. Two of these problems are examined in this thesis. The first is the occurrence of errors in the collected data: because of these errors, publication figures cannot be based directly on the collected data, so the errors have to be localised and corrected before publication. This thesis focuses on the localisation of errors in a mix of categorical and numerical data. By assuming that as few errors as possible were made, the localisation of erroneous values can be formulated as a mathematical optimisation problem. Several new algorithms for solving this problem are proposed, and computational results of the most promising algorithms are compared. The second problem examined in this thesis is the occurrence of unsafe data, i.e. data that would reveal too much sensitive information about individual respondents. Before publication, such unsafe data need to be protected. The thesis examines various aspects of this protection, including solutions for computing the information loss caused by protecting sensitive data and for minimising the amount of information that is withheld from publication.
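The minimal-change formulation of error localisation can be sketched in miniature: find a smallest set of fields that, if allowed to change, makes every edit rule satisfiable. This toy brute-force version assumes each rule can be repaired by changing any one of the fields it involves; the record fields and rules are hypothetical, and the thesis's actual algorithms solve a much harder optimisation problem:

```python
from itertools import combinations

def error_localization(record, checks):
    # checks: list of (involved_fields, predicate) pairs.
    # Search subsets of fields in increasing size; return the first
    # (hence smallest) subset whose modification could repair all checks.
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            # A check is repairable if it already holds, or if it
            # involves at least one field we are allowed to change.
            if all(check(record) or any(f in subset for f in involved)
                   for involved, check in checks):
                return set(subset)
```

For a record like a 5-year-old marked as married with a negative income, the minimal repair set must touch both the age/marital rule and the income rule.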
Privacy by Design in Data Mining
Privacy is an ever-growing concern in our society: the lack of reliable privacy safeguards in many current services and devices is one reason why their diffusion is often more limited than expected. Moreover, people are reluctant to provide true personal data unless it is absolutely necessary. Thus, privacy is becoming a fundamental aspect to take into account when one wants to use, publish, and analyze data involving sensitive information. Many recent research works have focused on the study of privacy protection: some of these studies aim at individual privacy, i.e., the protection of sensitive individual data, while others aim at corporate privacy, i.e., the protection of strategic information at the organization level. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store, and analyze complex data that describe human activities in great detail and resolution. As a result, anonymization cannot be accomplished by de-identification alone. In the last few years, several techniques for creating anonymous or obfuscated versions of data sets have been proposed, which essentially aim to find an acceptable trade-off between data privacy on the one hand and data utility on the other. So far, the common finding is that no general method exists which is capable of both dealing with “generic personal data” and preserving “generic analytical results”.
In this thesis we propose the design of technological frameworks to counter the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of data mining technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start. Therefore, we propose the privacy-by-design paradigm, which sheds a new light on the study of privacy protection: once specific assumptions are made about the sensitive data and the target mining queries that are to be answered with the data, it is conceivable to design a framework to: a) transform the source data into an anonymous version with a quantifiable privacy guarantee, and b) guarantee that the target mining queries can be answered correctly using the transformed data instead of the original ones.
This thesis investigates two new research issues arising in modern Data Mining and Data Privacy: individual privacy protection in data publishing while preserving specific data mining analyses, and corporate privacy protection in data mining outsourcing.
Anonymization Techniques for Privacy-preserving Process Mining
Process mining analyzes business processes using event logs. Each activity execution is recorded as an event in a trace, representing a process instance's behavior. Traces often hold sensitive information, such as patient data. This thesis addresses privacy concerns arising from trace data and process mining. A re-identification risk study on public event logs reveals high risk, but other threats exist as well. Anonymization is vital to address these issues, yet challenging because the behavioral aspects of the event log must be preserved for analysis, leading to a privacy-utility trade-off. New algorithms, SaCoFa and SaPa, are introduced for trace anonymization; they use noise to guarantee differential privacy while maintaining utility. PRIPEL supplements anonymized control flows with trace context information, enabling the publication of complete, protected logs. For k-anonymity, the PRETSA algorithm family merges privacy-violating traces based on a prefix representation of the event log, keeping the anonymized log syntactically similar to the original. Empirical evaluations demonstrate better utility preservation than existing techniques.
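As a rough illustration of the differential-privacy side of this trade-off (not the actual SaCoFa or SaPa algorithms), the following sketch releases noisy trace-variant counts from an event log with the Laplace mechanism:

```python
import math
import random
from collections import Counter

def laplace_noise(scale: float) -> float:
    # Sample from a Laplace(0, scale) distribution via the inverse CDF.
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_variant_counts(traces, epsilon):
    # Each case contributes to exactly one trace variant, so releasing
    # all variant counts has L1 sensitivity 1 under add/remove-one-case
    # differential privacy.
    counts = Counter(tuple(t) for t in traces)
    return {v: c + laplace_noise(1.0 / epsilon) for v, c in counts.items()}
```

The hard part, which the thesis's algorithms address, is that noisy counts alone distort the behavioral structure of the log; preserving control-flow utility requires more careful mechanisms than this sketch.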
CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS
The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization whose aim is to further classification research.
Dictionary of privacy, data protection and information security
The Dictionary of Privacy, Data Protection and Information Security explains the complex technical terms, legal concepts, privacy management techniques, conceptual matters and vocabulary that inform public debate about privacy.
The revolutionary and pervasive influence of digital technology affects numerous disciplines and sectors of society, and concerns about its potential threats to privacy are growing. With over a thousand terms meticulously set out, described and cross-referenced, this Dictionary enables productive discussion by covering the full range of fields accessibly and comprehensively. In the ever-evolving debate surrounding privacy, this Dictionary takes a longer view, transcending the details of today's problems, technology, and the law to examine the wider principles that underlie privacy discourse.
Interdisciplinary in scope, this Dictionary is invaluable to students, scholars and researchers in law, technology and computing, cybersecurity, sociology, public policy and administration, and regulation. It is also a vital reference for diverse practitioners including data scientists, lawyers, policymakers and regulators.