14 research outputs found

    Machine Learning on Anonymized Data


    Private Graph Data Release: A Survey

    The application of graph analytics to various domains has yielded tremendous societal and economic benefits in recent years. However, the increasingly widespread adoption of graph analytics comes with a commensurate increase in the need to protect private information in graph databases, especially in light of the many privacy breaches in real-world graph data that was supposed to protect sensitive information. This paper provides a comprehensive survey of private graph data release algorithms that seek to strike a fine balance between privacy and utility, with a specific focus on provably private mechanisms. Many of these mechanisms are natural extensions of the Differential Privacy framework to graph data, but we also investigate more general privacy formulations, such as Pufferfish Privacy, that can address the limitations of Differential Privacy. A wide-ranging survey of the applications of private graph data release mechanisms to social networks, finance, supply chains, health, and energy is also provided. This survey and the taxonomy it provides should benefit practitioners and researchers alike in the increasingly important area of private graph data release and analysis.
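    As a concrete illustration of the provably private mechanisms the survey focuses on, here is a minimal sketch of releasing a single graph statistic under edge-level differential privacy via the Laplace mechanism; the graph, epsilon value, and function names are illustrative, not taken from the paper:

```python
import random

def laplace_noise(scale: float) -> float:
    # A Laplace(0, scale) sample: the difference of two exponential draws.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_edge_count(edges: set, epsilon: float) -> float:
    # Adding or removing one edge changes the count by exactly 1, so the
    # global sensitivity is 1 and Laplace(1/epsilon) noise yields
    # epsilon-differential privacy at the edge level.
    return len(edges) + laplace_noise(1.0 / epsilon)

graph = {(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)}
noisy_count = private_edge_count(graph, epsilon=0.5)
```

    Smaller epsilon means stronger privacy but noisier output; richer statistics such as degree sequences or subgraph counts have larger sensitivity and need more careful calibration, which is exactly the design space such surveys map out.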

    Publishing data from electronic health records while preserving privacy: a survey of algorithms

    The dissemination of Electronic Health Records (EHRs) can be highly beneficial for a range of medical studies, from clinical trials to epidemic control, but it must be performed in a way that preserves patients' privacy. This is not straightforward, because the disseminated data need to be protected against several privacy threats while remaining useful for subsequent analysis tasks. In this work, we present a survey of algorithms that have been proposed for publishing structured patient data in a privacy-preserving way. We review more than 45 algorithms, derive insights into their operation, and highlight their advantages and disadvantages. We also discuss some promising directions for future research in this area.
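    A common building block in such publishing algorithms is the generalization and suppression of quasi-identifiers before release. A minimal sketch of both operations, where the field names and interval widths are illustrative rather than taken from any specific algorithm in the survey:

```python
def generalize_age(age: int, width: int = 10) -> str:
    # Replace an exact age with a fixed-width interval.
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def suppress_zip(zipcode: str, keep: int = 3) -> str:
    # Suppress trailing digits, keeping only a coarse prefix.
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

record = {"age": 34, "zip": "02139", "diagnosis": "flu"}
published = {
    "age": generalize_age(record["age"]),  # "30-39"
    "zip": suppress_zip(record["zip"]),    # "021**"
    "diagnosis": record["diagnosis"],      # sensitive value kept for analysis
}
```

    Coarsening the quasi-identifiers makes individual records harder to link to external data while the sensitive attribute stays usable for study.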

    Data anonymization: algorithms, techniques and tools

    In recent years, the volume of online information has grown exponentially. Each individual's personal data are continuously used by governments, companies, and other individuals to produce statistics, which can then be used in marketing campaigns, in forecasting future trends, in supporting scientific and medical research, and in many other contexts. The main problem with using these data is that they may contain sensitive information, as well as information that allows an individual to be identified, which can cause serious personal harm, such as identity theft or financial extortion, depending on the data disclosed. Data anonymization addresses this problem: it modifies the data so as to hide sensitive and identifying information, making the data less precise. One of the greatest difficulties in data anonymization is that, while individuals' privacy is protected, the utility of the data must be preserved; this requires careful attention to the techniques and algorithms used and to how many times they are applied. This work studies the most common anonymization techniques, such as generalization, suppression, anatomization, permutation, and perturbation, as well as some of the best-known anonymization algorithms, such as k-anonymity and l-diversity. To evaluate and apply these techniques and algorithms, the open-source tools ARX Data Anonymization Tool, UTD Anonymization Toolbox, and Amnesia were used. Each of these tools was also evaluated using the OSSpal methodology, which assesses open-source tools against a set of categories to help users and organizations find the best ones.
In the context of this thesis, the categories used were functionality, functional characteristics of the software, support and services, documentation, software technology attributes, community and adoption, and the development process. The experimental work consisted of evaluating the three anonymization tools on two real datasets. The UTD Anonymization Toolbox was used with only the smaller of the two datasets, because this tool requires the dataset to be entered manually into a file, which can introduce errors. The evaluation shows that the ARX Data Anonymization Tool presents the data in the simplest way and offers the best visualization to the user. Amnesia is easy to use, since it walks the user through all the steps needed to anonymize a dataset, although it exhibits some errors; the UTD Anonymization Toolbox was the hardest to use, both because it lacks a graphical interface and because the data must be entered manually. The experimental evaluation leads to the conclusion that the ARX Data Anonymization Tool is the best tool for data anonymization, followed by Amnesia and, lastly, the UTD Anonymization Toolbox.
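    The tools compared above (ARX Data Anonymization Tool, Amnesia, UTD Anonymization Toolbox) enforce privacy models such as k-anonymity. A minimal sketch of the property itself, with an illustrative table and threshold:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # A table is k-anonymous if every combination of quasi-identifier
    # values occurs in at least k records.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

table = [
    {"age": "30-39", "zip": "021**", "disease": "flu"},
    {"age": "30-39", "zip": "021**", "disease": "asthma"},
    {"age": "40-49", "zip": "021**", "disease": "flu"},
]
is_k_anonymous(table, ["age", "zip"], k=2)  # False: the 40-49 group has only one record
```

    The anonymization tools search for the least amount of generalization and suppression that makes a check like this pass, which is where the privacy-utility tension shows up in practice.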

    Processing of Erroneous and Unsafe Data

    Statistical offices have to overcome many problems before they can publish reliable data. Two of these problems are examined in this thesis. The first is the occurrence of errors in the collected data. Because of these errors, publication figures cannot be based directly on the collected data: before publication, the errors have to be localised and corrected. This thesis focuses on the localisation of errors in a mix of categorical and numerical data. By assuming that as few errors as possible were made, the localisation problem can be formulated as a mathematical optimisation problem. Several new algorithms for solving this problem are proposed, and computational results of the most promising algorithms are compared. The second problem examined in this thesis is the occurrence of unsafe data, i.e. data that would reveal too much sensitive information about individual respondents or small groups of respondents. Before publication, such unsafe data need to be protected, for instance by withholding certain information. The thesis examines the mathematical problems that protecting sensitive data entails, and describes solutions for several of them, such as computing the information loss caused by the protection and minimising the amount of information that is withheld.
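    The error-localisation problem described above can be made concrete with a brute-force sketch of the underlying optimisation (the Fellegi-Holt principle: change as few fields as possible so that all edit rules hold). The edit rules, domains, and function names here are illustrative, not the thesis's algorithms, which solve this far more efficiently:

```python
from itertools import combinations, product

def minimal_error_fields(record, edits, domains):
    # Try ever-larger sets of fields; return the first (hence smallest)
    # set whose values can be re-imputed so every edit rule is satisfied.
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(domains[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if all(edit(candidate) for edit in edits):
                    return set(subset)
    return set(fields)

# Illustrative edit rules: ages are non-negative, and a person
# under 16 cannot be married.
edits = [
    lambda r: r["age"] >= 0,
    lambda r: not (r["age"] < 16 and r["marital"] == "married"),
]
domains = {"age": range(0, 100), "marital": ["single", "married"]}
record = {"age": 12, "marital": "married"}
minimal_error_fields(record, edits, domains)  # {'age'}: changing one field suffices
```

    The exhaustive search is exponential in the number of fields, which is why the thesis develops specialised algorithms for solving this optimisation problem efficiently on mixed categorical and numerical data.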

    A Survey of Privacy Preserving Data Publishing using Generalization and Suppression


    Privacy by Design in Data Mining

    Privacy is an ever-growing concern in our society: the lack of reliable privacy safeguards in many current services and devices is one reason their diffusion is often more limited than expected. Moreover, people are reluctant to provide true personal data unless it is absolutely necessary. Privacy is therefore becoming a fundamental aspect to take into account when one wants to use, publish, and analyze data involving sensitive information. Many recent research works have focused on privacy protection: some of these studies aim at individual privacy, i.e., the protection of sensitive individual data, while others aim at corporate privacy, i.e., the protection of strategic information at the organization level. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store, and analyze complex data that describes human activities in great detail and resolution. As a result, anonymization simply cannot be accomplished by de-identification alone. In the last few years, several techniques for creating anonymous or obfuscated versions of data sets have been proposed, which essentially aim to find an acceptable trade-off between data privacy on the one hand and data utility on the other. So far, the common finding is that no general method exists that can both deal with “generic personal data” and preserve “generic analytical results”. In this thesis we propose the design of technological frameworks to counter the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of data mining technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start.
Therefore, we propose the privacy-by-design paradigm, which sheds new light on the study of privacy protection: once specific assumptions are made about the sensitive data and the target mining queries that are to be answered with the data, it is conceivable to design a framework that a) transforms the source data into an anonymous version with a quantifiable privacy guarantee, and b) guarantees that the target mining queries can be answered correctly using the transformed data instead of the original. This thesis investigates two new research issues that arise in modern data mining and data privacy: individual privacy protection in data publishing while preserving specific data mining analyses, and corporate privacy protection in data mining outsourcing.

    Anonymization Techniques for Privacy-preserving Process Mining

    Process mining enables the analysis of business processes using event logs. Each activity execution is recorded as an event in a trace, which represents the behavior of one process instance. Traces often contain sensitive information, such as patient data. This thesis addresses the privacy risks that arise from trace data and process mining. An empirical study of the re-identification risk in public event logs shows that this risk is high, although further threats also matter. Anonymization is essential to address these risks, yet challenging, because the behavioral aspects of the event log must be preserved for analysis; this leads to a privacy-utility trade-off. New algorithms, SaCoFa and SaPa, tackle this trade-off by anonymizing traces with noise that guarantees differential privacy while preserving utility.
PRIPEL supplements the anonymized control flows with trace-level contextual information, enabling the publication of complete, protected logs. For k-anonymity, the PRETSA family of algorithms merges privacy-violating traces based on a prefix representation of the event log, aiming to keep the anonymized log syntactically as similar as possible to the original. Empirical evaluations demonstrate better utility preservation than existing techniques.
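    To make the privacy-utility trade-off concrete, here is a minimal sketch of one building block used in differentially private event-log release: adding Laplace noise to trace-variant counts. The log and parameter values are illustrative, and SaCoFa and SaPa are considerably more sophisticated than this sketch:

```python
import random
from collections import Counter

def laplace_noise(scale):
    # A Laplace(0, scale) sample: the difference of two exponential draws.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_variant_counts(log, epsilon):
    # One trace affects exactly one variant count by 1, so per-variant
    # Laplace(1/epsilon) noise gives epsilon-differential privacy for the
    # counts. Note: publishing the variant set itself still leaks, which
    # real mechanisms must also handle.
    counts = Counter(tuple(trace) for trace in log)
    return {v: c + laplace_noise(1.0 / epsilon) for v, c in counts.items()}

log = [
    ("register", "examine", "discharge"),
    ("register", "examine", "discharge"),
    ("register", "xray", "examine", "discharge"),
]
noisy = private_variant_counts(log, epsilon=1.0)
```

    The noisier the counts, the weaker the behavioral signal for downstream process discovery, which is precisely the tension the algorithms above are designed to manage.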

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization whose aim is to further classification research.

    Dictionary of privacy, data protection and information security

    The Dictionary of Privacy, Data Protection and Information Security explains the complex technical terms, legal concepts, privacy management techniques, conceptual matters and vocabulary that inform public debate about privacy. The revolutionary and pervasive influence of digital technology affects numerous disciplines and sectors of society, and concerns about its potential threats to privacy are growing. With over a thousand terms meticulously set out, described and cross-referenced, this Dictionary enables productive discussion by covering the full range of fields accessibly and comprehensively. In the ever-evolving debate surrounding privacy, this Dictionary takes a longer view, transcending the details of today's problems, technology, and the law to examine the wider principles that underlie privacy discourse. Interdisciplinary in scope, this Dictionary is invaluable to students, scholars and researchers in law, technology and computing, cybersecurity, sociology, public policy and administration, and regulation. It is also a vital reference for diverse practitioners including data scientists, lawyers, policymakers and regulators.