
    A Taxonomy of Privacy-Preserving Record Linkage Techniques

    The process of identifying which records in two or more databases correspond to the same entity is an important aspect of data quality activities such as data pre-processing and data integration. Known as record linkage, data matching or entity resolution, this process has attracted interest from researchers in fields such as databases and data warehousing, data mining, information systems, and machine learning. Record linkage faces various challenges, including scalability to large databases, accurate matching and classification, and privacy and confidentiality. The latter challenge arises because personal identifying data, such as the names, addresses and dates of birth of individuals, are commonly used in the linkage process. When databases are linked across organizations, the question of how to protect the privacy and confidentiality of such sensitive information is crucial to the successful application of record linkage. In this paper we present an overview of techniques that allow databases to be linked between organizations while at the same time preserving the privacy of these data. Known as 'privacy-preserving record linkage' (PPRL), various such techniques have been developed. We present a taxonomy that characterizes PPRL techniques along 15 dimensions and use it to conduct a survey of existing techniques. We then highlight shortcomings of current techniques and discuss avenues for future research.

    Scalable and approximate privacy-preserving record linkage

    Record linkage, the task of linking multiple databases with the aim of identifying records that refer to the same entity, is occurring increasingly in many application areas. Generally, unique entity identifiers are not available in all the databases to be linked. Therefore, record linkage requires the use of personal identifying attributes, such as names and addresses, to identify matching records that need to be reconciled to the same entity. Often, it is not permissible to exchange personal identifying data across different organizations due to privacy and confidentiality concerns or regulations. This has led to the novel research area of privacy-preserving record linkage (PPRL). PPRL addresses the problem of how to link different databases to identify records that correspond to the same real-world entities, without revealing the identities of these entities or any private or confidential information to any party involved in the process, or to any external party, such as a researcher. The three key challenges that a PPRL solution in a real-world context needs to address are (1) scalability to large databases by conducting the linkage efficiently; (2) achieving high linkage quality through the use of approximate (string) matching and effective classification of the compared record pairs into matches (i.e. pairs of records that refer to the same entity) and non-matches (i.e. pairs of records that refer to different entities); and (3) provision of sufficient privacy guarantees such that the interested parties only learn the actual values of certain attributes of the records that were classified as matches, and the process is secure with regard to any internal or external adversary. In this thesis, we present extensive research in PPRL that addresses several gaps and problems identified in existing PPRL approaches. First, we begin the thesis with a review of the literature and propose a taxonomy to characterize existing PPRL techniques. This allows us to identify gaps and research directions. In the remainder of the thesis, we address several of the identified shortcomings. One main shortcoming we address is the lack of a framework for the empirical and comparative evaluation of different PPRL solutions, which has so far not been available in the literature. Second, we propose several novel algorithms for scalable and approximate PPRL that address the three main challenges of PPRL. We propose efficient private blocking techniques, for both three-party and two-party scenarios, based on sorted neighborhood clustering to address the scalability challenge. Next, we propose two efficient two-party techniques for private matching and classification to address the linkage quality challenge in terms of approximate matching and effective classification. Privacy is addressed in these approaches using efficient data perturbation techniques including k-anonymous mapping, reference values, and Bloom filters. Finally, the thesis reports on an extensive comparative evaluation of our proposed solutions against several other state-of-the-art techniques on real-world datasets, which shows that our solutions outperform the others with respect to all three key challenges.
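    As a rough illustration of the Bloom filter perturbation mentioned above, the following sketch encodes string values as sets of bit positions and compares the encodings with the Dice coefficient. It is a minimal example, not the thesis's protocol; the filter length, number of hash functions, double-hashing construction and q-gram length are illustrative assumptions.

import hashlib

# Hedged sketch: Bloom-filter encoding of q-grams for privacy-preserving
# approximate string comparison. Parameters below are illustrative only.
BF_LEN = 1000    # number of bit positions in the filter (assumed)
NUM_HASH = 30    # hash functions per q-gram (assumed)

def qgrams(value, q=2):
    """Split a string into overlapping q-grams (bigrams by default)."""
    value = value.lower().strip()
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def encode(value):
    """Map each q-gram to NUM_HASH bit positions via double hashing."""
    bits = set()
    for gram in qgrams(value):
        h1 = int(hashlib.sha1(gram.encode()).hexdigest(), 16)
        h2 = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        for i in range(NUM_HASH):
            bits.add((h1 + i * h2) % BF_LEN)
    return bits

def dice(bf_a, bf_b):
    """Approximate similarity computed on the encodings only."""
    if not bf_a or not bf_b:
        return 0.0
    return 2 * len(bf_a & bf_b) / (len(bf_a) + len(bf_b))

# The parties exchange only the bit patterns, never the plain-text names.
print(dice(encode("christine"), encode("christina")))

    Comparing the encodings rather than the original values is what allows approximate matching of identifying attributes without revealing the attributes themselves.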

    Secure Two-Party Protocol for Privacy-Preserving Classification via Differential Privacy

    Privacy-preserving distributed data mining is the study of mining distributed data, owned by multiple data owners, in a non-secure environment, where the mining protocol does not reveal any sensitive information to the data owners, individual privacy is preserved, and the output mining model is practically useful. In this thesis, we propose a secure two-party protocol for building a privacy-preserving decision tree classifier over distributed data using differential privacy. We utilize secure multiparty computation to ensure that the protocol is privacy-preserving. Our algorithm also utilizes parallel and sequential compositions, and applies a distributed exponential mechanism to ensure that the output is differentially private. We implemented our protocol in a distributed environment on real-life data, and the experimental results show that the protocol produces decision tree classifiers with high utility while being reasonably efficient and scalable.
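    The distributed exponential mechanism itself is not spelled out in the abstract; the single-party sketch below only illustrates the underlying idea of selecting a decision-tree split attribute with probability weighted by its utility score. The candidate attributes, scores, epsilon and sensitivity values are assumptions made purely for illustration.

import math
import random

def exponential_mechanism(scores, epsilon, sensitivity=1.0):
    """Sample a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity)); assumes scores is non-empty."""
    candidates = list(scores)
    weights = [math.exp(epsilon * scores[c] / (2.0 * sensitivity))
               for c in candidates]
    r = random.uniform(0.0, sum(weights))
    cumulative = 0.0
    for candidate, weight in zip(candidates, weights):
        cumulative += weight
        if r <= cumulative:
            return candidate
    return candidates[-1]  # guard against floating-point rounding

# Hypothetical utility scores for three candidate split attributes.
scores = {"age": 0.42, "zipcode": 0.35, "gender": 0.10}
print(exponential_mechanism(scores, epsilon=1.0))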

    Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges

    The growth of Big Data, especially personal data dispersed in multiple data sources, presents enormous opportunities and insights for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these entities. PPRL is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. PPRL for Big Data poses several challenges, with the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (2) achieving high quality linkage results in the presence of the variety and veracity of Big Data, and (3) preserving the privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research. This work was partially funded by the Australian Research Council under Discovery Project DP130101801, the German Academic Exchange Service (DAAD) and Universities Australia (UA) under the Joint Research Co-operation Scheme, and also by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).

    A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage

    Today many application domains, such as national statistics, healthcare, business analytics, fraud detection, and national security, require data to be integrated from multiple databases. Record linkage (RL) is a process used in data integration which links multiple databases to identify matching records that belong to the same entity. RL enriches the usefulness of data by removing duplicates, errors, and inconsistencies, which improves the effectiveness of decision making in data analytics applications. Often, organisations are not willing or authorised to share the sensitive information in their databases with any other party due to privacy and confidentiality regulations. The linkage of databases of different organisations is an emerging research area known as privacy-preserving record linkage (PPRL). PPRL facilitates the linkage of databases by ensuring the privacy of the entities in these databases. In the multidatabase (MD) context, PPRL is significantly challenged by the intrinsic exponential growth in the number of potential record pair comparisons. Such linkage often requires significant time and computational resources to produce the resulting sets of matching records. Preserving the privacy of the data also becomes more problematic as the number of parties involved in the linkage process increases, due to the increased risk of collusion. Blocking is commonly used to scale the linkage of large databases. The aim of blocking is to remove those record pairs that correspond to non-matches (i.e. refer to different entities). Many blocking techniques have been proposed for RL and PPRL over two databases. However, many of these techniques are not suitable for blocking multiple databases. This creates a need to develop blocking techniques for the multidatabase linkage context, as real-world applications increasingly require more than two databases. This thesis is the first to conduct extensive research on blocking for multidatabase privacy-preserving record linkage (MD-PPRL). We consider several research problems in blocking for MD-PPRL. First, we start with a broad review of the background literature on PPRL. This allows us to identify the main research gaps that need to be investigated in MD-PPRL. Second, we introduce a blocking framework for MD-PPRL which provides more flexibility and control to database owners in the block generation process. Third, we propose different techniques that are used in our framework for (1) blocking of multiple databases, (2) identifying blocks that need to be compared across subgroups of these databases, and (3) filtering redundant record pair comparisons through the efficient scheduling of block comparisons to improve the scalability of MD-PPRL. Each of these techniques covers an important aspect of blocking in real-world MD-PPRL applications. Finally, this thesis reports on an extensive evaluation of the combined application of these methods on real datasets, which illustrates that they outperform existing approaches in terms of scalability, accuracy, and privacy.
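    As a minimal, non-private illustration of why blocking matters in the multidatabase setting, the sketch below groups records from several databases by a simple blocking key and only generates cross-database candidate pairs within each block. The blocking key and records are made-up examples; the thesis's actual block generation, subgroup comparison and scheduling techniques are considerably more sophisticated and privacy-preserving.

from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Illustrative key: first letter of the surname plus postcode prefix.
    return record["surname"][:1].lower() + record["postcode"][:2]

def candidate_pairs(databases):
    """Group records from all databases by blocking key and emit only
    cross-database pairs within each block."""
    blocks = defaultdict(list)
    for db_id, records in enumerate(databases):
        for rec in records:
            blocks[blocking_key(rec)].append((db_id, rec))
    for members in blocks.values():
        for (db_a, rec_a), (db_b, rec_b) in combinations(members, 2):
            if db_a != db_b:  # never compare records of the same database
                yield rec_a, rec_b

db1 = [{"surname": "Smith", "postcode": "2601"}]
db2 = [{"surname": "Smyth", "postcode": "2602"}]
db3 = [{"surname": "Jones", "postcode": "2601"}]
print(list(candidate_pairs([db1, db2, db3])))  # only Smith/Smyth are compared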

    Innovative Methods for Cross-Site Data Use in Medical Research

    Implementing modern data-driven medical research approaches ("Artificial intelligence", "Data Science") requires access to large amounts of data ("Big Data"). Typically, this can only be achieved through cross-institutional data use and exchange ("Data Sharing"). In this process, protecting the privacy of the patients and probands affected is a central challenge. Various methods can be used to meet this challenge, such as anonymization or federation. However, data sharing is currently put into practice only to a limited extent, although it is demanded and promoted from many sides. One reason for this is the lack of clarity about the advantages and disadvantages of different data sharing approaches. The first goal of this thesis was to develop an instrument that makes these advantages and disadvantages more transparent. The instrument systematizes approaches based on two dimensions, utility and protection, where each dimension is further differentiated by three axes describing different aspects, such as the degree of privacy protection provided by the results of performed analyses or the flexibility of a platform regarding the types of analyses that can be performed. The instrument was used for evaluation purposes to analyze the status quo and to identify gaps and potentials for innovative approaches. Next, and as a second goal, an innovative tool for the practical use of cryptographic data sharing methods was designed and implemented. So far, such approaches are only rarely used in practice due to two main obstacles: (1) the technical complexity of setting up a cryptography-based data sharing infrastructure and (2) a lack of user-friendliness of cryptographic data sharing methods, especially for medical researchers. The tool EasySMPC, which was developed as part of this work, allows the cryptographically secure computation of sums (e.g., frequencies of diagnoses) across institutional boundaries based on an easy-to-use graphical user interface. Neither technical expertise nor the deployment of specific infrastructure components is necessary for its practical use. The practicability of EasySMPC was analyzed experimentally in a detailed performance evaluation.
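    To make the idea of a cryptographically secure sum more concrete, the following sketch uses additive secret sharing over a large modulus: each institution splits its local count into random shares, and only shares and partial sums are ever exchanged. This is a simplified illustration of secure summation in general, not the actual protocol or interface of EasySMPC; the modulus and counts are assumptions.

import secrets

MODULUS = 2**61 - 1  # arithmetic is done modulo a large prime (assumed)

def share(value, num_parties):
    """Split a value into random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Three institutions, each holding a local diagnosis count it will not reveal.
local_counts = [17, 4, 23]
num_parties = len(local_counts)

# Each institution splits its count and distributes one share to every party.
all_shares = [share(count, num_parties) for count in local_counts]

# Each party adds up the shares it received; only these partial sums are shared.
partial_sums = [sum(all_shares[i][p] for i in range(num_parties)) % MODULUS
                for p in range(num_parties)]

# Adding the partial sums reveals the total, but no individual count.
print(sum(partial_sums) % MODULUS)  # 44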

    A new semantic similarity join method using diffusion maps and long string table attributes

    With the rapid increase in distributed data sources, and in order to integrate information, there is a need to combine information that refers to the same entity from different sources. However, there are no global conventions that control the format of the data, and it is impractical to impose such conventions. There may also be spelling errors in the data, as it is entered manually in most cases. For such reasons, the ability to find and join similar records, rather than only exactly matching records, is important for integrating the data. Most previous work has concentrated on similarity join when the join attribute is a short string attribute, such as a person name or address. However, most databases contain long string attributes as well, such as product descriptions and paper abstracts, and to the best of our knowledge, no work has been done in this direction. The use of long string attributes is promising, as these attributes contain much more information than short string attributes, which could improve similarity join performance. On the other hand, most of the literature does not consider semantic similarities during the similarity join process. To address these issues, 1) we showed that the use of long attributes outperforms the use of short attributes in the similarity join process in terms of similarity join accuracy, with comparable running time, under both supervised and unsupervised learning scenarios; 2) we found the best semantic similarity method to join long attributes in both supervised and unsupervised learning scenarios; 3) we proposed efficient semantic similarity join methods using long attributes under both supervised and unsupervised learning scenarios; 4) we proposed privacy-preserving similarity join protocols that support the use of long attributes to increase similarity join accuracy under both supervised and unsupervised learning scenarios; 5) we studied the effect of using multi-label supervised learning on similarity join performance; and 6) we found an efficient similarity join method for expandable databases.
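    The sketch below illustrates a plain (non-semantic, non-private) similarity join on a long string attribute using bag-of-words cosine similarity. It is only meant to show how a long attribute such as a product description can be compared; the attribute name, threshold and records are illustrative assumptions, not the paper's diffusion-map method.

import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term frequencies of a long string attribute."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_join(left, right, threshold=0.5):
    """Yield pairs of records whose long attribute is sufficiently similar."""
    for l_rec in left:
        l_vec = vectorize(l_rec["description"])
        for r_rec in right:
            if cosine(l_vec, vectorize(r_rec["description"])) >= threshold:
                yield l_rec, r_rec

left = [{"description": "wireless optical mouse with usb receiver"}]
right = [{"description": "optical wireless mouse including a usb receiver"}]
print(list(similarity_join(left, right)))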