24 research outputs found

    A Comparison of Blocking Methods for Record Linkage

    Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues. Comment: 22 pages, 2 tables, 7 figures
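The two quality measures named in this abstract, recall and reduction ratio, can be illustrated with a minimal sketch of traditional blocking. The records, field names (`surname`, `year`), and helper functions below are hypothetical and are not taken from the paper; the sketch only shows how partitioning on a field attribute trades recall against the number of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Partition records by a blocking key; return candidate pairs within blocks."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def reduction_ratio(n_records, candidate_pairs):
    """Fraction of all possible pairs that blocking avoids comparing."""
    total = n_records * (n_records - 1) // 2
    return 1 - len(candidate_pairs) / total


def recall(candidate_pairs, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidate_pairs & true_matches) / len(true_matches)


# Hypothetical toy data: records 1/2 and 3/4 refer to the same people.
records = {
    1: {"surname": "smith", "year": 1980},
    2: {"surname": "smyth", "year": 1980},  # typo in surname
    3: {"surname": "jones", "year": 1975},
    4: {"surname": "jones", "year": 1975},
    5: {"surname": "brown", "year": 1990},
}
true_matches = {(1, 2), (3, 4)}

by_surname = block_pairs(records, lambda r: r["surname"])
by_year = block_pairs(records, lambda r: r["year"])

for name, pairs in [("surname", by_surname), ("year", by_year)]:
    print(name, recall(pairs, true_matches), reduction_ratio(len(records), pairs))
```

In this toy case, blocking on the error-prone surname field compares fewer pairs but loses one true match, while blocking on year keeps both matches at a lower reduction ratio.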

    Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Methods: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Results: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. Conclusions: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
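The "highest F-measure" used as a benchmark in this abstract can be illustrated with a small threshold sweep. This is a sketch under a strong assumption: it uses known match labels, which the paper's EM-based method specifically does not require. The scores, labels, and function names are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs):
    """Return (best F-measure, threshold) over all observed scores.

    scored_pairs: list of (match_score, is_true_match); a pair is accepted
    as a link when its score is >= the threshold.
    """
    total_matches = sum(1 for _, is_match in scored_pairs if is_match)
    best = (0.0, None)
    for threshold in sorted({score for score, _ in scored_pairs}):
        tp = sum(1 for s, m in scored_pairs if s >= threshold and m)
        fp = sum(1 for s, m in scored_pairs if s >= threshold and not m)
        fn = total_matches - tp
        f = f_measure(tp, fp, fn)
        if f > best[0]:
            best = (f, threshold)
    return best


# Hypothetical pair scores (e.g. summed match weights) with ground-truth labels.
pairs = [(12.0, True), (9.5, True), (7.0, False),
         (6.5, True), (2.0, False), (-3.0, False)]

best_f, best_t = best_threshold(pairs)
print(best_t, round(best_f, 3))
```

With labels available, the sweep gives the ceiling against which an estimated threshold can be compared.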

    Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets.

    Background: Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. Real-world performance of these techniques using large-scale data is unknown up to now. Methods: Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated. Results: Clear-text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine tuning of parameters. Conclusions: We argue that increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
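A composite Bloom filter of the kind this abstract calls a CLK can be sketched as follows: the bigrams of several identifier fields are hashed with a shared secret key into one bit array, and two encodings are compared with the Dice coefficient. The filter size, number of hash functions, choice of HMAC-SHA256/SHA512 with double hashing, and the sample names are all illustrative assumptions rather than the paper's configuration, and the multibit-tree search is not shown.

```python
import hashlib
import hmac


def bigrams(value):
    """Padded character bigrams of a lower-cased string."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk(fields, secret, size=1024, num_hashes=8):
    """Encode several fields into one Bloom filter, returned as a set of bit positions.

    Assumed parameters: size, num_hashes, and the double-hashing scheme
    (h1 + i * h2) mod size are illustrative choices.
    """
    bits = set()
    for field in fields:
        for gram in bigrams(field):
            msg = gram.encode("utf-8")
            h1 = int.from_bytes(hmac.new(secret, msg, hashlib.sha256).digest(), "big")
            h2 = int.from_bytes(hmac.new(secret, msg, hashlib.sha512).digest(), "big")
            for i in range(num_hashes):
                bits.add((h1 + i * h2) % size)
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared-secret-key"  # hypothetical key agreed by the data custodians

rec_a = clk(["smith", "john"], secret)
rec_b = clk(["smyth", "john"], secret)    # same person with a typo
rec_c = clk(["garcia", "maria"], secret)  # unrelated person

print(round(dice(rec_a, rec_b), 2), round(dice(rec_a, rec_c), 2))
```

Because similar strings share most bigrams, their encodings share most set bits, which is what allows approximate matching on the encoded data without access to the clear-text identifiers.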

    Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation

    Background: Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. Methods: We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. Results: The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N − 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. Conclusions: The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.

    Efficient protocols for private record linkage

    Record linkage allows data from different sources to be integrated to facilitate data mining tasks. However, in many cases, records have to be linked by personally identifiable information. To prevent privacy breaches, ideally records should be linked in a private way such that no information other than the matching result is leaked in the process. In this paper, we present an exact Private Record Linkage (PRL) protocol and an approximate PRL protocol. The exact PRL protocol is based on Oblivious Bloom Intersection, which is an efficient private set intersection protocol. The approximate PRL protocol extends the exact PRL protocol by incorporating Locality Sensitive Hash functions. Both protocols are secure in the semi-honest model. We also report the evaluation results based on our C implementation of the protocols. The results show that our protocols are efficient and effective

    Efficient private multi-party numerical records matching

    No full text

    Private Blocking Technique for Multi-party Privacy-Preserving Record Linkage

    No full text

    Incognito: A Method for Obfuscating Web Data

    Users leave a trail of their personal data, interests, and intents while surfing or sharing information on the Web. Web data could therefore reveal some private/sensitive information about users based on inference analysis. The possible identification of information corresponding to a single individual by an inference attack holds true even if the user identifiers are encoded or removed in the Web data. Several works have been done on improving privacy of Web data through obfuscation methods [7, 12, 18, 32]. However, these methods are neither comprehensive, generic enough to be applicable to any Web data, nor effective against adversarial attacks. To this end, we propose a privacy-aware obfuscation method for Web data addressing these identified drawbacks of existing methods. We use probabilistic methods to predict privacy risk of Web data that incorporates all key privacy aspects, which are uniqueness, uniformity, and linkability of Web data. The Web data with high predicted risk are then obfuscated by our method to minimize the privacy risk using semantically similar data. Our method is resistant against an adversary who has knowledge about the datasets and model learned risk probabilities using differential privacy-based noise addition. Experimental study conducted on two real Web datasets validates the significance and efficacy of our method. Our results indicate that the average privacy risk reaches 100% with a minimum of 10 sensitive Web entries, while at most 0% privacy risk could be attained with our obfuscation method at the cost of average utility loss of 64.3%.

    Sequence Data Matching and Beyond: New Privacy-Preserving Primitives Based on Bloom Filters

    Bloom filter encoding has widely been used as an efficient masking technique for privacy-preserving matching functions. The existing matching techniques, however, are limited to relatively simple types such as string, categorical and signal numerical values. In this paper, we propose a new scheme that significantly extends the class of matching primitives that are based on the privacy-preserving Bloom filter mechanism. These primitives include sequence data matching and popular distance-based machine learning algorithms such as KNN and SVM. Our scheme hash-maps a sequence data vector into the Bloom filter space while checking the similarity of the data points efficiently with negligible utility loss by adding a timestamp (bit) for each element in the data represented with its neighboring values. Furthermore, it includes a Laplace-like perturbation method on the constructed Bloom filters to address the weakness of deterministic probability led by encoding techniques. As a result, the proposed work guarantees that the private data records are difficult to discriminate due to collisions and differential privacy. The experimental results on three real-scenario based datasets illustrate that our method can achieve a significantly better trade-off between utility and privacy than the state-of-the-art differential privacy-based method that adds Laplace noise to the data directly.

    A Privacy-Preserving-Framework-Based Blockchain and Deep Learning for Protecting Smart Power Networks

    Modern power systems depend on cyber-physical systems to link physical devices and control technologies. A major concern in the implementation of smart power networks is to minimize the risk of data privacy violation (e.g., by adversaries using data poisoning and inference attacks). In this article, we propose a privacy-preserving framework to achieve both privacy and security in smart power networks. The framework includes two main modules: a two-level privacy module and an anomaly detection module. In the two-level privacy module, an enhanced-proof-of-work-technique-based blockchain is designed to verify data integrity and mitigate data poisoning attacks, and a variational autoencoder is simultaneously applied for transforming data into an encoded format for preventing inference attacks. In the anomaly detection module, a long short-term memory deep learning technique is used for training and validating the outputs of the two-level privacy module using two public datasets. The results highlight that the proposed framework can efficiently protect data of smart power networks and discover abnormal behaviors, in comparison to several state-of-the-art techniques.