577 research outputs found

Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets.

Author: A McCallum
Adrian P. Brown
Christian Borgs
CJ Bradley
D Karapiperis
D Rosman
D Vatsalan
DP Jutte
E Durham
EA Durham
EL Brook
F Niedermeyer
G Lawrence
GH Shah
IA Binswanger
J Smith
JH Boyd
JJ Trinckes
JMM Evans
M Kroll
M Kuzu
MA Hernández
MG Maxfield
P Christen
R Schnell
Rainer Schnell
SA McDonald
Sean M. Randall
SM Randall
TG Kristensen
TL Dassanayake
TN Herzog
Z Wan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. Real-world performance of these techniques using large-scale data is unknown up to now. Methods: Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated. Results: Clear-text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time, but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine-tuning of parameters. Conclusions: We argue that the increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
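
The linkage-quality measures named in this abstract (recall, precision and F-measure) can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `linkage_quality` and the representation of links as sets of record-ID pairs are assumptions made for illustration, and each pair is assumed to be stored in one canonical order.

```python
def linkage_quality(predicted_pairs, gold_pairs):
    """Return (precision, recall, F-measure) for a set of predicted links.

    Both arguments are iterables of (id_a, id_b) tuples. This is a
    hypothetical helper; the gold standard plays the role of the
    clear-text linkage result described in the abstract.
    """
    predicted = set(predicted_pairs)
    gold = set(gold_pairs)
    true_positives = len(predicted & gold)

    # Precision: share of predicted links that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard links that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    predicted = {(1, 2), (3, 4), (5, 6)}
    gold = {(1, 2), (3, 4), (7, 8), (9, 10)}
    print(linkage_quality(predicted, gold))
```

In this toy input, two of the three predicted links are correct and two of the four true links are found.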

Directory of Open Access Journals

espace@Curtin

Privacy preserving record linkage meets record linkage using unencrypted data

Author: Hesam Izakian
Publication venue: 'Swansea University'
Publication date: 01/08/2018

Introduction: Privacy-preserving record linkage (PPRL) addresses privacy concerns because it can link encrypted identifiers. It encrypts identifiers using Bloom filters and matches records on the encrypted data using Dice coefficient similarity. Matching on hashed identifiers reduces linkage performance because information is lost. Objectives and Approach: We propose a technique to optimize the Bloom filter parameters and examine whether the optimal parameters increase linkage performance in terms of precision, recall, and F-measure. Consider a set of string values and calculate the similarity between any two of them using the Jaro-Winkler method. Now encrypt the string values using Bloom filters and calculate the similarity between any two of them using the Dice coefficient. Optimal Bloom filter parameters are those that minimize the difference between the similarities calculated with Jaro-Winkler and those calculated with the Dice coefficient. Results: Using publicly available data, several first-name and last-name datasets, each comprising 1000 unique values, were generated. The following values for the Bloom filter parameters were considered: q in q-grams (q=1,2,3), bit array length (l=50,100,200,500,1000), and number of hash functions (k=5,10,20,50). The following five setups of Bloom filters were able to minimize the difference between the similarities calculated on encrypted data using the Dice coefficient and the similarities calculated on unencrypted data using the Jaro-Winkler method: q=1,l=1000,k=50 / q=1,l=500,k=20 / q=2,l=1000,k=50 / q=3,l=500,k=50. These setups were used to perform data linkage over 10 synthetically generated datasets. Results show that PPRL was able to achieve performance similar to data linkage over unencrypted data.
Conclusion/Implications: This study showed that optimal Bloom filter parameters minimized the loss of information resulting from data encryption. Experimental findings indicated that PPRL using optimal Bloom filter parameters achieves almost the same performance as data linkage on unencrypted data in terms of precision, recall, and F-measure.
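
The encoding this abstract describes (q-grams hashed into a Bloom filter of length l with k hash functions, compared with the Dice coefficient) can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the padding character, the HMAC-SHA256 double-hashing scheme, the key value, and the names `qgrams`, `bloom_encode` and `dice` are all hypothetical choices.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Split a string into its set of padded q-grams (padding is an assumption)."""
    pad = "_" * (q - 1)
    padded = pad + value.lower() + pad
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value, q=2, length=1000, k=20, key=b"hypothetical-shared-key"):
    """Return the set of bit positions set to 1 in a Bloom filter of `length` bits.

    Each q-gram is mapped to k positions by double hashing: the two halves of
    one keyed HMAC-SHA256 digest serve as the two base hashes. This scheme is
    an assumption for illustration, not the one used in the study.
    """
    positions = set()
    for gram in qgrams(value, q):
        digest = hmac.new(key, gram.encode("utf-8"), hashlib.sha256).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        for i in range(k):
            positions.add((h1 + i * h2) % length)
    return positions


def dice(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of set-bit positions."""
    total = len(bits_a) + len(bits_b)
    return 2 * len(bits_a & bits_b) / total if total else 0.0


if __name__ == "__main__":
    smith = bloom_encode("smith")
    print(dice(smith, bloom_encode("smyth")))  # similar names: high score
    print(dice(smith, bloom_encode("jones")))  # unrelated names: low score
```

The parameters `q`, `length` and `k` correspond to the q, l and k values varied in the study; comparing the Dice scores with a clear-text string similarity over many name pairs is the kind of difference the proposed optimization minimizes.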

Directory of Open Access Journals

Privacy Preserving Probabilistic Record Linkage (P3RL): a novel method for linking existing health-related data and maintaining participant confidentiality.

Author: Clough-Gorr Kerri M
Schmidlin Kurt
Spoerri Adrian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

BACKGROUND: Record linkage of existing individual health care data is an efficient way to answer important epidemiological research questions. Reuse of individual health-related data faces several problems: either a unique personal identifier, such as a social security number, is not available, or non-unique person-identifiable information, such as names, is privacy protected and cannot be accessed. A solution to protect privacy in probabilistic record linkage is to encrypt this sensitive information. Unfortunately, encrypted hash codes of two names differ completely if the plain names differ by only a single character. Therefore, standard encryption methods cannot be applied. To overcome these challenges, we developed the Privacy Preserving Probabilistic Record Linkage (P3RL) method. METHODS: In this Privacy Preserving Probabilistic Record Linkage method we apply a three-party protocol, with two sites collecting individual data and an independent trusted linkage center as the third partner. Our method consists of three main steps: pre-processing, encryption and probabilistic record linkage. Data pre-processing and encryption are done at the sites by local personnel. To guarantee similar quality and format of variables and an identical encryption procedure at each site, the linkage center generates semi-automated pre-processing and encryption templates. To retrieve information (i.e. data structure) for the creation of templates without ever accessing plain person-identifiable information, we introduced a novel method of data masking. Sensitive string variables are encrypted using Bloom filters, which enables the calculation of similarity coefficients. For date variables, we developed special encryption procedures to handle the most common date errors. The linkage center performs probabilistic record linkage with encrypted person-identifiable information and plain non-sensitive variables.
RESULTS: In this paper we describe step by step how to link existing health-related data using encryption methods to preserve the privacy of persons in the study. CONCLUSION: Privacy Preserving Probabilistic Record Linkage expands record linkage facilities in settings where a unique identifier is unavailable and/or regulations restrict access to the non-unique person-identifiable information needed to link existing health-related data sets. Automated pre-processing and encryption fully protect sensitive information, ensuring participant confidentiality. This method is suitable not just for epidemiological research but also for any setting with similar challenges.

Bern Open Repository and Information System (BORIS)

Approximate Two-Party Privacy-Preserving String Matching with Linear Complexity

Author: Beck Martin
Kerschbaum Florian
Publication date: 12/02/2013

Consider two parties who want to compare their strings, e.g., genomes, but do not want to reveal them to each other. We present a system for privacy-preserving matching of strings, which differs from existing systems by providing a deterministic approximation instead of an exact distance. It is efficient (linear complexity), non-interactive and does not involve a third party, which makes it particularly suitable for cloud computing. We extend our protocol such that it mitigates iterated differential attacks proposed by Goodrich. Further, an implementation of the system is evaluated and compared against current privacy-preserving string matching algorithms. Comment: 6 pages, 4 figures

arXiv.org e-Print Archive

An efficient Privacy-Preserving Record Linkage Technique for Administrative Data and Censuses

Author: Schnell R.
Publication venue: 'IOS Press'
Publication date: 01/01/2014

Increasingly, administrative data is being used for statistical purposes, such as for registry-based census taking. Due to privacy concerns, this often requires linking separate files containing information on the same unit without revealing the identity of the unit. If the linkage has to be done without a unique identification number, it is necessary to compare keys derived from personal identifiers. When dealing with large files such as census data, comparing each possible pair of keys for two files is impossible. Therefore, special algorithms (blocking methods) must be used to reduce the number of comparisons needed. If the identifiers have to be encrypted due to privacy concerns, the number of available algorithms for record linkage and blocking is very limited. This paper describes the combination of a recently introduced encryption method for identifiers with a novel algorithm for blocking. Simulations show that the performance of these techniques allows their use for Big Data applications, censuses and population registries.
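
The general idea of blocking mentioned in this abstract, comparing only records that share a blocking key instead of every possible pair, can be sketched as below. This shows standard key-based blocking only; it is not the novel blocking algorithm the paper introduces, and the function name `candidate_pairs`, the toy records and the key function are assumptions for illustration.

```python
from collections import defaultdict


def candidate_pairs(file_a, file_b, key_func):
    """Return index pairs (i, j) whose records share the same blocking key.

    Standard blocking sketch: records from file_b are grouped by key, and
    each record from file_a is paired only with records in its own block.
    """
    blocks = defaultdict(list)
    for idx_b, record in enumerate(file_b):
        blocks[key_func(record)].append(idx_b)

    pairs = []
    for idx_a, record in enumerate(file_a):
        # .get avoids creating empty blocks for keys absent from file_b.
        for idx_b in blocks.get(key_func(record), []):
            pairs.append((idx_a, idx_b))
    return pairs


if __name__ == "__main__":
    # Hypothetical records: "<birth year>-<surname>"; the year is the block key.
    file_a = ["1980-smith", "1975-jones", "1980-meyer"]
    file_b = ["1980-smyth", "1990-jones"]
    pairs = candidate_pairs(file_a, file_b, key_func=lambda rec: rec[:4])
    print(len(pairs), "of", len(file_a) * len(file_b), "possible comparisons")
```

With encrypted identifiers the key itself must be derived without revealing clear text, which is the constraint that limits the available blocking algorithms.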

City Research Online

Cryptanalysis of Basic Bloom Filters Used for Privacy Preserving Record Linkage

Author: Kroll M.
Niedermeyer F.
Schnell R.
Steinmetzer S.
Publication date: 31/12/2014

Bloom filter encoded identifiers are increasingly used for privacy-preserving record linkage applications because they allow for errors in encrypted identifiers. However, little research on the security of Bloom filters has been published so far. In this paper, we formalize a successful attack on Bloom filters composed of bigrams. It has previously been assumed in the literature that an attacker knows the global data set from which a sample is drawn. In contrast, we suppose that an attacker does not know this global data set. Instead, we assume the adversary knows a publicly available list of the most frequent attributes. The attack is based on subtle filtering and elementary statistical analysis of encrypted bigrams. The attack described in this paper can be used for the deciphering of a whole database instead of only a small subset of the most frequent names, as in previous research. We illustrate our proposed method with an attack on a database of encrypted surnames. Finally, we describe modifications of the Bloom filters for preventing similar attacks.

City Research Online

Privacy preserving record linkage in the presence of missing values

Author: Chi Yuan
Hong Jun
Jurek Anna
Liu Weiru
O'Reilly Dermot
Publication venue: 'Elsevier BV'
Publication date: 05/07/2017

The problem of record linkage is to identify records from two datasets that refer to the same entities (e.g. patients). A particular issue in record linkage is the presence of missing values in records, which has not been fully addressed. Another issue is how privacy and confidentiality can be preserved in the process of record linkage. In this paper, we propose an approach for privacy-preserving record linkage in the presence of missing values. For any missing value in a record, our approach imputes the similarity measure between the missing value and the value of the corresponding field in any of the possible matching records from another dataset. We use the k-NNs (k nearest neighbours in the same dataset) of the record with the missing value and their distances to the record for similarity imputation. For privacy preservation, our approach uses the Bloom filter protocol in the settings of both standard privacy-preserving record linkage without missing values and privacy-preserving record linkage with missing values. We have conducted an experimental evaluation using three pairs of synthetic datasets with different rates of missing values. Our experimental results show the effectiveness and efficiency of our proposed approach.
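
The similarity imputation this abstract describes can be sketched as follows. The abstract states only that the k nearest neighbours and their distances are used; the inverse-distance weighting, the function name `impute_similarity` and its arguments are assumptions for illustration, not the authors' formula.

```python
def impute_similarity(neighbour_dists, neighbour_sims, k=3, eps=1e-9):
    """Impute a field similarity for a record whose field value is missing.

    neighbour_dists[i] is the distance from the incomplete record to
    neighbour i in the same dataset; neighbour_sims[i] is the similarity
    between neighbour i's value in the field and the candidate record's
    value. The k closest neighbours are averaged with inverse-distance
    weights (an assumed weighting scheme).
    """
    if not neighbour_dists or len(neighbour_dists) != len(neighbour_sims):
        raise ValueError("need equally long, non-empty distance and similarity lists")

    # Keep the k neighbours with the smallest distance.
    ranked = sorted(zip(neighbour_dists, neighbour_sims), key=lambda pair: pair[0])[:k]
    # eps guards against division by zero for neighbours at distance 0.
    weights = [1.0 / (dist + eps) for dist, _ in ranked]
    weighted_sum = sum(w * sim for w, (_, sim) in zip(weights, ranked))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # Three neighbours; only the two closest are used when k=2.
    print(impute_similarity([1.0, 2.0, 4.0], [0.8, 0.4, 0.0], k=2))
```

Closer neighbours contribute more to the imputed similarity, so a record that resembles its neighbours in the observed fields inherits their similarity to the candidate in the missing field.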

Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation

Author: A Beimel
A Geissbuhler
AF Karr
AL Potosky
Antonis Michalas
AS Lunde
B Pinkas
BA Malin
BA Stewart
BH Bloom
C Clifton
C Friedman
C Quantin
D Vatsalan
EA Durham
G Cormode
G Hripcsak
GM Weber
IS Kohane
J Gichoya
J Vaidya
JF Ludvigsson
JH Holmes
JL Warren
Johan Gustav Bellika
JT Finnell
K Emam El
Kassaye Yitbarek Yigzaw
L Fan
L Lenert
LH Curtis
M Kantarcioglu
MA Hailemichael
MA Hernández
MK Ross
O Goldreich
P Christen
P Paillier
P Saint-Andre
R Cramer
R Lazarus
R Schnell
RL Richesson
S Tarkoma
SC Pohlig
SM Randall
T Dimitriou
W Du
WB Lober
Y Lindell
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. Methods: We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. Results: The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N − 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. Conclusions: The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.