13 research outputs found

    Privacy preserving interactive record linkage (PPIRL)

    Record linkage to integrate uncoordinated databases is critical in biomedical research using Big Data. Balancing privacy protection against the need for high-quality record linkage requires a human–machine hybrid system to safely manage uncertainty in the ever-changing streams of chaotic Big Data.

    Point-of-contact Interactive Record Linkage (PIRL): A software tool to prospectively link demographic surveillance and health facility data.

    Linking a health and demographic surveillance system (HDSS) to data from a health facility that serves the HDSS population generates a research infrastructure for directly observed data on access to and utilization of health facility services. Many HDSS sites, however, are in areas that lack unique national identifiers or suffer from data quality issues, such as incomplete records, spelling errors, and name and residence changes, all of which complicate record linkage approaches when applied retrospectively. We developed Point-of-contact Interactive Record Linkage (PIRL) software that is used to prospectively link health records from a local health facility to an HDSS in rural Tanzania. This prospective approach to record linkage is carried out in the presence of the individual whose records are being linked, which has the advantage that any uncertainty surrounding their identity can be resolved during a brief interaction, whereby supplementary information (e.g., household membership) can be used as an additional criterion to adjudicate between multiple potential matches. Our software uses a probabilistic record linkage algorithm based on the Fellegi-Sunter model to search and rank potential matches in the HDSS data source. Key advantages of this software are its ability to perform multiple searches for the same individual and to save patient-specific notes that are retrieved during subsequent clinic visits. A search of the HDSS database (n=110,000) takes less than 15 seconds to complete. Excluding time spent obtaining written consent, the median time we spend with each patient is six minutes. In this setting, a purely automated retrospective approach to record linkage would have correctly identified only about half of the true matches and resulted in high linkage error rates, thereby highlighting the immediate benefit of conducting interactive record linkage using the PIRL software.
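
    The Fellegi-Sunter model mentioned above scores a candidate pair by summing per-field agreement weights derived from m- and u-probabilities. The sketch below is a minimal, generic illustration of that scoring and ranking step; the field names, probabilities, and example records are illustrative assumptions, not PIRL's actual configuration.

```python
import math

# Illustrative m/u probabilities per field (assumed values, not PIRL's):
# m = P(field agrees | records truly match), u = P(field agrees | non-match).
FIELDS = {
    "first_name": {"m": 0.95, "u": 0.01},
    "last_name":  {"m": 0.93, "u": 0.02},
    "birth_year": {"m": 0.90, "u": 0.05},
    "village":    {"m": 0.85, "u": 0.10},
}

def match_weight(record_a, record_b):
    """Sum per-field log2 likelihood ratios (Fellegi-Sunter agreement weights)."""
    total = 0.0
    for field, p in FIELDS.items():
        if record_a.get(field) == record_b.get(field):
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total

def rank_candidates(query, candidates, top_n=5):
    """Rank candidate records by descending match weight for interactive review."""
    scored = [(match_weight(query, c), c) for c in candidates]
    return sorted(scored, key=lambda x: x[0], reverse=True)[:top_n]

if __name__ == "__main__":
    patient = {"first_name": "amina", "last_name": "juma", "birth_year": 1987, "village": "kisesa"}
    hdss = [
        {"first_name": "amina", "last_name": "juma", "birth_year": 1987, "village": "kisesa"},
        {"first_name": "amina", "last_name": "omari", "birth_year": 1990, "village": "welamasonga"},
    ]
    for weight, rec in rank_candidates(patient, hdss):
        print(f"{weight:6.2f}  {rec}")
```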

    Quality of record linkage in a highly automated cancer registry that relies on encrypted identity data

    Objectives: In the absence of unique ID numbers, cancer and other registries in Germany and elsewhere rely on identity data to link records pertaining to the same patient. These data are often encrypted to ensure privacy. Some record linkage errors unavoidably occur. These errors were quantified for the cancer registry of North Rhine-Westphalia, which uses encrypted identity data. Methods: A sample of records was drawn from the registry, including the record linkage information. In parallel, plain-text data for these records were retrieved to generate a gold standard. Record linkage error frequencies in the cancer registry were determined by comparing the results of the routine linkage with the gold standard. Error rates were projected to larger registries. Results: In the sample studied, the homonym error rate was 0.015%; the synonym error rate was 0.2%. The F-measure was 0.9921. Projection to larger databases indicated that, under realistic growth, the homonym error rate will be around 1% and the synonym error rate around 2%. Conclusion: The observed error rates are low, which shows that effective methods to standardize and improve the quality of the input data have been implemented. This is crucial for keeping error rates low as the registry's database grows. The planned inclusion of unique health insurance numbers is likely to further improve record linkage quality. Cancer registration based entirely on electronic notification of records can process large amounts of data with a high quality of record linkage.
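
    As a reminder of how an F-measure summarizes the two error types reported above, the snippet below computes precision, recall, and F from counts of true matches, homonym errors (false merges), and synonym errors (missed merges). The counts are placeholders for illustration, not the registry's data or its exact error definitions.

```python
def linkage_quality(true_matches, homonym_errors, synonym_errors):
    """Precision/recall/F-measure from linkage error counts.

    homonym_errors: records of different persons wrongly linked (false positives)
    synonym_errors: records of the same person not linked (false negatives)
    """
    precision = true_matches / (true_matches + homonym_errors)
    recall = true_matches / (true_matches + synonym_errors)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Placeholder counts for illustration only (not the registry's figures).
p, r, f = linkage_quality(true_matches=10_000, homonym_errors=2, synonym_errors=20)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```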

    CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability

    Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets pose a huge challenge; hence, an efficient linkage tool with reasonable accuracy and scalability is required. Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold-standard dataset. We also evaluated its accuracy using a case study, and its scalability and execution time using a simulated cohort in serial (single-core) and multi-core (eight-core) computation settings. Results: Overall, the CIDACS-RL algorithm had superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. Conclusion: CIDACS-RL is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors or distributed infrastructures.
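
    CIDACS-RL combines index-based candidate retrieval with score-and-cut-off classification. The toy sketch below illustrates that two-stage idea with a simple inverted index and Jaccard token overlap; it is a conceptual sketch only, not the CIDACS-RL implementation (which relies on Apache Lucene), and the field names and cut-off are assumptions.

```python
from collections import defaultdict

def tokens(record):
    """Tokenize the linkage fields of a record into a set of lowercase tokens."""
    return {t for field in record.values() for t in str(field).lower().split()}

def build_index(dataset):
    """Inverted index: token -> ids of records containing it (candidate retrieval stage)."""
    index = defaultdict(set)
    for rid, rec in dataset.items():
        for tok in tokens(rec):
            index[tok].add(rid)
    return index

def link(query, dataset, index, cutoff=0.6):
    """Retrieve candidates sharing at least one token, score them by Jaccard overlap,
    and return the best candidate above the cut-off (or None)."""
    q_toks = tokens(query)
    candidate_ids = set().union(*(index.get(t, set()) for t in q_toks)) if q_toks else set()
    best_id, best_score = None, 0.0
    for rid in candidate_ids:
        c_toks = tokens(dataset[rid])
        score = len(q_toks & c_toks) / len(q_toks | c_toks)
        if score > best_score:
            best_id, best_score = rid, score
    return (best_id, best_score) if best_score >= cutoff else (None, best_score)

if __name__ == "__main__":
    cohort = {
        1: {"name": "maria souza", "mother": "ana souza", "dob": "1985-02-11"},
        2: {"name": "joao pereira", "mother": "rita pereira", "dob": "1990-07-30"},
    }
    idx = build_index(cohort)
    print(link({"name": "maria sousa", "mother": "ana souza", "dob": "1985-02-11"}, cohort, idx))
```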

    Big Data in Science and Healthcare: A Review of Recent Literature and Perspectives

    Objectives: As technology continues to evolve and expand in industries such as healthcare, science, education, and gaming, a sophisticated concept known as Big Data is surfacing. The concept of analytics aims to understand data. We set out to portray and discuss perspectives on the evolving use of Big Data in science and healthcare, and to examine some of the opportunities and challenges. Methods: A literature review was conducted to highlight the implications associated with the use of Big Data in scientific research and healthcare innovations, both on a large and small scale. Results: Scientists and healthcare providers may learn from one another when it comes to understanding the value of Big Data and analytics. Small data, derived by patients and consumers, also requires analytics to become actionable. Connectivism provides a framework for the use of Big Data and analytics in the areas of science and healthcare. This theory helps individuals recognize and synthesize how human connections are driving the increase in data. Despite the volume and velocity of Big Data, it is truly about technology connecting humans and assisting them to construct knowledge in new ways. Concluding Thoughts: The concept of Big Data and its associated analytics are to be taken seriously when approaching the use of vast volumes of both structured and unstructured data in science and healthcare. Future exploration of issues surrounding data privacy, confidentiality, and education is needed. A greater focus on data from social media, the quantified-self movement, and the application of analytics to small data would also be useful.

    Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes

    We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies, and no personal health identifiers such as name or date of birth, are available. It relies on Bayesian modelling of binarized diagnosis codes and provides a posterior probability of matching for each patient pair, while considering all the data at once. Both in our simulation study (using an administrative claims dataset for data generation) and in two real use cases linking patient electronic health records from a large tertiary care network, our method exhibits good performance and compares favourably to the standard baseline Fellegi-Sunter algorithm. We propose a scalable, fast and efficient open-source implementation in the ludic R package available on CRAN, which also includes the anonymized diagnosis code data from our real use case. This work suggests it is possible to link de-identified research databases stripped of any personal health identifiers using only diagnosis codes, provided sufficient information is shared between the data sources.
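
    To make the idea of matching on binarized diagnosis codes concrete, the sketch below computes a posterior probability that two code profiles belong to the same patient, using a naive independence model with a simple recording-error parameter. This is a highly simplified illustration under assumed prevalences and error rates, not the ludic package's actual Bayesian model.

```python
import math

def match_posterior(codes_a, codes_b, code_prevalence, prior=1e-4, error_rate=0.05):
    """Posterior probability that two binarized diagnosis-code profiles belong to
    the same patient, assuming codes are independent given match status.

    codes_a, codes_b: sets of diagnosis codes present in each record
    code_prevalence: dict code -> assumed P(code present) in the population
    error_rate: assumed probability a code the patient truly has is missing from one source
    """
    log_lr = 0.0
    for code, prev in code_prevalence.items():
        in_a, in_b = code in codes_a, code in codes_b
        if in_a and in_b:        # both sources report the code
            p_match, p_non = prev * (1 - error_rate) ** 2, prev ** 2
        elif in_a != in_b:       # discrepancy: only one source reports it
            p_match, p_non = prev * 2 * error_rate * (1 - error_rate), 2 * prev * (1 - prev)
        else:                    # neither source reports it
            p_match, p_non = (1 - prev) + prev * error_rate ** 2, (1 - prev) ** 2
        log_lr += math.log(p_match / p_non)   # recording errors ignored under non-match for simplicity
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * math.exp(log_lr)
    return posterior_odds / (1 + posterior_odds)

# Assumed prevalences for a handful of codes, for illustration only.
prevalence = {"E11": 0.10, "I10": 0.25, "C50": 0.01, "J45": 0.08}
print(match_posterior({"E11", "I10", "C50"}, {"E11", "C50"}, prevalence))
```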

    Pattern masking for dictionary matching

    Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is also becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary 𝒟 of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, and we are asked to compute a smallest set K ⊆ {1, 
, ℓ}, so that if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from 𝒟. Solving PMDM allows providing data utility guarantees, as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We thus approach the problem from a more practical perspective. We show a combinatorial O((dℓ)^{|K|/3} + dℓ)-time and O(dℓ)-space algorithm for PMDM for |K| = O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails [Abboud et al., SIAM J. Comput. 2018; Lincoln et al., SODA 2018]. We also generalize this algorithm to the problem of masking multiple query strings simultaneously so that every string has at least z matches in 𝒟. Note that PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from 𝒟. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in O(2^ℓ d) time. We present a data structure for PMDM that answers queries over 𝒟 in O(2^{ℓ/2}(2^{ℓ/2} + τ)ℓ) time and requires O(2^ℓ d²/τ² + 2^{ℓ/2} d) space, for any parameter τ ∈ [1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time O(d^{1/4+ε})-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture.
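
    The simple exponential-time exact algorithm mentioned above can be illustrated directly: enumerate position sets K in order of increasing size and return the first one whose wildcarding makes q match at least z dictionary strings. The sketch below is a brute-force illustration of the problem definition, practical only for very short strings; it is not one of the paper's efficient algorithms or data structures.

```python
from itertools import combinations

def pmdm_exact(dictionary, q, z):
    """Brute-force PMDM: try position sets K by increasing size and return the first
    (hence smallest) K such that masking q at K yields >= z dictionary matches.
    Exponential in len(q), in the spirit of the simple O(2^l * d) exact algorithm."""
    length = len(q)
    assert all(len(s) == length for s in dictionary)
    for k in range(length + 1):
        for K in combinations(range(length), k):
            masked = set(K)
            matches = sum(
                all(i in masked or s[i] == q[i] for i in range(length))
                for s in dictionary
            )
            if matches >= z:
                return masked
    return None  # unreachable when z <= len(dictionary): masking every position matches all strings

print(pmdm_exact(["GATTACA", "GACTATA", "CATTAGA"], "GATTAGA", 2))  # -> {0, 5}
```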

    Privacy preserving algorithms for newly emergent computing environments

    Privacy-preserving data usage ensures appropriate use of data without compromising sensitive information. Data privacy is a primary requirement, since customers' data is an asset to any organization and contains customers' private information. Simply secluding data is not a solution to keeping it private: data must be shared, while remaining private, for many purposes, e.g., company welfare, research, and business. Industries where data privacy is mandatory include healthcare, aviation, education, and federal law enforcement. In this dissertation we focus on data privacy schemes in emerging fields of computer science, namely health informatics, data mining, distributed cloud, biometrics, and mobile payments. Linking and mining medical records across different medical service providers is important to enhancing health care quality, and under the HIPAA regulation keeping medical records private is essential. In real-world health care databases, records may well contain errors, and linking error-prone data while preserving privacy is very difficult; we introduce a privacy-preserving Error-Tolerant Linking Algorithm to enable linkage of error-prone medical records. Mining frequent sequential patterns, such as patient paths and treatment patterns, across multiple medical sites helps to improve health care quality and research; we propose a privacy-preserving sequential pattern mining scheme across multiple medical sites. In a distributed cloud environment, resources are provided by users who are geographically distributed over a large area; since resources are provided by regular users, data privacy and security are main concerns, and we propose a privacy-preserving data storage mechanism among different users in a distributed cloud. Managing secret keys for encryption is difficult in a distributed cloud, so to protect secret keys we propose a multilevel threshold secret sharing mechanism. Biometric authentication verifies user identity by means of the user's biometric traits; because biometrics are unique and can be stolen or misused by an adversary, they must be protected, and we present a secure and privacy-preserving biometric authentication scheme using a watermarking technique. Mobile payments have become popular with the extensive use of mobile devices, and payment applications need to be both very secure when performing transactions and efficient. We design and develop a mobile application for secure mobile payments, focusing on the user's biometric authentication as well as secure bank transactions, and we propose a novel privacy-preserving biometric authentication algorithm for secure mobile payments.
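
    The threshold secret sharing mentioned for key protection builds on the classic Shamir scheme, in which a key is split into n shares so that any t of them reconstruct it while fewer reveal nothing. The sketch below shows plain single-level (t, n) Shamir sharing over a prime field as the standard building block only; it is not the dissertation's multilevel variant.

```python
import secrets

PRIME = 2**127 - 1  # prime field large enough for a ~127-bit secret (illustrative choice)

def split(secret, n, t):
    """Split `secret` into n shares, any t of which reconstruct it (Shamir scheme)."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):          # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret from >= t shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = secrets.randbelow(PRIME)
shares = split(key, n=5, t=3)
assert reconstruct(shares[:3]) == key and reconstruct(shares[1:4]) == key
```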

    Scalable and approximate privacy-preserving record linkage

    Record linkage, the task of linking multiple databases with the aim of identifying records that refer to the same entity, is occurring increasingly in many application areas. Generally, unique entity identifiers are not available in all the databases to be linked. Therefore, record linkage requires the use of personal identifying attributes, such as names and addresses, to identify matching records that need to be reconciled to the same entity. Often, it is not permissible to exchange personal identifying data across different organizations due to privacy and confidentiality concerns or regulations. This has led to the novel research area of privacy-preserving record linkage (PPRL). PPRL addresses the problem of how to link different databases to identify records that correspond to the same real-world entities, without revealing the identities of these entities or any private or confidential information to any party involved in the process, or to any external party, such as a researcher. The three key challenges that a PPRL solution in a real-world context needs to address are (1) scalability to large databases through efficient linkage; (2) high linkage quality through approximate (string) matching and effective classification of the compared record pairs into matches (i.e. pairs of records that refer to the same entity) and non-matches (i.e. pairs of records that refer to different entities); and (3) sufficient privacy guarantees, such that the interested parties only learn the actual values of certain attributes of the records that were classified as matches, and the process is secure with regard to any internal or external adversary. In this thesis, we present extensive research in PPRL, addressing several gaps and problems identified in existing PPRL approaches. First, we begin the thesis with a review of the literature and propose a taxonomy of PPRL to characterize existing techniques, which allows us to identify gaps and research directions. In the remainder of the thesis, we address several of the identified shortcomings. One main gap we address is the lack of a framework for empirical and comparative evaluation of different PPRL solutions, which has not been studied in the literature so far. Second, we propose several novel algorithms for scalable and approximate PPRL that address the three main challenges. We propose efficient private blocking techniques, for both three-party and two-party scenarios, based on sorted neighborhood clustering to address the scalability challenge. Next, we propose two efficient two-party techniques for private matching and classification to address the linkage quality challenge in terms of approximate matching and effective classification. Privacy is addressed in these approaches using efficient data perturbation techniques including k-anonymous mapping, reference values, and Bloom filters. Finally, the thesis reports on an extensive comparative evaluation of our proposed solutions against several other state-of-the-art techniques on real-world datasets, which shows that our solutions outperform others in terms of all three key challenges.
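
    Among the perturbation techniques listed, Bloom-filter encoding is the most widely used in PPRL: each party hashes the character bigrams of an identifier into a bit array and shares only the encodings, which can then be compared with the Dice coefficient as a proxy for string similarity. The sketch below is a minimal illustration of that general idea; the filter length, number of hash functions, and salting scheme are illustrative assumptions, not the thesis's exact protocol.

```python
import hashlib

BITS = 1000   # Bloom filter length (illustrative)
HASHES = 10   # number of hash functions (illustrative)

def bigrams(value):
    """Character bigrams of a padded, lowercased string."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(value, secret="shared-salt"):
    """Hash each bigram into HASHES positions of a BITS-long Bloom filter.
    A keyed salt (shared only by the data owners) hampers dictionary attacks."""
    bits = set()
    for gram in bigrams(value):
        for k in range(HASHES):
            digest = hashlib.sha256(f"{secret}|{k}|{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % BITS)
    return bits

def dice(bloom_a, bloom_b):
    """Dice coefficient between two Bloom filters approximates name similarity."""
    return 2 * len(bloom_a & bloom_b) / (len(bloom_a) + len(bloom_b))

# The parties exchange only the encodings, never the plain names.
print(dice(bloom_encode("Catherine Smith"), bloom_encode("Katherine Smith")))  # high similarity
print(dice(bloom_encode("Catherine Smith"), bloom_encode("Robert Jones")))     # low similarity
```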