522 research outputs found

    Machine Learning Approaches to Identify Nicknames from A Statewide Health Information Exchange

    Get PDF
    Patient matching is essential to minimize fragmentation of patient data. Existing patient matching efforts often do not account for nickname use. We sought to develop decision models that could identify true nicknames using features representing the phonetical and structural similarity of nickname pairs. We identified potential male and female name pairs from the Indiana Network for Patient Care (INPC), and developed a series of features that represented their phonetical and structural similarities. Next, we used the XGBoost classifier and hyperparameter tuning to build decision models to identify nicknames using these feature sets and a manually reviewed gold standard. Decision models reported high Precision/Positive Predictive Value and Accuracy scores for both male and female name pairs despite the low number of true nickname matches in the datasets under study. Ours is one of the first efforts to identify patient nicknames using machine learning approaches

    Matching records in multiple databases using a hybridization of several technologies.

    Get PDF
    A major problem with integrating information from multiple databases is that the same data objects can exist in inconsistent data formats across databases and a variety of attribute variations, making it difficult to identify matching objects using exact string matching. In this research, a variety of models and methods have been developed and tested to alleviate this problem. A major motivation for this research is that the lack of efficient tools for patient record matching still exists for health care providers. This research is focused on the approximate matching of patient records with third party payer databases. This is a major need for all medical treatment facilities and hospitals that try to match patient treatment records with records of insurance companies, Medicare, Medicaid and the veteran\u27s administration. Therefore, the main objectives of this research effort are to provide an approximate matching framework that can draw upon multiple input service databases, construct an identity, and match to third party payers with the highest possible accuracy in object identification and minimal user interactions. This research describes the object identification system framework that has been developed from a hybridization of several technologies, which compares the object\u27s shared attributes in order to identify matching object. Methodologies and techniques from other fields, such as information retrieval, text correction, and data mining, are integrated to develop a framework to address the patient record matching problem. This research defines the quality of a match in multiple databases by using quality metrics, such as Precision, Recall, and F-measure etc, which are commonly used in Information Retrieval. The performance of resulting decision models are evaluated through extensive experiments and found to perform very well. The matching quality performance metrics, such as precision, recall, F-measure, and accuracy, are over 99%, ROC index are over 99.50% and mismatching rates are less than 0.18% for each model generated based on different data sets. This research also includes a discussion of the problems in patient records matching; an overview of relevant literature for the record matching problem and extensive experimental evaluation of the methodologies, such as string similarity functions and machine learning that are utilized. Finally, potential improvements and extensions to this work are also presented

    Sample-based XPath Ranking for Web Information Extraction

    Get PDF
    Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute

    Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage

    Get PDF
    Introduction The need for increased privacy protection in data linkage has driven the development of privacy-preserving record linkage (PPRL) techniques. A popular technique using Bloom filters with cryptographic analyses, modifications, and hashing variations to optimise privacy has been the focus of much research in this area. With few applications of Bloom filters within a probabilistic framework, there is limited information on whether approximate matches between Bloom filtered fields can improve linkage quality. Objectives In this study, we evaluate the effectiveness of three approximate comparison methods for Bloom filters within the context of the Fellegi-Sunter model of recording linkage: Sørensen–Dice coefficient, Jaccard similarity and Hamming distance. Methods Using synthetic datasets with introduced errors to simulate datasets with a range of data quality and a large real-world administrative health dataset, the research estimated partial weight curves for converting similarity scores (for each approximate comparison method) to partial weights at both field and dataset level. Deduplication linkages were run on each dataset using these partial weight curves. This was to compare the resulting quality of the approximate comparison techniques with linkages using simple cut-off similarity values and only exact matching. Results Linkages using approximate comparisons produced significantly better quality results than those using exact comparisons only. Field level partial weight curves for a specific dataset produced the best quality results. The Sørensen-Dice coefficient and Jaccard similarity produced the most consistent results across a spectrum of synthetic and real-world datasets. Conclusion The use of Bloom filter similarity comparisons for probabilistic record linkage can produce linkage quality results which are comparable to Jaro-Winkler string similarities with unencrypted linkages. Probabilistic linkages using Bloom filters benefit significantly from the use of similarity comparisons, with partial weight curves producing the best results, even when not optimised for that particular dataset

    A Taxonomy of Privacy-Preserving Record Linkage Techniques

    Get PDF
    The process of identifying which records in two or more databases correspond to the same entity is an important aspect of data quality activities such as data pre-processing and data integration. Known as record linkage, data matching or entity resolution, this process has attracted interest from researchers in fields such as databases and data warehousing, data mining, information systems, and machine learning. Record linkage has various challenges, including scalability to large databases, accurate matching and classification, and privacy and confidentiality. The latter challenge arises because commonly personal identifying data, such as names, addresses and dates of birth of individuals, are used in the linkage process. When databases are linked across organizations, the issue of how to protect the privacy and confidentiality of such sensitive information is crucial to successful application of record linkage. In this paper we present an overview of techniques that allow the linking of databases between organizations while at the same time preserving the privacy of these data. Known as 'privacy-preserving record linkage' (PPRL), various such techniques have been developed. We present a taxonomy of PPRL techniques to characterize these techniques along 15 dimensions, and conduct a survey of PPRL techniques. We then highlight shortcomings of current techniques and discuss avenues for future research

    A framework for accurate, efficient private record linkage

    Get PDF
    • …
    corecore