    CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability

    Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open-source and commercial data linkage tools, the volume and complexity of currently available datasets pose a huge challenge; hence, an efficient linkage tool with reasonable accuracy and scalability is required. Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We describe how the algorithm works and compare its performance with four open-source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value on a gold-standard dataset. We also evaluated its accuracy using a case study, and its scalability and execution time using a simulated cohort in serial (single-core) and multi-core (eight-core) computation settings. Results: Overall, the CIDACS-RL algorithm had superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) on a dataset of 20 million records. Conclusion: CIDACS-RL is an innovative linkage tool for huge datasets, offering higher accuracy, improved scalability, and substantially shorter execution time than existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors or distributed infrastructure.
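
    A minimal sketch of the indexing-search-and-scoring pattern the abstract describes, in Python. CIDACS-RL itself delegates indexing and search to Apache Lucene; the toy inverted index, field names, and weights below are illustrative assumptions, not the tool's implementation. Only the cut-off value reuses the 0.896 reported in the case study.

        # Sketch: index records on name tokens, retrieve candidates by index
        # lookup, and deterministically accept the best candidate above a cut-off.
        from collections import defaultdict
        from difflib import SequenceMatcher

        FIELDS = {"name": 0.5, "mother_name": 0.3, "birth_date": 0.2}  # assumed weights

        def build_index(records):
            """Index records by name token so candidate search avoids a full scan."""
            index = defaultdict(set)
            for rid, rec in records.items():
                for token in rec["name"].lower().split():
                    index[token].add(rid)
            return index

        def candidates(index, query):
            """All record ids sharing at least one name token with the query."""
            found = set()
            for token in query["name"].lower().split():
                found |= index.get(token, set())
            return found

        def score(a, b):
            """Weighted sum of per-field string similarities, in [0, 1]."""
            return sum(w * SequenceMatcher(None, str(a[f]), str(b[f])).ratio()
                       for f, w in FIELDS.items())

        def link(query, records, index, cutoff=0.896):
            """Return the best candidate's id if it clears the cut-off, else None."""
            best = max(candidates(index, query),
                       key=lambda rid: score(query, records[rid]), default=None)
            if best is not None and score(query, records[best]) >= cutoff:
                return best
            return None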

    Quality of record linkage in a highly automated cancer registry that relies on encrypted identity data

    Objectives: In the absence of unique ID numbers, cancer and other registries in Germany and elsewhere rely on identity data to link records pertaining to the same patient. These data are often encrypted to ensure privacy. Some record linkage errors unavoidably occur; these errors were quantified for the cancer registry of North Rhine-Westphalia, which uses encrypted identity data. Methods: A sample of records, together with their record linkage information, was drawn from the registry. In parallel, plain-text data for these records were retrieved to generate a gold standard. Record linkage error frequencies in the cancer registry were determined by comparing the results of the routine linkage with the gold standard, and error rates were projected to larger registries. Results: In the sample studied, the homonym error rate was 0.015% and the synonym error rate was 0.2%; the F-measure was 0.9921. Projection to larger databases indicated that, under realistic growth, the homonym error rate will be around 1% and the synonym error rate around 2%. Conclusion: The observed error rates are low. This shows that effective methods to standardize and improve the quality of the input data have been implemented, which is crucial to keeping error rates low as the registry's database grows. The planned inclusion of unique health insurance numbers is likely to further improve record linkage quality. Cancer registration based entirely on electronic notification of records can process large amounts of data with high record linkage quality.
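
    For readers unfamiliar with the terminology: homonym errors are false matches (two people wrongly merged, hurting precision) and synonym errors are missed matches (one person split in two, hurting recall); the F-measure combines the two. A small Python illustration with made-up counts, not the registry's actual denominators:

        def f_measure(true_matches, false_matches, missed_matches):
            """Harmonic mean of precision and recall over linked record pairs."""
            precision = true_matches / (true_matches + false_matches)
            recall = true_matches / (true_matches + missed_matches)
            return 2 * precision * recall / (precision + recall)

        # Illustrative counts only: very few false matches, somewhat more misses.
        print(f_measure(true_matches=9920, false_matches=2, missed_matches=158))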

    Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage

    Record linkage, which refers to the identification of the same entities across several databases in the absence of a unique identifier, is a crucial step in data integration. In this research, we study the effectiveness and efficiency of different machine learning algorithms (SVMs, random forests, and neural networks) for linking databases in a controlled experiment. We control for the percentage of heterogeneity in the data and the size of the training dataset. We evaluate the algorithms on (1) linkage quality, such as the F1 score under a one-threshold model, and (2) the size of the uncertain region that requires manual review under a two-threshold model. We find that random forests performed very well both on traditional metrics such as F1 score (99.2%–95.9%) and on manual review set size (7.1%–21%) for error rates from 0% to 60%. Although the algorithms (random forests, SVMs and neural networks) achieved fairly similar F1 scores, random forests outperformed the next best model by 28% on average in terms of the percentage of pairs needing manual review.
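
    A minimal sketch of the two-threshold model described above, using scikit-learn's random forest. The comparison vectors, labels, and threshold values are placeholders rather than the study's data: pairs whose match probability falls between the two thresholds form the uncertain region sent to manual review.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        X = rng.random((1000, 4))                # stand-in per-pair similarity vectors
        y = (X.mean(axis=1) > 0.6).astype(int)   # stand-in true match labels

        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        p = clf.predict_proba(X)[:, 1]           # match probability per candidate pair

        LOWER, UPPER = 0.3, 0.7                  # assumed review band
        auto_match = p >= UPPER                  # accepted without review
        auto_nonmatch = p <= LOWER               # rejected without review
        review = (p > LOWER) & (p < UPPER)       # uncertain region -> manual review
        print(f"manual review set: {review.mean():.1%} of pairs")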

    Semantic Systems. In the Era of Knowledge Graphs

    This open access book constitutes the refereed proceedings of the 16th International Conference on Semantic Systems, SEMANTiCS 2020, held in Amsterdam, The Netherlands, in September 2020. The conference was held virtually due to the COVID-19 pandemic.

    ORÁCULO: Detection of Spatiotemporal Hot Spots of Conflict-Related Events Extracted from Online News Sources

    Dissertation presented as the partial requirement for obtaining a Master's degree in Geographic Information Systems and Science. Achieving situational awareness in peace operations requires understanding where and when conflict-related activity is most intense. However, the irregular nature of most factions hinders the use of remote sensing, while winning the trust of the host populations to allow the collection of wide-ranging human intelligence is a slow process. Thus, our proposed solution, ORÁCULO, is an information system that detects spatiotemporal hot spots of conflict-related activity by analyzing the patterns of events extracted from online news sources (including social media), allowing immediate situational awareness. To do so, it combines a closed-domain supervised event extractor with emerging hot spots analysis of event space-time cubes. The prototype of ORÁCULO was tested on tweets scraped from the Twitter accounts of local and international news sources covering the Central African Republic Civil War. The test results show that it achieved near state-of-the-art event extraction performance, significant overlap with a reference event dataset, and strong correlation with the hot-spot space-time cube generated from the reference event dataset, proving the viability of the proposed solution. Future work will focus on improving event extraction performance and on testing ORÁCULO in cooperation with peacekeeping organizations. Keywords: event extraction, natural language understanding, spatiotemporal analysis, peace operations, open-source intelligence.
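
    A minimal sketch of the space-time cube that the hot-spot analysis runs over: geocoded, timestamped events are binned into a 3-D (longitude x latitude x time) grid. The emerging hot spot statistics themselves (e.g. a Getis-Ord Gi* computed per bin across time steps) are omitted, and the grid extents and bin counts below are assumptions.

        import numpy as np

        # Stand-ins for extracted events as (lon, lat, day) triples.
        events = np.array([
            [20.1, 5.2, 3], [20.2, 5.3, 3], [20.1, 5.2, 4], [23.5, 6.9, 10],
        ])

        cube, edges = np.histogramdd(
            events,
            bins=(8, 8, 12),                                 # lon x lat x time bins
            range=((14.0, 28.0), (2.0, 11.0), (0.0, 12.0)),  # rough spatial extent, 12 days
        )
        # cube[i, j, t] = number of events in spatial bin (i, j) during period t.
        print(cube.sum(), cube.max())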

    Embedding Techniques to Solve Large-scale Entity Resolution

    Entity resolution (ER) identifies and links records that belong to the same real-world entities, where an entity refers to any real-world object. It is a primary task in data integration, and accurate, efficient ER substantially impacts various commercial, security, and scientific applications. Often there are no unique identifiers for entities in datasets/databases that would make the ER task easy, so record matching depends on entity-identifying attributes and approximate matching techniques. Efficiently handling large-scale data remains an open research problem given the increasing volumes and velocities of modern data collections, and fast, scalable, real-time, approximate entity matching techniques that provide high-quality results are in high demand. This thesis proposes solutions to two challenges in large-scale ER: the lack of test datasets and the demand for fast indexing algorithms. The shortage of large-scale, real-world datasets with ground truth is a primary concern in developing and testing new ER algorithms: for many datasets there is no ground truth or 'gold standard' specifying whether two records correspond to the same entity, and obtaining test data for ER algorithms that use personal identifying keys (e.g., names, addresses) is difficult due to privacy and confidentiality issues. To address this challenge, we propose a numerical simulation model that produces realistic large-scale data for testing new methods when suitable public datasets are unavailable. One of the important findings of this work is the approximation of vectors that represent entity identification keys and their relationships, e.g., dissimilarities and errors. Indexing techniques reduce the search space and execution time in the ER process. Based on the idea of approximate vectors of entity identification keys, we propose a fast indexing technique (Em-K indexing) suitable for real-time, approximate entity matching in large-scale ER. Our Em-K indexing method provides a quick and accurate block of candidate matches for a query record by searching an existing reference database. All our solutions are metric-based: we transform metric or non-metric spaces into a lower-dimensional Euclidean space, known as the configuration space, using multidimensional scaling (MDS). This thesis discusses how to modify MDS algorithms to solve various ER problems efficiently, and proposes highly efficient and scalable approximation methods that extend the MDS algorithm to large-scale datasets. We empirically demonstrate the improvements of our proposed approaches on several datasets with various parameter settings. The outcomes show that our methods can generate large-scale testing data, perform fast real-time and approximate entity matching, and effectively scale up the mapping capacity of MDS. Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 202
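
    A minimal sketch of the embedding-then-index idea behind Em-K indexing: records are mapped into a low-dimensional Euclidean configuration space via MDS on pairwise string dissimilarities, and a k-d tree over the embedded reference database returns a block of candidate matches. The dissimilarity measure, dimensionality, and k are illustrative; the thesis's scalable MDS extensions and its mapping of new out-of-sample query records are not reproduced here.

        import numpy as np
        from scipy.spatial import cKDTree
        from sklearn.manifold import MDS

        names = ["john smith", "jon smith", "jane doe", "j. doe", "mary ann"]

        def edit_distance(a, b):
            """Levenshtein distance as a simple string dissimilarity."""
            dp = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                prev, dp[0] = dp[0], i
                for j, cb in enumerate(b, 1):
                    prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                             prev + (ca != cb))
            return dp[-1]

        # Embed the reference records into a 2-D configuration space.
        D = np.array([[edit_distance(a, b) for b in names] for a in names])
        coords = MDS(n_components=2, dissimilarity="precomputed",
                     random_state=0).fit_transform(D)

        tree = cKDTree(coords)                  # index the embedded reference records
        dist, idx = tree.query(coords[1], k=2)  # candidate block for "jon smith"
        print([names[i] for i in idx])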

    Point-of-contact interactive record linkage between demographic surveillance and health facilities to measure patterns of HIV service utilisation in Tanzania

    As significant investments and efforts have been made to strengthen HIV prevention and care services throughout sub-Saharan Africa, approaches to monitoring uptake of these services have grown in importance. Global HIV/AIDS organisations use routinely updated estimates of the UNAIDS 90-90-90 targets, which state that by 2020, 90% of all people living with HIV (PLHIV) should be diagnosed, 90% of diagnosed PLHIV should be receiving treatment, and 90% of PLHIV receiving treatment should achieve viral suppression. Currently, estimates of these targets in sub-Saharan Africa rely on population-based demographic and HIV serological surveillance systems, which comprehensively measure vital events and HIV status but rely on self-reports of health service use. In contrast, most analyses of health service use are limited to patients already diagnosed and enrolled in clinical care and lack a population perspective. This thesis aims to augment existing computer software towards a novel approach to record linkage – termed point-of-contact interactive record linkage (PIRL) – and to produce an infrastructure of linked surveillance data and medical records from clinics located within a surveillance area in northwest Tanzania. The linked data are then used to investigate methodological and substantive research questions. Paper A details the PIRL software used to collect the data for this thesis. Paper B reviews the data created by PIRL and reports record linkage statistics, including match percentages and the attributes associated with (un)successful linkage; a subset of personal identifiers was found to drive the success of the probabilistic linkage algorithm, and PIRL was shown to outperform a fully automated linkage approach. Paper C provides original evidence on bias and precision in analyses of linked data with substantial linkage errors. Paper D critiques the estimation of the first 90-90-90 target and shows that current guidelines may underestimate the percentage diagnosed by a relative factor of between 10% and 20%. Finally, Paper E determines that while HIV serological surveillance has increased testing coverage, PLHIV who were diagnosed with HIV in a facility-based clinic were statistically significantly more likely to register for HIV care than those diagnosed at village-level temporary clinics during a surveillance round; once individuals were in care, there was no evidence of further delays to treatment initiation by testing modality. The collective findings of this thesis demonstrate the feasibility of PIRL for linking community and medical records and for using the linked data to measure patterns of HIV service use in a population.
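
    A minimal sketch of the point-of-contact pattern Paper A describes: candidate records are scored probabilistically and presented, best first, to a fieldworker who confirms or rejects the link while the individual is present. The fields, m/u probabilities, and records below are illustrative assumptions, not PIRL's actual configuration.

        import math

        MU = {  # assumed Fellegi-Sunter m- and u-probabilities per field
            "name":       (0.95, 0.01),
            "birth_year": (0.90, 0.05),
            "village":    (0.85, 0.20),
        }

        def match_weight(query, candidate):
            """Sum of log2 likelihood ratios: agreement adds, disagreement subtracts."""
            w = 0.0
            for field, (m, u) in MU.items():
                if query[field] == candidate[field]:
                    w += math.log2(m / u)
                else:
                    w += math.log2((1 - m) / (1 - u))
            return w

        def rank_for_review(query, reference):
            """Order candidates best-first for interactive confirmation."""
            return sorted(reference, key=lambda r: match_weight(query, r), reverse=True)

        # Hypothetical query and reference records for illustration.
        query = {"name": "amina j.", "birth_year": 1986, "village": "village a"}
        reference = [
            {"name": "amina j.", "birth_year": 1986, "village": "village a"},
            {"name": "amina k.", "birth_year": 1990, "village": "village b"},
        ]
        for rec in rank_for_review(query, reference):
            print(round(match_weight(query, rec), 2), rec["name"])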