11 research outputs found

    fmi-ii: Table of Contents


    Record Duplication Detection in Database: A Review

    The recognition of similar entities in databases has gained substantial attention in many application areas. Although several techniques have been proposed to recognize and locate duplicate database records, few studies rate the effectiveness of the diverse techniques used for duplicate record detection. Calculating the time complexity of the proposed methods reveals their performance rating. The time complexity calculation showed that the efficiency of these methods improved when blocking and windowing are applied. Some domain-specific methods train systems to optimize results and improve efficiency and scalability, but they are prone to errors. Most of the existing methods either fail to discuss scalability or do not consider it thoroughly. Sorting and searching form an essential part of duplication detection, but they are time-consuming. Therefore, this paper proposes eliminating the sorting process by using a tree structure to improve record duplication detection. This has the added benefit of reducing the time required and offers a probable increase in scalability. For database systems, scalability must be an inherent feature of any proposed solution because the data size is huge. Improving the efficiency of identifying duplicate records in databases is an essential step for data cleaning and data integration methods. This paper reveals that the currently proposed methods fall short of providing solutions that are scalable, highly accurate, and reduce the processing time when detecting duplicate records in databases. The ability to provide solutions to this problem will improve the quality of the data used in the decision-making process.
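
    To illustrate why blocking improves efficiency, the following is a minimal sketch rather than the method of the reviewed paper. It assumes a hypothetical blocking key (first letter of the surname plus postcode) and made-up records; only records that share a key are compared, so the number of candidate pairs drops below the full n(n-1)/2.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus the postcode.
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(records):
    """Return index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        # Compare records only within a block, never across blocks.
        pairs.update(combinations(members, 2))
    return pairs


# Made-up example records (not taken from any dataset in the paper).
records = [
    {"surname": "Smith", "postcode": "2600"},
    {"surname": "Smyth", "postcode": "2600"},
    {"surname": "Jones", "postcode": "2600"},
    {"surname": "Smith", "postcode": "3000"},
]

all_pairs = len(records) * (len(records) - 1) // 2  # exhaustive comparison count
blocked = candidate_pairs(records)
print(f"{len(blocked)} candidate pair(s) instead of {all_pairs}")
```

    The choice of key is a trade-off: a coarse key keeps more true duplicates in the same block but saves fewer comparisons, while a strict key (as here, where the two "Smith" records with different postcodes are never compared) is faster but can miss matches.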

    Exploiting Record Similarity for Practical Vertical Federated Learning

    As the privacy of machine learning has drawn increasing attention, federated learning has been introduced to enable collaborative learning without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each holds only part of the features, has a wide range of real-world applications. However, existing studies in VFL rarely examine the "record linkage" process. They either design algorithms assuming the data from different parties have already been linked or use simple linkage methods such as exact-linkage or top1-linkage. These approaches are unsuitable for many applications, such as those involving GPS locations and noisy titles, which require fuzzy matching. In this paper, we design a novel similarity-based VFL framework, FedSim, which is suitable for more real-world applications and achieves higher performance on traditional VFL tasks. Moreover, we theoretically analyze the privacy risk caused by sharing similarities. Our experiments on three synthetic datasets and five real-world datasets with various similarity metrics show that FedSim consistently outperforms other state-of-the-art baselines.

    Decreased thalamic monoamine availability in drug-induced parkinsonism

    Drug-induced parkinsonism (DIP) is caused by a dopamine receptor blockade and is a major cause of the misdiagnosis of Parkinson's disease (PD). Striatal dopamine activity has been investigated widely in DIP; however, most studies with dopamine transporter imaging have focused on the clinical characteristics and prognosis. This study investigated differences in striatal subregional monoamine availability among patients with DIP, normal controls, and patients with early PD. Thirty-five DIP patients, the same number of age-matched PD patients, and 46 healthy controls were selected for this study. Parkinsonian motor status was examined. Brain magnetic resonance imaging and positron emission tomography with 18F-N-(3-fluoropropyl)-2beta-carbon ethoxy-3beta-(4-iodophenyl) nortropane were performed, and the regional standardized uptake values were analyzed with a volume-of-interest template and compared among the groups. The groups were evenly matched for age, but there were numerically more females in the DIP group. Parkinsonian motor symptoms were similar in the DIP and PD groups. Monoamine availability in the thalamus of the DIP group was lower than that of the normal controls and similar to that of the PD group. In other subregions (putamen, globus pallidus, and ventral striatum), monoamine availability in the DIP group and normal controls did not differ and was higher than that in the PD group. This difference from healthy subjects suggests that low monoamine availability in the thalamus could be an imaging biomarker of DIP.

    Gute Praxis Datenlinkage (GPD): Good Practice Data Linkage

    Linking different data sources at the level of the individual person (data linkage) for research purposes has been used increasingly in Germany in recent years. However, agreed methodological standards for it are lacking. The aim of this article is to define such standards for research projects. A further intention is to provide readers with a checklist for assessing planned research projects and articles. To this end, an expert group composed of members of various scientific societies has, since 2016, developed a total of 7 guidelines with 27 concrete recommendations. Good Practice Data Linkage (Gute Praxis Datenlinkage) comprises the following guidelines: (1) research objectives, research question, data sources, and resources; (2) data infrastructure and data flow; (3) data protection; (4) ethics; (5) key variables and linkage methods; (6) data checking/quality assurance; and (7) long-term data use for research questions yet to be determined. Each guideline is discussed in detail. Future updates will take scientific developments and developments in data protection law into account.

    A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage

    Today many application domains, such as national statistics, healthcare, business analytics, fraud detection, and national security, require data to be integrated from multiple databases. Record linkage (RL) is a process used in data integration which links multiple databases to identify matching records that belong to the same entity. RL enriches the usefulness of data by removing duplicates, errors, and inconsistencies, which improves the effectiveness of decision making in data analytics applications. Often, organisations are not willing or authorised to share the sensitive information in their databases with any other party due to privacy and confidentiality regulations. The linkage of databases of different organisations is an emerging research area known as privacy-preserving record linkage (PPRL). PPRL facilitates the linkage of databases while ensuring the privacy of the entities in these databases. In a multidatabase (MD) context, PPRL is significantly challenged by the intrinsic exponential growth in the number of potential record pair comparisons. Such linkage often requires significant time and computational resources to produce the resulting matching sets of records. Due to the increased risk of collusion, preserving the privacy of the data becomes more problematic as the number of parties involved in the linkage process increases. Blocking is commonly used to scale the linkage of large databases. The aim of blocking is to remove those record pairs that correspond to non-matches (that is, pairs that refer to different entities). Many techniques have been proposed in RL and PPRL for blocking two databases. However, many of these techniques are not suitable for blocking multiple databases. This creates a need to develop blocking techniques for the multidatabase linkage context, as real-world applications increasingly require more than two databases. This thesis is the first to conduct extensive research on blocking for multidatabase privacy-preserving record linkage (MD-PPRL).
    We consider several research problems in blocking for MD-PPRL. First, we start with a broad survey of the background literature on PPRL. This allows us to identify the main research gaps that need to be investigated in MD-PPRL. Second, we introduce a blocking framework for MD-PPRL which provides more flexibility and control to database owners in the block generation process. Third, we propose different techniques that are used in our framework for (1) blocking of multiple databases, (2) identifying blocks that need to be compared across subgroups of these databases, and (3) filtering redundant record pair comparisons by the efficient scheduling of block comparisons to improve the scalability of MD-PPRL. Each of these techniques covers an important aspect of blocking in real-world MD-PPRL applications. Finally, this thesis reports on an extensive evaluation of the combined application of these methods with real datasets, which illustrates that they outperform existing approaches in terms of scalability, accuracy, and privacy.

    Privacy-preserving matching of similar patients

    The identification of similar entities represented by records in different databases has drawn considerable attention in many application areas, including the health domain. One important type of entity matching application that is vital for quality healthcare analytics is the identification of similar patients, known as similar patient matching. A key component of identifying similar records is the calculation of the similarity of the attribute (field) values between these records. Due to increasing privacy and confidentiality concerns, using the actual attribute values of patient records to identify similar records across different organizations is becoming non-trivial because the attributes in such records often contain highly sensitive information such as personal and medical details of patients. Therefore, the matching needs to be based on masked (encoded) values while being effective and efficient enough to allow matching of large databases. Bloom filter encoding has been widely used as an efficient masking technique for privacy-preserving matching of string and categorical values. However, no work on Bloom filter-based masking of numerical data, such as integer (e.g. age), floating point (e.g. body mass index), and modulus values (numbers that wrap around upon reaching a certain value, e.g. date and time), which are commonly required in the health domain, has been presented in the literature. We propose a framework with novel methods for masking numerical data using Bloom filters, thereby facilitating the calculation of similarities between records. We conduct an empirical study on publicly available real-world datasets which shows that our framework provides efficient masking and achieves similar matching accuracy compared to the matching of actual unencoded patient records.
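
    As background for the string case mentioned in this abstract, the sketch below shows one common way to encode a string into a Bloom filter and compare two encodings with the Dice coefficient. It is an illustration under stated assumptions, not the numerical masking framework proposed by the authors: the function names, the padded character bigrams, the filter length of 1000 bits, and the double-hashing scheme built from SHA-1 and SHA-256 are all choices made for this example.

```python
import hashlib


def bigrams(value):
    """Split a string into padded character bigrams (an assumed tokenisation)."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, num_bits=1000, num_hashes=2):
    """Return the set of bit positions set to 1 for the given string.

    Positions are derived by double hashing, (h1 + i * h2) mod num_bits,
    which is an illustrative choice for this sketch.
    """
    bits = set()
    for gram in bigrams(value):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(num_hashes):
            bits.add((h1 + i * h2) % num_bits)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of set bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Similar names share bigrams, so their filters share set bits.
sim_close = dice(bloom_encode("smith"), bloom_encode("smyth"))
sim_same = dice(bloom_encode("smith"), bloom_encode("smith"))
print(f"smith vs smyth: {sim_close:.2f}, smith vs smith: {sim_same:.2f}")
```

    Because only the bit positions are exchanged, two parties can estimate how similar two values are without revealing the values themselves, although the similarity is approximate since unrelated bigrams can collide on the same bit.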