Record Duplication Detection in Database: A Review
The recognition of similar entities in databases has gained substantial attention in many application areas. Although several techniques have been proposed to recognize and locate duplicate database records, few studies rate the effectiveness of these diverse techniques. Calculating the time complexity of the proposed methods reveals their relative performance, and shows that their efficiency improves when blocking and windowing are applied. Some domain-specific methods train systems to optimize results and improve efficiency and scalability, but they are prone to errors. Most existing methods either fail to discuss scalability or treat it superficially. Sorting and searching form an essential part of duplicate detection, but they are time-consuming. This paper therefore proposes eliminating the sorting step by using a tree structure to improve record duplication detection, which reduces the time required and offers a probable increase in scalability. For database systems, scalability is an essential property of any proposed solution, because data sizes are huge. Improving the efficiency of identifying duplicate records in databases is an essential step for data cleaning and data integration. This paper shows that the current methods fall short of providing solutions that are scalable and highly accurate while reducing the processing time of duplicate record detection. Solving this problem will improve the quality of the data used in decision-making processes.
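The tree-based idea can be sketched as follows. This is a minimal, hypothetical Python illustration (not the paper's actual algorithm): each record's normalized key is inserted into a trie, so a duplicate is flagged the moment its key is found to already terminate in the tree, in a single pass with no global sort.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a full key ends at this node


def find_duplicates(records, key_fn):
    """Detect duplicate records in one pass, without sorting.

    Each record's normalized key is inserted character by character
    into a trie; a record whose key already terminates in the trie
    is reported as a duplicate.
    """
    root = TrieNode()
    duplicates = []
    for rec in records:
        node = root
        for ch in key_fn(rec):
            node = node.children.setdefault(ch, TrieNode())
        if node.terminal:
            duplicates.append(rec)
        else:
            node.terminal = True
    return duplicates


records = [
    {"name": "Jane Doe", "city": "Oslo"},
    {"name": "John Roe", "city": "Bergen"},
    {"name": "jane doe", "city": "OSLO"},
]
# hypothetical normalization: lowercase the concatenated key attributes
key = lambda r: (r["name"] + "|" + r["city"]).lower()
print(find_duplicates(records, key))  # the third record is flagged
```

Each record costs O(k) for a key of length k, so no O(n log n) sorting pass is needed, which matches the scalability argument made in the abstract.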
Exploiting Record Similarity for Practical Vertical Federated Learning
As the privacy of machine learning has drawn increasing attention, federated learning has been introduced to enable collaborative learning without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each holds only partial features, has a wide range of real-world applications. However, existing studies in VFL rarely examine the "record linkage" process. They either design algorithms assuming the data from different parties have already been linked, or use simple linkage methods such as exact linkage or top-1 linkage. These approaches are unsuitable for many applications, such as GPS locations or noisy titles that require fuzzy matching. In this paper, we design a novel similarity-based VFL framework, FedSim, which is suitable for more real-world applications and achieves higher performance on traditional VFL tasks. Moreover, we theoretically analyze the privacy risk caused by sharing similarities. Our experiments on three synthetic datasets and five real-world datasets with various similarity metrics show that FedSim consistently outperforms other state-of-the-art baselines.
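To make the contrast with exact and top-1 linkage concrete, here is a small, hypothetical sketch of similarity-based top-k linkage on noisy strings. It is illustrative only — FedSim's actual linkage and its integration with training are more involved — and uses `difflib` similarity as a stand-in for whatever metric fits the data.

```python
import difflib


def topk_linkage(party_a, party_b, k=2):
    """For each record in party A, return the k most similar
    party-B records as (index, similarity) pairs, instead of a
    single exact or top-1 match, enabling fuzzy matching."""
    links = []
    for a in party_a:
        scored = sorted(
            ((difflib.SequenceMatcher(None, a, b).ratio(), i)
             for i, b in enumerate(party_b)),
            reverse=True,
        )
        links.append([(i, round(s, 3)) for s, i in scored[:k]])
    return links


# hypothetical noisy join keys held by two parties
titles_a = ["the matrix", "alien"]
titles_b = ["the matrx", "aliens", "titanic"]
links = topk_linkage(titles_a, titles_b)
# each A record is linked to its 2 nearest B neighbours,
# weighted by similarity, rather than forced into one exact match
```

A similarity-weighted set of candidate links is what lets downstream VFL training tolerate typos and measurement noise that would make exact linkage fail outright.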
Decreased thalamic monoamine availability in drug-induced parkinsonism
Drug-induced parkinsonism (DIP) is caused by dopamine receptor blockade and is a major cause of misdiagnosis of Parkinson's disease (PD). Striatal dopamine activity has been investigated widely in DIP; however, most studies with dopamine transporter imaging have focused on clinical characteristics and prognosis. This study investigated differences in striatal subregional monoamine availability among patients with DIP, normal controls, and patients with early PD. Thirty-five DIP patients, the same number of age-matched PD patients, and 46 healthy controls were selected for this study. Parkinsonian motor status was examined. Brain magnetic resonance imaging and positron emission tomography with 18F-N-(3-fluoropropyl)-2beta-carbomethoxy-3beta-(4-iodophenyl) nortropane were performed, and the regional standardized uptake values were analyzed with a volume-of-interest template and compared among the groups. The groups were evenly matched for age, but there were numerically more females in the DIP group. Parkinsonian motor symptoms were similar in the DIP and PD groups. Monoamine availability in the thalamus of the DIP group was lower than that of the normal controls and similar to that of the PD group. In other subregions (putamen, globus pallidus, and ventral striatum), monoamine availability in the DIP group and normal controls did not differ and was higher than that in the PD group. This difference compared to healthy subjects suggests that low monoamine availability in the thalamus could be an imaging biomarker of DIP.
Gute Praxis Datenlinkage (GPD) : Good Practice Data Linkage
The person-level linkage of different data sources (data linkage) for research purposes has been applied increasingly in Germany in recent years. However, agreed methodological standards for it are lacking. The aim of this article is to define such standards for research projects. A further intention is to provide readers with a checklist for assessing planned research projects and articles. To this end, an expert group composed of members of various professional societies has, since 2016, produced a total of 7 guidelines with 27 concrete recommendations. The Good Practice Data Linkage comprises the following guidelines: (1) research aims, research question, data sources, and resources; (2) data infrastructure and data flow; (3) data protection; (4) ethics; (5) key variables and linkage methods; (6) data validation/quality assurance; and (7) long-term data use for research questions yet to be defined. Each guideline is discussed in detail. Future updates will take scientific and data-protection developments into account.
A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage
Today many application domains, such as national statistics,
healthcare, business analytics, fraud detection, and national
security, require data to be integrated from multiple databases.
Record linkage (RL) is a process used in data integration that
links multiple databases to identify matching records belonging
to the same entity. RL enriches the usefulness of data by
removing duplicates, errors, and inconsistencies, which improves
the effectiveness of decision making in data analytics
applications.
Often, organisations are not willing or authorised to share the
sensitive information in their databases with any other party due
to privacy and confidentiality regulations. The linkage of
databases of different organisations is an emerging research area
known as privacy-preserving record linkage (PPRL). PPRL
facilitates the linkage of databases by ensuring the privacy of
the entities in these databases.
In the multidatabase (MD) context, PPRL is significantly
challenged by the intrinsic exponential growth in the number of
potential record pair comparisons. Such linkage often requires
significant time and computational resources to produce the
resulting sets of matching records. Due to the increased risk of
collusion, preserving the privacy of the data becomes more
problematic as the number of parties involved in the linkage
process increases.
Blocking is commonly used to scale the linkage of large
databases. The aim of blocking is to remove those record pairs
that correspond to non-matches (i.e. refer to different
entities). Many blocking techniques have been proposed for RL
and PPRL on two databases. However, most of these techniques are
not suitable for blocking multiple databases. This creates a
need to develop blocking techniques for the multidatabase
linkage context, as real-world applications increasingly require
more than two databases.
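As a generic illustration of the blocking idea (a sketch of the standard technique, not the framework proposed in this thesis): records are first grouped by a cheap blocking key, and only records that share a block become candidate pairs, pruning the quadratic comparison space.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, blocking_key):
    """Group records by a cheap blocking key; only records that
    share a block are ever compared in detail."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    """Yield record pairs drawn only from within each block."""
    for recs in blocks.values():
        yield from combinations(recs, 2)


# toy surname data with a deliberately simple (hypothetical) key:
# first letter plus name length stands in for a phonetic encoding
people = ["smith", "smyth", "jones", "johns", "brown"]
key = lambda name: name[0] + str(len(name))
pairs = list(candidate_pairs(block_records(people, key)))
# 5 records admit 10 possible pairs; this blocking keeps only 2
```

In PPRL the blocking keys themselves would additionally be encoded (e.g. hashed) before any party shares them, so the pruning happens without revealing raw attribute values.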
This thesis is the first to conduct extensive research on
blocking for multidatabase privacy-preserving record linkage
(MD-PPRL). We consider several research problems in blocking for
MD-PPRL. First, we start with a broad review of the background
literature on PPRL, which allows us to identify the main research
gaps that need to be investigated in MD-PPRL. Second, we
introduce a blocking framework for MD-PPRL that provides database
owners with more flexibility and control in the block generation
process. Third, we propose different techniques used in our
framework for (1) blocking multiple databases, (2) identifying
blocks that need to be compared across subgroups of these
databases, and (3) filtering redundant record pair comparisons by
efficiently scheduling block comparisons to improve the
scalability of MD-PPRL. Each of these techniques covers an
important aspect of blocking in real-world MD-PPRL applications.
Finally, this thesis reports on an extensive evaluation of the
combined application of these methods on real datasets, which
shows that they outperform existing approaches in terms of
scalability, accuracy, and privacy.
Privacy-preserving matching of similar patients
The identification of similar entities represented by records in different databases has drawn considerable attention in many application areas, including the health domain. One important type of entity matching application that is vital for quality healthcare analytics is the identification of similar patients, known as similar patient matching. A key component of identifying similar records is the calculation of the similarity of the values in attributes (fields) between these records. Due to increasing privacy and confidentiality concerns, using the actual attribute values of patient records to identify similar records across different organizations is becoming non-trivial, because the attributes in such records often contain highly sensitive information such as personal and medical details of patients. Therefore, the matching needs to be based on masked (encoded) values while remaining effective and efficient enough to allow the matching of large databases. Bloom filter encoding has been widely used as an efficient masking technique for privacy-preserving matching of string and categorical values. However, no work on Bloom filter-based masking of numerical data, such as integers (e.g. age), floating-point numbers (e.g. body mass index), and modulus values (numbers that wrap around upon reaching a certain value, e.g. date and time), which are commonly required in the health domain, has been presented in the literature. We propose a framework with novel methods for masking numerical data using Bloom filters, thereby facilitating the calculation of similarities between records. We conduct an empirical study on publicly available real-world datasets which shows that our framework provides efficient masking and achieves similar matching accuracy compared to the matching of actual unencoded patient records.
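The standard Bloom filter masking step for string values, which this paper extends to numerical data, can be sketched as follows. This is a generic illustration with assumed parameters (filter size m, hash count k), not the paper's exact method: each value's q-grams are hashed into a bit array, and the Dice coefficient of two bit arrays approximates the similarity of the unencoded values.

```python
import hashlib


def qgrams(s, q=2):
    """Split a padded, lowercased string into overlapping q-grams."""
    s = f"_{s.lower()}_"
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def bloom_encode(value, m=1000, k=20):
    """Hash each q-gram into an m-bit Bloom filter using k hash
    functions derived by double hashing; the filter is represented
    here as the set of set-bit positions."""
    bits = set()
    for g in qgrams(value):
        h1 = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        h2 = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits


def dice(b1, b2):
    """Dice coefficient of two Bloom filters: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(b1 & b2) / (len(b1) + len(b2))


print(dice(bloom_encode("peter"), bloom_encode("pete")))   # high
print(dice(bloom_encode("peter"), bloom_encode("maria")))  # low
```

Because similar strings share q-grams, their filters share bits, so the Dice score on the masked values tracks the similarity of the originals without either party revealing them; the paper's contribution is achieving the analogous property for integer, floating-point, and modulus values.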