Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges
The growth of Big Data, especially personal data dispersed in multiple data sources, presents enormous opportunities and insights for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these entities. PPRL is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. PPRL for Big Data poses several challenges, with the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (2) achieving high quality results of the linkage in the presence of variety and veracity of Big Data, and (3) preserving privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research.
This work was partially funded by the Australian Research Council under Discovery Project DP130101801, the German Academic Exchange Service (DAAD) and Universities Australia (UA) under the Joint Research Co-operation Scheme, and also funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).
SFour: A Protocol for Cryptographically Secure Record Linkage at Scale
The prevalence of various (and increasingly large) datasets presents the challenging problem of discovering common entities dispersed across disparate datasets. Solutions to the private record linkage problem (PRL) aim to enable such explorations of datasets in a secure manner.
A two-party PRL protocol allows two parties to determine for which entities they each possess a record (either an exact matching record or a fuzzy matching record) in their respective datasets — without revealing to one another information about any entities for which they do not both possess records. Although several solutions have been proposed to solve the PRL problem, no current solution offers a fully cryptographic security guarantee while maintaining both high accuracy of output and subquadratic runtime efficiency.
To this end, we propose the first known efficient PRL protocol that runs in subquadratic time, provides high accuracy, and guarantees cryptographic security.
A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage
Today many application domains, such as national statistics,
healthcare, business analytics, fraud detection, and national
security, require data to be integrated from multiple databases.
Record linkage (RL) is a process used in data integration which
links multiple databases to identify matching records that belong
to the same entity. RL enriches the usefulness of data by
removing duplicates, errors, and inconsistencies, which improves
the effectiveness of decision making in data analytics
applications.
Often, organisations are not willing or authorised to share the
sensitive information in their databases with any other party due
to privacy and confidentiality regulations. The linkage of
databases of different organisations is an emerging research area
known as privacy-preserving record linkage (PPRL). PPRL
facilitates the linkage of databases by ensuring the privacy of
the entities in these databases.
In a multidatabase (MD) context, PPRL is significantly challenged
by the intrinsic exponential growth in the number of potential
record pair comparisons. Such linkage often requires significant
time and computational resources to produce the resulting
matching sets of records. Due to the increased risk of collusion,
preserving the privacy of the data becomes more problematic as the
number of parties involved in the linkage process increases.
Blocking is commonly used to scale the linkage of large
databases. The aim of blocking is to remove those record pairs
that correspond to non-matches (pairs that refer to different entities).
Many techniques have been proposed for RL and PPRL for blocking
two databases. However, many of these techniques are not suitable
for blocking multiple databases. This creates a need to develop
blocking techniques for the multidatabase linkage context, as
real-world applications increasingly require linking more than two
databases.
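As a rough illustration of the blocking idea described above, the following minimal Python sketch groups the records of two databases by a blocking key and only generates candidate pairs within the same block. The field names (`surname`, `birth_year`), the choice of key, and the function names are hypothetical and are not taken from the thesis. The sketch covers plain two-database blocking only: it applies no privacy protection and does not implement the multidatabase framework proposed in the thesis.

```python
from collections import defaultdict


def blocking_key(record):
    # Hypothetical blocking key: first letter of the surname plus birth year.
    return (record["surname"][:1].lower(), record["birth_year"])


def candidate_pairs(db_a, db_b):
    """Return index pairs (i, j) whose records share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for i, rec in enumerate(db_a):
        blocks[blocking_key(rec)][0].append(i)
    for j, rec in enumerate(db_b):
        blocks[blocking_key(rec)][1].append(j)

    pairs = set()
    for ids_a, ids_b in blocks.values():
        # Only records inside the same block are compared; every
        # cross-block pair is treated as a non-match and skipped.
        for i in ids_a:
            for j in ids_b:
                pairs.add((i, j))
    return pairs


# Illustrative records (made up for this sketch).
db_a = [
    {"surname": "Smith", "birth_year": 1980},
    {"surname": "Jones", "birth_year": 1975},
]
db_b = [
    {"surname": "Smyth", "birth_year": 1980},
    {"surname": "Brown", "birth_year": 1990},
]

pairs = candidate_pairs(db_a, db_b)
```

Without blocking, the two toy databases would require 2 x 2 = 4 comparisons. With this key, only the Smith/Smyth pair ends up in a shared block, so a single comparison remains.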
This thesis is the first to conduct extensive research on
blocking for multidatabase privacy-preserving record linkage
(MD-PPRL). We consider several research problems in blocking of
MD-PPRL. First, we review the broad background literature on
PPRL. This allows us to identify the main research gaps that need
to be investigated in MD-PPRL. Second, we introduce a blocking
framework for MD-PPRL which provides more flexibility and control
to database owners in the block generation process. Third, we
propose different techniques that are used in our framework for
(1) blocking of multiple databases, (2) identifying blocks that
need to be compared across subgroups of these databases, and (3)
filtering redundant record pair comparisons by the efficient
scheduling of block comparisons to improve the scalability of
MD-PPRL. Each of these techniques covers an important aspect of
blocking in real-world MD-PPRL applications. Finally, this thesis
reports on an extensive evaluation of the combined application of
these methods with real datasets, which illustrates that they
outperform existing approaches in terms of scalability, accuracy,
and privacy.
Embedding Techniques to Solve Large-scale Entity Resolution
Entity resolution (ER) identifies and links records that belong to the same real-world entities, where an entity refers to any real-world object. It is a primary task in data integration. Accurate and efficient ER substantially impacts various commercial, security, and scientific applications. Often, there are no unique identifiers for entities in datasets/databases that would make the ER task easy. Therefore record matching depends on entity identifying attributes and approximate matching techniques. Efficiently handling large-scale data remains an open research problem given the increasing volumes and velocities of modern data collections. Fast, scalable, real-time and approximate entity matching techniques that provide high-quality results are in high demand. This thesis proposes solutions to address the challenges of lack of test datasets and the demand for fast indexing algorithms in large-scale ER. The shortage of large-scale, real-world datasets with ground truth is a primary concern in developing and testing new ER algorithms. Usually, for many datasets, there is no information on the ground truth or ‘gold standard’ data that specifies if two records correspond to the same entity or not. Moreover, obtaining test data for ER algorithms that use personal identifying keys (e.g., names, addresses) is difficult due to privacy and confidentiality issues. To address this challenge, we proposed a numerical simulation model that produces realistic large-scale data to test new methods when suitable public datasets are unavailable. One of the important findings of this work is the approximation of vectors that represent entity identification keys and their relationships, e.g., dissimilarities and errors. Indexing techniques reduce the search space and execution time in the ER process.
Based on the ideas of the approximate vectors of entity identification keys, we proposed a fast indexing technique (Em-K indexing) suitable for real-time, approximate entity matching in large-scale ER. Our Em-K indexing method provides a quick and accurate block of candidate matches for a querying record by searching an existing reference database. All our solutions are metric-based. We transform metric or non-metric spaces to a lower-dimensional Euclidean space, known as configuration space, using multidimensional scaling (MDS). This thesis discusses how to modify MDS algorithms to solve various ER problems efficiently. We proposed highly efficient and scalable approximation methods that extend the MDS algorithm for large-scale datasets. We empirically demonstrate the improvements of our proposed approaches on several datasets with various parameter settings. The outcomes show that our methods can generate large-scale testing data, perform fast real-time and approximate entity matching, and effectively scale up the mapping capacity of MDS.
Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 202
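The mapping into a low-dimensional Euclidean configuration space mentioned in the abstract above can be illustrated with textbook classical (Torgerson) MDS. The sketch below assumes NumPy is available and that a full pairwise dissimilarity matrix fits in memory. The function name `classical_mds` and the toy data are illustrative only. This is not the modified or large-scale MDS variant developed in the thesis, nor the Em-K indexing method itself.

```python
import numpy as np


def classical_mds(dist, k):
    """Embed n items in k dimensions from an (n, n) dissimilarity matrix."""
    n = dist.shape[0]
    # Double-centre the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the k largest eigenpairs (eigh returns them in ascending order).
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:k]
    # Clip small negative eigenvalues that arise for non-Euclidean input.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy example: three items whose dissimilarities are exactly Euclidean
# (points at positions 0, 3 and 7 on a line).
positions = np.array([0.0, 3.0, 7.0])
dist = np.abs(positions[:, None] - positions[None, :])

coords = classical_mds(dist, k=1)
embedded = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the toy dissimilarities are exactly Euclidean, the distances between the embedded points in `embedded` reproduce `dist` up to floating-point error. For string dissimilarities between entity keys, the embedding would generally only approximate the original values.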