3 research outputs found

    Aspects of Record Linkage

    This thesis is an exploration of the subject of historical record linkage. The general goal of historical record linkage is to discover relations between historical entities in a database, for any specific definition of relation, entity and database. Although this task originates from historical research, multiple disciplines are involved. Increasing volumes of data necessitate the use of automated or semi-automated linkage procedures, which is in the domain of computer science. Linkage methodologies depend heavily on the nature of the data itself, often requiring analysis based on onomastics (i.e., the study of person names) or general linguistics. To understand the dynamics of natural language one could be tempted to look at the source of language, i.e., humans, either on the individual cognitive level or as group behaviour. This further increases the multidisciplinarity of the subject by including cognitive psychology. Every discipline addresses a subset of problem aspects, all of which can contribute either to practical solutions for linkage problems or to further insights into the subject matter.
    Algorithms and the Foundations of Software technolog

    Advanced Entity Resolution Techniques

    Entity resolution is the task of determining which records in one or more data sets correspond to the same real-world entities. Entity resolution is an important problem with a range of applications for government agencies, commercial organisations, and research institutions. Due to the important practical applications and many open challenges, entity resolution is an active area of research and a variety of techniques have been developed for each part of the entity resolution process. This thesis is about trying to improve the viability of sophisticated entity resolution techniques for real-world entity resolution problems. Collective entity resolution techniques are a subclass of entity resolution approaches that incorporate relationships into the entity resolution process and introduce dependencies between matching decisions. Group linkage techniques match multiple related records at the same time. Temporal entity resolution techniques incorporate changing attribute values and relationships into the entity resolution process. Population reconstruction techniques match records with different entity roles and very limited information in the presence of domain constraints. Sophisticated entity resolution techniques such as these produce good results when applied to small data sets in an academic environment. However, they suffer from a number of limitations which make them harder to apply to real-world problems. In this thesis, we aim to address several of these limitations with the goal that this will enable such advanced entity resolution techniques to see more use in practical applications. One of the main limitations of existing advanced entity resolution techniques is poor scalability. We propose a novel size-constrained blocking framework that allows the user to set minimum and maximum block-size thresholds, and then generates blocks where the number of records in each block is within the size range. This allows efficiency requirements to be met, improves parallelisation, and allows expensive techniques with poor scalability such as Markov logic networks to be used. Another significant limitation of advanced entity resolution techniques in practice is a lack of training data. Collective entity resolution techniques make use of relationship information, so a bootstrapping process is required in order to generate initial relationships. Many techniques for temporal entity resolution, group linkage and population reconstruction also require training data. In this thesis we propose a novel approach for automatically generating high quality training data using a combination of domain constraints and ambiguity. We also show how we can incorporate these constraints and ambiguity measures into active learning to further improve the training data set. We also address the problem of parameter tuning and evaluation. Advanced entity resolution approaches typically have a large number of parameters that need to be tuned for good performance. We propose a novel approach using transitive closure that eliminates unsound parameter choices in the blocking and similarity calculation steps and reduces the number of iterations of the entity resolution process and the corresponding evaluation. Finally, we present a case study where we extend our training data generation approach for situations where relationships exist between records. We make use of the relationship information to validate the matches generated by our technique, and we also extend the concept of ambiguity to cover groups, allowing us to increase the size of the generated set of matches. We apply this approach to a very complex and challenging data set of population registry data and demonstrate that we can still create high quality training data when other approaches are inadequate.
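
The size-constrained blocking idea mentioned in the abstract above (blocks whose record counts fall between a user-chosen minimum and maximum) can be illustrated with a small sketch. This is not the framework proposed in the thesis, whose details the abstract does not give; it is a simplified merge-then-split heuristic. The function name `size_constrained_blocks`, its parameters, and the first-letter blocking key are all hypothetical. The sketch also assumes `max_size >= 2 * min_size - 1` so that splitting an oversized block never produces an undersized one, and that there are at least `min_size` records in total.

```python
from collections import defaultdict
from math import ceil


def size_constrained_blocks(records, key, min_size, max_size):
    """Group records by a blocking key, then merge small groups and split
    large ones so that block sizes fall within [min_size, max_size].

    Illustrative heuristic only (hypothetical helper, not the thesis method).
    """
    if min_size < 1 or max_size < 2 * min_size - 1:
        raise ValueError("need min_size >= 1 and max_size >= 2 * min_size - 1")

    # Initial blocking: one group per distinct key value.
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    # Merge step: walk keys in sorted order and accumulate neighbouring
    # groups until the running block reaches the minimum size.
    merged, current = [], []
    for k in sorted(groups):
        current.extend(groups[k])
        if len(current) >= min_size:
            merged.append(current)
            current = []
    if current:
        # Leftover records join the last block (or form the only block
        # when fewer than min_size records exist overall).
        if merged:
            merged[-1].extend(current)
        else:
            merged.append(current)

    # Split step: cut each oversized block into near-equal parts.
    blocks = []
    for block in merged:
        parts = ceil(len(block) / max_size)
        base, extra = divmod(len(block), parts)
        start = 0
        for i in range(parts):
            end = start + base + (1 if i < extra else 0)
            blocks.append(block[start:end])
            start = end
    return blocks


names = ["adams", "allen", "baker", "brown", "clark", "davis",
         "evans", "smith", "smyth", "stone", "scott", "shaw"]
# Block on the first letter, keeping every block between 2 and 4 records.
blocks = size_constrained_blocks(names, key=lambda n: n[0], min_size=2, max_size=4)
print([len(b) for b in blocks])
```

Bounding the block size bounds the number of pairwise comparisons per block, which is what makes it plausible to run an expensive matcher inside each block or to distribute blocks across workers.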

    Record Linkage Using Graph Consistency

    No full text