
    Graph-based Household Matching for Linking Census Data

    Historical censuses record individual facts about a community and provide knowledge about a nation's population. These data allow researchers to reconstruct the characteristics of a specific period, trace ancestors, and follow changes in families over time. Linking census data is a difficult task because of common names, data quality issues, and household changes over time: across the decades, a household may split into multiple households due to marriage, or members may move to another household. This paper proposes a graph-based approach to linking households that takes the relationships between household members into account. Using individual record linking results, the proposed method builds household graphs, so that matches are determined by both attribute similarity and record relationship similarity. In the experimental results, the proposed method reaches an F-score of 0.974 on Ireland Census data, outperforming all alternative methods compared.
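    To make the household-graph idea above concrete, here is a minimal sketch in Python (not the paper's implementation): household members are nodes, family relationships are edges, and a candidate household pair is scored by mixing member attribute similarity with relationship similarity. The attribute schema, the greedy member matching, and the weight alpha are illustrative assumptions.

```python
# Sketch only: assumed record schema and a simple similarity mix, not the
# method evaluated in the paper.
from difflib import SequenceMatcher

ATTRS = ["name", "birthplace", "occupation"]  # assumed attribute schema

def attr_sim(rec_a, rec_b):
    """Average string similarity over shared attributes."""
    return sum(SequenceMatcher(None, rec_a[k], rec_b[k]).ratio()
               for k in ATTRS) / len(ATTRS)

def rel_sim(edges_a, edges_b):
    """Jaccard similarity of relationship edges, e.g. ('head', 'spouse')."""
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)

def household_sim(house_a, house_b, alpha=0.5):
    """Mix member-attribute similarity with graph-structure similarity;
    alpha is an assumed tuning weight."""
    member_scores = [max(attr_sim(a, b) for b in house_b["members"])
                     for a in house_a["members"]]
    return (alpha * sum(member_scores) / len(member_scores)
            + (1 - alpha) * rel_sim(house_a["edges"], house_b["edges"]))
```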

    Advanced Methods for Entity Linking in the Life Sciences

    The amount of knowledge increases rapidly due to the increasing number of available data sources. However, the autonomy of data sources and the resulting heterogeneity prevent comprehensive data analysis and applications. Data integration aims to overcome heterogeneity by unifying different data sources and enriching unstructured data. The enrichment of data consists of different subtasks, among them the annotation process. The annotation process links document phrases to terms of a standardized vocabulary. Annotated documents enable effective retrieval methods, comparability of different documents, and comprehensive data analysis, such as finding adverse drug effects based on patient data. A vocabulary enables comparability through standardized terms. An ontology can also serve as a vocabulary, with the difference that an ontology is additionally defined by concepts, relationships, and logical constraints. The annotation process is applicable in different domains. Nevertheless, generic and specialized domains differ with respect to the annotation process. This thesis emphasizes the differences between the domains and addresses the identified challenges. The majority of annotation approaches focus on the evaluation of general domains, such as Wikipedia. This thesis evaluates the developed annotation approaches on case report forms, which are medical documents for examining clinical trials. Natural language poses various challenges, such as expressing similar meanings with different phrases. The proposed annotation method, AnnoMap, considers the fuzziness of natural language. A further challenge is the reuse of verified annotations. Existing annotations represent knowledge that can be reused for further annotation processes. AnnoMap includes a reuse strategy that utilizes verified annotations to link new documents to appropriate concepts. Due to the broad spectrum of areas in the biomedical domain, different annotation tools exist, and these tools perform differently depending on the particular domain. This thesis proposes a combination approach to unify results from different tools. The method utilizes existing tool results to build a classification model that can classify new annotations as correct or incorrect. The results show that the reuse strategy and the machine learning-based combination improve the annotation quality compared to existing approaches focussing on the biomedical domain. A further part of data integration is entity resolution, which builds unified knowledge bases from different data sources. A data source consists of a set of records characterized by attributes. The goal of entity resolution is to identify records representing the same real-world entity. Many methods focus on linking data sources whose records are characterized by attributes. Nevertheless, only a few methods can handle graph-structured knowledge bases or consider temporal aspects. Temporal aspects are essential for identifying the same entities across different time intervals, since attribute values may change over time. Moreover, records can be related to other records, so that a small graph structure exists for each record. These small graphs can be linked to each other if they represent the same real-world entity. This thesis proposes an entity resolution approach for census data consisting of person records for different time intervals. The approach also considers the graph structure of persons given by family relationships.
    To achieve high-quality results, current methods apply machine learning techniques to classify record pairs as referring to the same entity. The classification task uses a model generated from training data; in this case, the training data is a set of record pairs labeled as duplicates or not. Nevertheless, the generation of training data is a time-consuming task, so active learning techniques are relevant for reducing the number of training examples. The entity resolution method for temporal graph-structured data shows an improvement compared to previous collective entity resolution approaches. The developed active learning approach achieves results comparable to supervised learning methods and outperforms other limited-budget active learning methods. Besides the entity resolution approach, the thesis introduces the concept of evolution operators for communities. These operators can express the dynamics of communities and individuals; for instance, we can express that two communities merged or split over time. Moreover, the operators allow observing the history of individuals. Overall, the presented annotation approaches generate high-quality annotations for medical forms. The annotations enable comprehensive analysis across different data sources as well as accurate queries. The proposed entity resolution approaches improve on existing ones, contributing to the generation of high-quality knowledge graphs and to data analysis tasks.
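    As a concrete illustration of the annotation step, the sketch below links document phrases to vocabulary concepts and reuses verified annotations first, in the spirit of AnnoMap but not its actual algorithm; the data structures, similarity function, and threshold are assumptions.

```python
# Sketch only: a single string-similarity matcher stands in for AnnoMap's
# combination of similarity functions.
from difflib import SequenceMatcher

def annotate(phrases, vocabulary, verified=None, threshold=0.8):
    """Link each phrase to its best concept.

    vocabulary: concept id -> term string (assumed structure)
    verified:   phrase -> concept id, previously curated annotations
    """
    verified = verified or {}
    links = {}
    for phrase in phrases:
        if phrase in verified:            # reuse strategy: trust prior curation
            links[phrase] = verified[phrase]
            continue
        best_id, best_score = None, threshold
        for concept_id, term in vocabulary.items():
            score = SequenceMatcher(None, phrase.lower(), term.lower()).ratio()
            if score >= best_score:
                best_id, best_score = concept_id, score
        if best_id is not None:
            links[phrase] = best_id
    return links
```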

    Linking historical census data across time

    Historical census data provide a snapshot of the era when our ancestors lived. Such data contain valuable information for the reconstruction of households and the tracking of family changes across time, which can be used for a variety of social science research projects. As valuable as they are, these data provide only snapshots of the main characteristics of the stock of a population. Capturing household changes requires linking person by person and household by household from one census to the next over a series of censuses. Once linked together, the census data are greatly enhanced in value. Development of an automatic or semi-automatic linking procedure will significantly relieve social scientists from the tedious task of manually linking individuals, families, and households, and can lead to an improvement in their productivity. In this thesis, a systematic solution is proposed for linking historical census data that integrates data cleaning and standardisation, as well as record and household linkage over consecutive censuses. This solution consists of several data pre-processing, machine learning, and data mining methods that address different aspects of the historical census data linkage problem. A common property of these methods is that they all adopt a strategy of considering a household as an entity, and use the whole of the household information to improve the effectiveness of data cleaning and the accuracy of record and household linkage. We first propose an approach for automatic cleaning and linking using domain knowledge. The core idea is to use household information in both the cleaning and linking steps, so that records containing errors and variations can be cleaned and standardised, and the number of wrongly linked records can be reduced. Second, we introduce a group linking method into household linkage, which enables tracking of the majority of members in a household over a period of time. The proposed method is based on the outcome of the record linkage step using either a similarity-based method or a machine learning approach. A group linking method is then applied, aiming to reduce the ambiguity of multiple household linkages. Third, we introduce a graph-based method to link households, which takes the structural relationships between household members into consideration. Based on the results of linking individual records, our method builds a graph for each household, so that matches of households in different censuses are determined by both record attribute similarities and relationship similarities. This allows household similarities to be calculated more accurately. Finally, we describe an instance classification method based on multiple instance learning. This allows an integrated solution that links both households and individual records at the same time. Our method treats group links as bags and individual record links as instances. We extend multiple instance learning from bag to instance classification in order to allow the reconstruction of bags from candidate instances. The classified bag and instance samples lead to a significant reduction in multiple group links, thereby improving the overall quality of the linked data.
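    The group linking step described above can be sketched as follows: individual record-link similarities are aggregated per candidate household pair, normalised by the source household size, and only the best target per source household is kept. The data structures and the scoring rule are illustrative assumptions, not the thesis code.

```python
# Sketch only: disambiguating multiple candidate household links by
# aggregating the individual record links between them.
from collections import defaultdict

def group_link(record_links, house_of_a, house_of_b):
    """record_links: (rec_a, rec_b, similarity) triples from record linkage;
    house_of_a / house_of_b: record id -> household id in each census."""
    scores = defaultdict(float)
    for rec_a, rec_b, sim in record_links:
        scores[(house_of_a[rec_a], house_of_b[rec_b])] += sim
    # Count members per source household to normalise the aggregate weight,
    # so large households are not automatically favoured.
    size = defaultdict(int)
    for household in house_of_a.values():
        size[household] += 1
    best = {}
    for (ha, hb), score in scores.items():
        score /= size[ha]
        if score > best.get(ha, (None, 0.0))[1]:
            best[ha] = (hb, score)
    return {ha: hb for ha, (hb, _) in best.items()}  # one target per household
```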

    Record linkage of Norwegian historical census data using machine learning

    For errata and source code: https://github.com/uit-hdl/rhd-linking. The Historical Population Register (HPR) is a project to build the longitudinal life histories of individuals by integrating the historical records of the people of Norway since the 19th century. This study attempted to improve the linking rate between the 1875 and 1900 censuses in HPR, which is currently low, using machine learning approaches. To this end, I developed a machine learning model for linking that is suitable for the Norwegian censuses and tested various algorithms, feature sets, and match selection options. I compared the results in terms of performance and match size, and also examined their representativeness of the entire population. The results showed that the linking rate of HPR can be significantly improved by machine learning approaches while maintaining high accuracy. In addition, this study provides a reference for future use by demonstrating how performance varies depending on the feature set and match selection. On the other hand, this study also revealed that linked data generally do not represent the population of the census, and that the characteristics and degree of bias vary depending on the linking algorithm, suggesting that caution is needed when using linked data for research.
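    As a rough sketch of this kind of supervised linking setup (not the study's actual features, model, or match selection), record pairs can be turned into per-attribute similarity vectors and fed to a standard classifier:

```python
# Sketch only: assumed schema and an off-the-shelf classifier.
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

FIELDS = ["first_name", "last_name", "birth_year", "birthplace"]  # assumed

def features(rec_a, rec_b):
    """One similarity score per compared field."""
    return [SequenceMatcher(None, str(rec_a[f]), str(rec_b[f])).ratio()
            for f in FIELDS]

def train_linker(labelled_pairs):
    """labelled_pairs: iterable of (rec_a, rec_b, is_match) with 0/1 labels."""
    X = [features(a, b) for a, b, _ in labelled_pairs]
    y = [label for _, _, label in labelled_pairs]
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def select_matches(model, candidate_pairs, threshold=0.5):
    """Keep pairs whose predicted match probability clears the threshold;
    stricter thresholds trade match size for accuracy."""
    probs = model.predict_proba([features(a, b) for a, b in candidate_pairs])
    return [pair for pair, p in zip(candidate_pairs, probs[:, 1]) if p >= threshold]
```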

    Three Essays on the Substance and Methods of Economic History

    This dissertation explores questions on the substance and methods of economic history. Chapter one studies a little-known policy change in the late 19th and early 20th centuries to explore the causal effects of political exclusion on the economic wellbeing of immigrants. Starting in the mid-19th century, twenty-four states and territories expanded their electorates to allow non-citizen immigrants the right to vote; from 1864 to 1926, however, these same jurisdictions reversed this policy, creating a mass disenfranchisement for which the timing varied across states. Using this variation as well as a discontinuity in naturalization proceedings of the era, I find that political exclusion led to a 25-60% reduction in the likelihood that affected immigrants obtained public sector employment. I also document significant negative intergenerational effects: individuals of immigrant parentage born around the time of disenfranchisement earned 5-9% less as adults than comparable individuals of native parentage. I am able to rule out a variety of policy and spending channels as mechanisms for this intergenerational effect, but find evidence for a reduction in English-language proficiency among disenfranchised immigrants, which may have adversely affected the human capital of their children. Chapter two explores the causes of the adoption and repeal of alien voting in the United States. This policy shift offers a valuable opportunity to understand the forces determining political inclusion and exclusion in a formative period of American democracy, and contributes to the broader literature on theories of democratization. I use qualitative evidence from the historical record to outline competing theories of both the adoption and repeal of alien voting, and then rationalize these hypotheses within the context of a median voter model. Using a discrete time hazard specification, I find evidence consistent with the hypothesis that states used alien voting as a locational amenity, with the objective of inducing immigrant in-migration in order to foster agricultural development. The results indicate that the timing of repeal was driven by social costs rather than economic or political factors, although there is evidence for heterogeneity in the correlates of support for repeal across states. Finally, the costs of constitutional change were salient for both adoption and repeal: states for which it was less costly to re-write or amend the constitution were more likely to adopt and repeal alien voting. Chapter three is a co-authored methodological study intended to assess the efficacy of commonly used techniques for creating name-linked historical datasets. The recent digitization of historical microdata has led to a proliferation of research using linked data, in which researchers use various methods to match individuals across datasets by observable characteristics; less is known, however, about the quality of the data produced using those different methods. Using two hand-linked ground-truth samples, we assess the performance of four automated linking methods and two commonly used name-cleaning algorithms. Results indicate that automated methods result in high rates of false matches, ranging from 17 to over 60 percent, and that the use of phonetic name cleaning increases the false match rate by 60 to 100 percent across methods.
We conclude by exploring the implications of erroneous matches for inference, and estimate intergenerational income elasticities for father-son pairs in the 1940 Census using samples generated by each method. We find that estimates vary with linking method, suggesting that caution must be used when interpreting parameters estimated from linked data.
    PhD, Economics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/138794/1/morghend_1.pd
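    For readers unfamiliar with phonetic name cleaning, the sketch below implements American Soundex, one widely used phonetic algorithm (the chapter evaluates two commonly used name-cleaning algorithms, not necessarily this exact variant). It also shows why phonetic cleaning can inflate false matches: distinct names collapse to the same code, e.g. soundex('Smith') == soundex('Smyth') == 'S530'.

```python
def soundex(name):
    """American Soundex: first letter plus three digits encoding consonants.
    Illustrative implementation of a standard algorithm."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h and w do not reset the previous code
            continue
        code = codes.get(ch, "")
        if code and code != prev:      # skip vowels and repeated codes
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]  # pad to four characters
```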

    Automated linking of historical data

    The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations, but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
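    In the paper's terms, the two axes of the frontier can be computed from a hand-linked ground-truth sample roughly as follows (a sketch with assumed data structures and an assumed false-positive definition, not the authors' released code):

```python
def evaluate(algorithm_links, truth_links, n_source_records):
    """algorithm_links / truth_links: sets of (source_id, target_id) pairs;
    n_source_records: size of the sample the algorithm tried to link."""
    truth = dict(truth_links)
    # A link counts as a false positive when it disagrees with the hand link.
    false_pos = sum(1 for s, t in algorithm_links if truth.get(s) != t)
    fp_rate = false_pos / len(algorithm_links)
    match_rate = len(algorithm_links) / n_source_records
    return fp_rate, match_rate  # conservative methods lower both numbers
```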

    Effective Record Linkage Techniques for Complex Population Data

    Real-world data sets are generally of limited value when analysed on their own, whereas the true potential of data can be exploited only when two or more data sets are linked to analyse patterns across records. A classic example is the need to merge medical records with travel data for effective surveillance and management of pandemics such as COVID-19 by tracing points of contact of infected individuals. Therefore, Record Linkage (RL), the process of identifying records that refer to the same entity, is an area of data science that is of paramount importance in the quest for making informed decisions based on the plethora of information available in the modern world. Two of the primary concerns of RL are obtaining linkage results of high quality and maximising efficiency. Furthermore, the lack of ground-truth data in the form of known matches and non-matches, and the privacy concerns involved in linking sensitive data, have hindered the application of RL in real-world projects. In traditional RL, methods such as blocking and indexing are generally applied to improve efficiency by reducing the number of record pairs that need to be compared. Once the record pairs retained from blocking are compared, classification methods are employed to separate matches from non-matches. Thus, the general RL process comprises blocking, comparison, classification, and finally evaluation to assess how well a linkage program has performed. In this thesis we initially provide a holistic understanding of the background of RL, and then conduct an extensive literature review of the state-of-the-art techniques applied in RL to identify current research gaps. Next, we present our initial contribution of incorporating data characteristics, such as temporal and geographic information, into unsupervised clustering, which achieves significant improvements in precision (more than 16%) at the cost of a minor reduction in recall (less than 2.5%) when applied to real-world data sets, compared to regular unsupervised clustering. We then present a novel active learning-based method to filter record pairs after the record pair comparison step to improve the efficiency of the RL process. Furthermore, we develop a novel active learning-based classification technique for RL which makes it possible to obtain high quality linkage results with limited ground-truth data. Even though semi-supervised learning techniques such as active learning methods have already been proposed in the context of RL, this is a relatively novel paradigm which is worthy of further exploration. We experimentally show an improvement of more than 35% in clustering efficiency with the application of our proposed filtering approach, and show that our active learning-based classification technique achieves linkage quality on par with or exceeding existing active learning-based classification methods. Existing RL evaluation measures such as precision and recall evaluate the classification outcome of record pairs, which can cause ambiguity when applied in the group RL context. We therefore propose a more robust RL evaluation measure which evaluates linkage quality based on how individual records have been assigned to clusters rather than on record pairs.
    Next, we propose a novel graph anonymisation technique that extends the literature by introducing methods of anonymising data to be linked in a human-interpretable manner, without compromising the structure and interpretability of the data as existing state-of-the-art anonymisation approaches do. We experimentally show how the similarity distributions of the original sensitive data sets are maintained in the anonymised data sets when our anonymisation technique is applied, which attests to its ability to preserve the structure of the original data. We finally conduct an empirical evaluation of our proposed techniques and show how they outperform existing RL methods.
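    As background for the blocking step this abstract refers to, here is a minimal sketch of standard blocking with an assumed blocking key (first letter of surname plus birth year); real RL systems use more elaborate keys and indexing:

```python
# Sketch only: blocking reduces the n*(n-1)/2 comparison space to
# within-block pairs.
from collections import defaultdict
from itertools import combinations

def block(records, key=lambda r: (r["surname"][:1].lower(), r["birth_year"])):
    """Group records by an assumed blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def candidate_pairs(records):
    """Yield only pairs that share a block; these go on to comparison
    and classification in the RL pipeline described above."""
    for recs in block(records).values():
        yield from combinations(recs, 2)
```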