8,145 research outputs found

    Linking historical census data across time

    No full text
    Historical census data provide a snapshot of the era when our ancestors lived. Such data contain valuable information for the reconstruction of households and the tracking of family changes across time, which can be used for a variety of social science research projects. As valuable as they are, these data provide only snapshots of the main characteristics of the stock of a population. Capturing household changes requires that we link person to person and household to household from one census to the next over a series of censuses. Once linked together, the census data are greatly enhanced in value. Development of an automatic or semi-automatic linking procedure will relieve social scientists of the tedious task of manually linking individuals, families, and households, and can significantly improve their productivity. In this thesis, a systematic solution is proposed for linking historical census data that integrates data cleaning and standardisation as well as record and household linkage over consecutive censuses. This solution consists of several data pre-processing, machine learning, and data mining methods that address different aspects of the historical census data linkage problem. A common property of these methods is that they all treat a household as an entity and use the whole of the household's information to improve the effectiveness of data cleaning and the accuracy of record and household linkage. We first propose an approach for automatic cleaning and linking using domain knowledge. The core idea is to use household information in both the cleaning and linking steps, so that records that contain errors and variations can be cleaned and standardised and the number of wrongly linked records can be reduced. Second, we introduce a group linking method into household linkage, which enables tracking of the majority of members in a household over a period of time. The proposed method is based on the outcome of the record linkage step using either a similarity-based method or a machine learning approach. A group linking method is then applied, aiming to reduce the ambiguity of multiple household linkages. Third, we introduce a graph-based method to link households, which takes the structural relationships between household members into consideration. Based on the results of linking individual records, our method builds a graph for each household, so that the matches of households in different censuses are determined by both attribute relationships and record similarities. This allows household similarities to be more accurately calculated. Finally, we describe an instance classification method based on multiple instance learning. This allows an integrated solution that links both households and individual records at the same time. Our method treats group links as bags and individual record links as instances. We extend multiple instance learning from bag to instance classification in order to allow the reconstruction of bags from candidate instances. The classified bag and instance samples lead to a significant reduction in multiple group links, thereby improving the overall quality of the linked data.
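    As a rough illustration of the household-as-entity strategy described above, the sketch below scores a candidate household pair by greedily matching its member records. The attribute fields, the 0.8 threshold, and the greedy one-to-one matching are assumptions made for this example, not the algorithms developed in the thesis.

```python
# Minimal sketch of household-aware record linkage: individual records are
# compared first, then a candidate household pair is scored by how many of
# its members can be matched one-to-one. Fields and threshold are illustrative.
from difflib import SequenceMatcher
from itertools import product

def record_similarity(rec_a, rec_b, fields=("surname", "first_name", "birthplace")):
    """Average string similarity over the selected attribute fields."""
    scores = [SequenceMatcher(None, rec_a[f], rec_b[f]).ratio() for f in fields]
    return sum(scores) / len(scores)

def household_similarity(house_a, house_b, threshold=0.8):
    """Greedily match members one-to-one; return the fraction of members matched."""
    pairs = sorted(
        ((record_similarity(a, b), i, j)
         for (i, a), (j, b) in product(enumerate(house_a), enumerate(house_b))),
        reverse=True)
    used_a, used_b, matched = set(), set(), 0
    for score, i, j in pairs:
        if score >= threshold and i not in used_a and j not in used_b:
            used_a.add(i); used_b.add(j); matched += 1
    return matched / max(len(house_a), len(house_b))

h1851 = [{"surname": "Smith", "first_name": "John", "birthplace": "Cork"},
         {"surname": "Smith", "first_name": "Mary", "birthplace": "Cork"}]
h1861 = [{"surname": "Smyth", "first_name": "John", "birthplace": "Cork"},
         {"surname": "Smyth", "first_name": "Marie", "birthplace": "Cork"}]
print(household_similarity(h1851, h1861))  # 1.0: both members find a counterpart
```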

    Graph-based Household Matching for Linking Census Data

    Get PDF
    Historical censuses record individual facts about a community and provide knowledge about a nation's population. These data allow the reconstruction of households in a specific period and the tracing of ancestors and family changes over time. Linking census data is a difficult task because of common names, data quality issues, and household changes over time. Over the decades, a household may split into multiple households due to marriage or members moving to another household. This paper proposes a graph-based approach to link households, which takes the relationships between household members into account. Using individual record linking results, the proposed method builds household graphs, so that the matches are determined by both attribute similarity and record relationship similarity. According to the experimental results, the proposed method reaches an F-score of 0.974 on Ireland Census data, outperforming all alternative methods in the comparison.
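    As a toy illustration of combining record-level similarity with household structure, the sketch below mixes an attribute score with the overlap of relationship edges. The edge encoding and the equal weighting are assumptions made for this example, not the model proposed in the paper.

```python
# Toy combination of record similarity and household-structure similarity.
# Relationship edges are encoded as (member, member, relationship) tuples;
# the 0.5/0.5 weighting is an assumption for this sketch.
def structure_similarity(edges_a, edges_b):
    """Jaccard overlap of the two households' relationship edges."""
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)

def household_match_score(record_sim, edges_a, edges_b, w_attr=0.5, w_struct=0.5):
    return w_attr * record_sim + w_struct * structure_similarity(edges_a, edges_b)

edges_1901 = {("head", "wife", "married-to"), ("head", "child1", "parent-of")}
edges_1911 = {("head", "wife", "married-to"), ("head", "child1", "parent-of"),
              ("head", "child2", "parent-of")}  # a child born between censuses
print(household_match_score(0.9, edges_1901, edges_1911))  # ~0.78
```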

    Automated linking of historical data

    Full text link
    The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
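    The frontier between the false positive rate and the match rate can be traced by sweeping a score threshold over candidate links and comparing the accepted links with hand-linked ground truth. The sketch below does this on fabricated scores and true links; it is not the code or Stata commands the authors provide.

```python
# Sweep a score threshold over candidate links and report, at each threshold,
# the false positive rate among accepted links and the share of true links found.
def frontier(candidates, true_links, thresholds):
    """candidates: dict (id_a, id_b) -> score; true_links: set of correct pairs."""
    rows = []
    for t in thresholds:
        accepted = {pair for pair, score in candidates.items() if score >= t}
        fp_rate = len(accepted - true_links) / len(accepted) if accepted else 0.0
        match_rate = len(accepted & true_links) / len(true_links)
        rows.append((t, fp_rate, match_rate))
    return rows

candidates = {("a1", "b1"): 0.97, ("a2", "b2"): 0.91, ("a3", "b9"): 0.88, ("a4", "b4"): 0.72}
true_links = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
for t, fp, mr in frontier(candidates, true_links, [0.95, 0.90, 0.85, 0.70]):
    print(f"threshold={t:.2f}  false positive rate={fp:.2f}  match rate={mr:.2f}")
```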

    Creating a nationally representative individual and household sample for Great Britain, 1851 to 1901: the Victorian Panel Study (VPS)

    Full text link
    'This publication is a direct result of an earlier scoping study undertaken for the ESRC's Research Resources Board which investigated the potential for creating a new longitudinal database of individuals and households for the period 1851 to 1901 - the Victorian Panel Study (VPS). The basic concept of the VPS is to create a unique longitudinal database of individuals and households for Great Britain spanning the period 1851-1901. The proposed VPS project raises a number of methodological and logistical challenges, and it is these which are the focus of this publication. The basic idea of the VPS is simple in concept. It would take as its base the individuals and households recorded in the existing ESRC-funded computerised national two per cent sample of the 1851 British census, created by Professor Michael Anderson, and trace these through subsequent registration and census information for the fifty-year period to 1901. The result would be a linked database with each census year between 1851 and 1901 in essence acting as a surrogate 'wave', associated with information from registration events that occurred between census years. Although the idea of a VPS can be expressed in this short and simple fashion, designing and planning it, together with identifying and justifying the resources necessary to create it, is a complex set of tasks, and it is these which this publication seeks to address. The primary aims and objectives of the project described in this publication were essentially as follows: to estimate the potential user demand for a VPS and examine the uses to which it may be put; to test the suitability of the existing 1851 census sample as an appropriate starting point for a VPS; to test differing sampling and methodological issues; to investigate record-linkage strategies; to investigate the relationship between the VPS and other longitudinal data projects (both contemporary and historical); and to recommend a framework and strategy for creating a full VPS. The structure and contents of this publication follow this basic project plan.' (author's abstract)

    Occode: an end-to-end machine learning pipeline for transcription of historical population censuses

    Get PDF
    Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end machine learning pipeline that scales to the dataset size, and a model that achieves high accuracy with few manual transcriptions. In addition, the correctness of the model results must be verified. This paper describes our lessons learned developing, tuning, and using the Occode end-to-end machine learning pipeline for transcribing 7.3 million rows with handwritten occupation codes in the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned are useful for other transcription projects that plan to use machine learning in production. The source code is available at: https://github.com/uit-hdl/rhd-code
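    Two of the steps described above, routing low-confidence predictions to manual verification and comparing the predicted code distribution with the training-data distribution, can be illustrated with a short sketch. The confidence threshold and the example data are invented; the actual pipeline is in the repository linked above.

```python
# Route (code, confidence) predictions either to automatic acceptance or to a
# manual-verification queue, then summarise the distribution of accepted codes.
from collections import Counter

def route_predictions(predictions, confidence_threshold=0.9):
    """Split predictions into auto-accepted codes and codes needing manual review."""
    auto = [code for code, conf in predictions if conf >= confidence_threshold]
    manual = [code for code, conf in predictions if conf < confidence_threshold]
    return auto, manual

def distribution(codes):
    """Relative frequency of each code, for comparison with the training data."""
    total = len(codes)
    return {code: count / total for code, count in Counter(codes).items()}

predictions = [("110", 0.99), ("231", 0.95), ("110", 0.62), ("999", 0.97)]
auto, manual = route_predictions(predictions)
print(f"auto-accepted: {len(auto)}, sent to manual verification: {len(manual)}")
print("distribution of accepted codes:", distribution(auto))
```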

    Advanced Methods for Entity Linking in the Life Sciences

    Get PDF
    The amount of knowledge increases rapidly due to the growing number of available data sources. However, the autonomy of data sources and the resulting heterogeneity prevent comprehensive data analysis and applications. Data integration aims to overcome heterogeneity by unifying different data sources and enriching unstructured data. The enrichment of data consists of several subtasks, among them the annotation process, which links document phrases to terms of a standardized vocabulary. Annotated documents enable effective retrieval methods, comparability of different documents, and comprehensive data analysis, such as finding adverse drug effects based on patient data. A vocabulary enables comparability through standardized terms. An ontology can also serve as a vocabulary, although an ontology is additionally defined by concepts, relationships, and logical constraints. The annotation process is applicable in different domains; nevertheless, it differs between generic and specialized domains. This thesis emphasizes these differences and addresses the identified challenges. The majority of annotation approaches focus on the evaluation of general domains, such as Wikipedia. This thesis evaluates the developed annotation approaches on case report forms, which are medical documents used in clinical trials. Natural language presents various challenges, such as similar meanings expressed with different phrases. The proposed annotation method, AnnoMap, accounts for this fuzziness of natural language. A further challenge is the reuse of verified annotations: existing annotations represent knowledge that can be reused for further annotation processes. AnnoMap includes a reuse strategy that utilizes verified annotations to link new documents to appropriate concepts. Due to the broad spectrum of areas in the biomedical domain, different annotation tools exist, and they perform differently depending on the domain. This thesis therefore proposes a combination approach that unifies results from different tools, using existing tool results to build a classification model that classifies new annotations as correct or incorrect. The results show that the reuse strategy and the machine learning-based combination improve annotation quality compared to existing approaches focusing on the biomedical domain. A further part of data integration is entity resolution, which builds unified knowledge bases from different data sources. A data source consists of a set of records characterized by attributes, and the goal of entity resolution is to identify records representing the same real-world entity. Many methods focus on linking data sources whose records are characterized by attributes; only a few can handle graph-structured knowledge bases or consider temporal aspects. Temporal aspects are essential for identifying the same entities over different time intervals, since entity descriptions change over time. Moreover, records can be related to other records, so that a small graph structure exists for each record; these small graphs can be linked to each other if they represent the same entity. This thesis proposes an entity resolution approach for census data consisting of person records for different time intervals. The approach also considers the graph structure of persons given by family relationships.
To achieve high-quality results, current methods apply machine-learning techniques to classify record pairs as referring to the same entity. The classification task uses a model generated from training data; in this case, the training data is a set of record pairs labeled as duplicates or non-duplicates. The generation of training data is time-consuming, so active learning techniques are relevant for reducing the number of training examples. The entity resolution method for temporal graph-structured data shows an improvement compared to previous collective entity resolution approaches. The developed active learning approach achieves results comparable to supervised learning methods and outperforms other limited-budget active learning methods. Besides the entity resolution approach, the thesis introduces the concept of evolution operators for communities. These operators can express the dynamics of communities and individuals; for instance, we can state that two communities merged or split over time. Moreover, the operators allow observing the history of individuals. Overall, the presented annotation approaches generate high-quality annotations for medical forms. The annotations enable comprehensive analysis across different data sources as well as accurate queries. The proposed entity resolution approaches improve on existing ones, contributing to the construction of high-quality knowledge graphs and to data analysis tasks.
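    A minimal sketch of the phrase-to-vocabulary linking that annotation involves is shown below, using plain string similarity to stand in for a fuzzy matcher. This is not AnnoMap itself; the vocabulary, phrases, and the 0.8 cut-off are illustrative only.

```python
# Link free-text phrases to the closest terms of a controlled vocabulary using
# string similarity; phrases without a sufficiently close term are left unlinked.
from difflib import SequenceMatcher

def annotate(phrases, vocabulary, min_score=0.8):
    """Return (phrase, best matching term, score) for phrases above the cut-off."""
    links = []
    for phrase in phrases:
        term, score = max(
            ((t, SequenceMatcher(None, phrase.lower(), t.lower()).ratio())
             for t in vocabulary),
            key=lambda pair: pair[1])
        if score >= min_score:
            links.append((phrase, term, score))
    return links

vocabulary = ["myocardial infarction", "diabetes mellitus", "hypertension"]
phrases = ["myocardial infarctions", "high blood pressure", "diabetes melitus"]
for phrase, term, score in annotate(phrases, vocabulary):
    print(f"{phrase!r} -> {term!r} ({score:.2f})")  # "high blood pressure" stays unlinked
```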

    Linking Scottish vital event records using family groups

    Get PDF
    Funding: This work was supported by ESRC Grants ES/K00574X/2 “Digitising Scotland” and ES/L007487/1 “Administrative Data Research Centre – Scotland.” The reconstitution of populations through linkage of historical records is a powerful approach to generate longitudinal historical microdata resources of interest to researchers in various fields. Here we consider automated linking of the vital events recorded in the civil registers of birth, death and marriage compiled in Scotland, to bring together the various records associated with the demographic events in the life course of each individual in the population. From the histories, the genealogical structure of the population can then be built up. Rather than apply standard linkage techniques to link the individuals on the available certificates, we explore an alternative approach, inspired by the family reconstitution techniques adopted by historical demographers, in which the births of siblings are first linked to form family groups, after which intergenerational links between families can be established. We report a small-scale evaluation of this approach, using two district-level data sets from Scotland in the late nineteenth century, for which sibling links have already been created by demographers. We show that quality measures of up to 83% can be achieved on these data sets (using F-Measure, a combination of precision and recall). In the future, we intend to compare the results with a standard linkage approach and to investigate how these various methods may be used in a project which aims to link the entire Scottish population from 1856 to 1973.
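    The family-group idea, bundling birth records that agree closely on the parents' names into candidate sibling sets before any intergenerational linking, can be sketched as follows. The records, fields, and the 0.9 threshold are invented for the example and do not reflect the project's actual linkage pipeline.

```python
# Greedy sibling grouping: a birth record joins the first existing group whose
# seed record has sufficiently similar parent names, otherwise it starts a group.
from difflib import SequenceMatcher

def parents_similarity(rec_a, rec_b):
    key_a = f"{rec_a['father']} / {rec_a['mother_maiden']}"
    key_b = f"{rec_b['father']} / {rec_b['mother_maiden']}"
    return SequenceMatcher(None, key_a.lower(), key_b.lower()).ratio()

def sibling_groups(births, threshold=0.9):
    groups = []
    for rec in births:
        for group in groups:
            if parents_similarity(rec, group[0]) >= threshold:
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups

births = [
    {"child": "Ann", "father": "John MacDonald", "mother_maiden": "Mary Stewart"},
    {"child": "Robert", "father": "John McDonald", "mother_maiden": "Mary Stewart"},
    {"child": "Jane", "father": "William Fraser", "mother_maiden": "Agnes Ross"},
]
print([[r["child"] for r in g] for g in sibling_groups(births)])  # [['Ann', 'Robert'], ['Jane']]
```

    The quality measure reported above, F-Measure, is the harmonic mean of precision and recall: F = 2PR / (P + R).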

    Three Essays on the Substance and Methods of Economic History

    Full text link
    This dissertation explores questions on the substance and methods of economic history. Chapter one studies a little-known policy change in the late 19th and early 20th centuries to explore the causal effects of political exclusion on the economic wellbeing of immigrants. Starting in the mid-19th century, twenty-four states and territories expanded their electorates to allow non-citizen immigrants the right to vote; from 1864-1926, however, these same jurisdictions reversed this policy, creating a mass disenfranchisement whose timing varied across states. Using this variation as well as a discontinuity in naturalization proceedings of the era, I find that political exclusion led to a 25-60% reduction in the likelihood that affected immigrants obtained public sector employment. I also document significant negative intergenerational effects: individuals of immigrant parentage born around the time of disenfranchisement earned 5-9% less as adults than comparable individuals of native parentage. I am able to rule out a variety of policy and spending channels as mechanisms for this intergenerational effect, but find evidence for a reduction in English-language proficiency among disenfranchised immigrants, which may have adversely affected the human capital of their children. Chapter two explores the causes of the adoption and repeal of alien voting in the United States. This policy shift offers a valuable opportunity to understand the forces determining political inclusion and exclusion in a formative period of American democracy, and contributes to the broader literature on theories of democratization. I use qualitative evidence from the historical record to outline competing theories of both the adoption and repeal of alien voting, and then rationalize these hypotheses within the context of a median voter model. Using a discrete time hazard specification, I find evidence consistent with the hypothesis that states used alien voting as a locational amenity, with the objective of inducing immigrant in-migration in order to foster agricultural development. The results indicate that the timing of repeal was driven by social costs, rather than economic or political factors, although there is evidence for heterogeneity in correlates of support for repeal across states. Finally, the costs of constitutional change were salient for both adoption and repeal: states for which it was less costly to re-write or amend the constitution were more likely to adopt and repeal alien voting. Chapter three is a co-authored methodological study intended to assess the efficacy of commonly used techniques to create name-linked historical datasets. The recent digitization of historical microdata has led to a proliferation of research using linked data, in which researchers use various methods to match individuals across datasets by observable characteristics; less is known, however, about the quality of the data produced using those different methods. Using two hand-linked ground-truth samples, we assess the performance of four automated linking methods and two commonly used name-cleaning algorithms. Results indicate that automated methods result in high rates of false matches – ranging from 17 to over 60 percent – and that the use of phonetic name cleaning increases the false match rate by 60-100 percent across methods.
We conclude by exploring the implications of erroneous matches for inference, and estimate intergenerational income elasticities for father-son pairs in the 1940 Census using samples generated by each method. We find that estimates vary with linking method, suggesting that caution must be used when interpreting parameters estimated from linked data.
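    The kind of estimate the third chapter reports, an intergenerational income elasticity from a log-log regression of sons' earnings on fathers' earnings in a linked sample, can be illustrated with fabricated data, along with the attenuating effect of erroneous links. None of the values below are 1940 Census figures.

```python
# Estimate the elasticity as the OLS slope of log(son's income) on
# log(father's income), then repeat after randomly re-pairing 20% of the sample
# to mimic false matches; mismatched pairs pull the slope toward zero.
import math
import random

def elasticity(father_income, son_income):
    """OLS slope of log(son) on log(father)."""
    x = [math.log(f) for f in father_income]
    y = [math.log(s) for s in son_income]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var

random.seed(0)
fathers = [random.uniform(800, 3000) for _ in range(500)]
sons = [f ** 0.4 * random.uniform(8, 12) for f in fathers]  # built-in elasticity of 0.4
shuffled = sons[:]
random.shuffle(shuffled)
mixed = sons[:400] + shuffled[400:]  # last 20% of pairs are erroneous links
print(f"correctly linked sample: {elasticity(fathers, sons):.2f}")
print(f"with 20% false links:    {elasticity(fathers, mixed):.2f}")
```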
