
    Data Integration: Techniques and Evaluation

    Within the DIECOFIS framework, ec3, the Division of Business Statistics from the Vienna University of Economics and Business Administration, and ISTAT worked together to find methods to create a comprehensive database of enterprise data required for taxation microsimulations via integration of existing disparate enterprise data sources. This paper provides an overview of the broad spectrum of investigated methodology (including exact and statistical matching as well as imputation) and related statistical quality indicators, and emphasises the relevance of data integration, especially for official statistics, as a means of using available information more efficiently and improving the quality of a statistical agency's products. Finally, an outlook on an empirical study comparing different exact matching procedures in the maintenance of Statistics Austria's Business Register is presented.
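As a minimal illustration of the exact-matching step surveyed above, the sketch below joins two toy enterprise tables on a shared identifier. The field names (`firm_id`, `employees`, `turnover`) and the data are invented for the example and do not come from the DIECOFIS project.

```python
# Two disparate enterprise data sources sharing a unique firm identifier.
register = [
    {"firm_id": "A100", "employees": 12},
    {"firm_id": "A200", "employees": 250},
]
tax_data = [
    {"firm_id": "A100", "turnover": 1.4e6},
    {"firm_id": "A300", "turnover": 9.9e5},
]

def exact_match(left, right, key):
    """Join records that share the same value of a unique identifier."""
    index = {rec[key]: rec for rec in right}
    matched, unmatched = [], []
    for rec in left:
        if rec[key] in index:
            matched.append({**rec, **index[rec[key]]})
        else:
            unmatched.append(rec)  # candidates for statistical matching/imputation
    return matched, unmatched

matched, unmatched = exact_match(register, tax_data, "firm_id")
```

Records left unmatched by the exact step are exactly those for which the abstract's statistical-matching and imputation techniques would then come into play.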

    Quality and complexity measures for data linkage and deduplication

    Summary. Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures
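The chapter's point that measures in the space of record-pair comparisons can deceive is easy to see with a small calculation: the number of pairs grows quadratically with the number of records, so true negatives dominate and pair-space accuracy looks near-perfect even when precision and recall are mediocre. The counts below are invented for illustration.

```python
def pair_space_metrics(tp, fp, fn, n_records):
    """Quality measures computed over all record-pair comparisons."""
    total_pairs = n_records * (n_records - 1) // 2
    tn = total_pairs - tp - fp - fn   # non-matches vastly outnumber matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total_pairs
    return precision, recall, f_measure, accuracy

# A mediocre linkage (60% precision and recall) over 10,000 records
# still yields accuracy above 99.99% in the pair space.
p, r, f, acc = pair_space_metrics(tp=600, fp=400, fn=400, n_records=10_000)
```

This is why precision-, recall-, and F-measure-style metrics, rather than accuracy, are usually recommended for assessing linkage quality.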

    A Bayesian Approach to Graphical Record Linkage and De-duplication

    We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Comment: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211. In press, Journal of the American Statistical Association: Theory and Methods (2015).
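The bipartite representation can be sketched in a few lines: each posterior sample assigns every record to a latent entity, and the probability that two records co-refer is the fraction of samples in which they share an entity. This is a toy illustration of how the structure yields transitive linkage probabilities, not the authors' actual MCMC sampler; the record names and samples are invented.

```python
# Each sample maps record id -> latent entity id (records are never linked
# to each other directly, only through the shared latent entity).
posterior_samples = [
    {"r1": 0, "r2": 0, "r3": 1},
    {"r1": 0, "r2": 0, "r3": 0},
    {"r1": 0, "r2": 1, "r3": 1},
    {"r1": 0, "r2": 0, "r3": 1},
]

def match_probability(samples, a, b):
    """Posterior probability that records a and b refer to the same entity."""
    return sum(s[a] == s[b] for s in samples) / len(samples)

p12 = match_probability(posterior_samples, "r1", "r2")  # 3 of 4 samples agree
```

Because linkage is mediated by the latent entity, these probabilities are automatically transitive in each sample, and downstream analyses can average over samples to propagate linkage uncertainty.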

    Entity Resolution using Convolutional Neural Network

    Entity resolution is an important application in the field of data cleaning. Standard approaches such as deterministic methods and probabilistic methods are generally used for this purpose. Many new approaches using single-layer perceptrons, crowdsourcing, etc. have been developed to improve the efficiency and reduce the time of entity resolution. The approaches used for this purpose also depend on the type of dataset, labeled or unlabeled. This paper presents a new method for labeled data which uses a single-layer convolutional neural network to perform entity resolution. It also describes how crowdsourcing can be used with the output of the convolutional neural network to further improve the accuracy of the approach while minimizing the cost of crowdsourcing. The paper also discusses the data pre-processing steps used for training the convolutional neural network. Finally, it describes the airplane sensor dataset which is used for demonstration of this approach and shows the experimental results achieved using the convolutional neural network.
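The core pre-processing task in such approaches is turning a pair of strings into a similarity signal that a classifier can consume. The sketch below uses a thresholded character-bigram Jaccard similarity as a deliberately simple stand-in for the paper's learned CNN classifier; the function names, threshold, and example strings are all invented for illustration.

```python
def ngrams(s, n=2):
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of the two strings' bigram sets."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

def same_entity(a, b, threshold=0.5):
    # A fixed threshold stands in for the trained classifier's decision.
    return jaccard(a, b) >= threshold

match = same_entity("Boeing 737-800", "BOEING 737 800")
```

In a CNN-based pipeline, such raw pairwise features (or the character sequences themselves) would be fed to the network instead of a hand-set threshold, and low-confidence pairs could be escalated to crowdsourcing as the paper describes.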

    A hierarchical Bayesian approach to record linkage and population size problems

    We propose and illustrate a hierarchical Bayesian approach for matching statistical records observed on different occasions. We show how this model can be profitably adopted both in record linkage problems and in capture--recapture setups, where the size of a finite population is the real object of interest. There are at least two important differences between the proposed model-based approach and the current practice in record linkage. First, the statistical model is built up on the actually observed categorical variables and no reduction (to 0--1 comparisons) of the available information takes place. Second, the hierarchical structure of the model allows a two-way propagation of the uncertainty between the parameter estimation step and the matching procedure so that no plug-in estimates are used and the correct uncertainty is accounted for both in estimating the population size and in performing the record linkage. We illustrate and motivate our proposal through a real data example and simulations. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/10-AOAS447.
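To see how record linkage feeds population-size estimation, the sketch below applies the classical Chapman bias-corrected Lincoln-Petersen estimator, treating the number of linked record pairs as the recapture count. This is the standard plug-in baseline that the paper's hierarchical model improves upon (by propagating linkage uncertainty instead of fixing the match count); the counts are invented.

```python
def lincoln_petersen_chapman(n1, n2, m):
    """Chapman's bias-corrected two-occasion population size estimate.

    n1, n2: records captured on the first and second occasion
    m:      records matched (i.e. linked) across both occasions
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# 120 records in list one, 100 in list two, 30 linked pairs.
estimate = lincoln_petersen_chapman(n1=120, n2=100, m=30)
```

In the plug-in approach, any linkage error in `m` flows straight into the estimate with no accounting for its uncertainty, which is exactly the shortcoming the two-way uncertainty propagation in the hierarchical model addresses.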

    Identification, data combination and the risk of disclosure

    Businesses routinely rely on econometric models to analyze and predict consumer behavior. Estimation of such models may require combining a firm's internal data with external datasets to take into account sample selection, missing observations, omitted variables and errors in measurement within the existing data source. In this paper we point out that these data problems can be addressed when estimating econometric models from combined data using data mining techniques under mild assumptions regarding the data distribution. However, data combination leads to serious threats to the security of consumer data: we demonstrate that point identification of an econometric model from combined data is incompatible with restrictions on the risk of individual disclosure. Consequently, if a consumer model is point identified, the firm would (implicitly or explicitly) reveal the identity of at least some of the consumers in its internal data. More importantly, we provide an argument that unless the firm places a restriction on the individual disclosure risk when combining data, even if the raw combined dataset is not shared with a third party, an adversary or a competitor can gather confidential information regarding some individuals from the estimated model.
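The disclosure mechanism at issue is a classic quasi-identifier linkage: joining a de-identified internal table with a public external source on attribute combinations that happen to be unique. The sketch below is a toy version of such a re-identification attack; all field names and records are invented for illustration.

```python
internal = [  # firm's data with names removed
    {"zip": "1010", "birth_year": 1980, "purchases": 17},
    {"zip": "1020", "birth_year": 1975, "purchases": 3},
]
public = [  # external dataset carrying identities
    {"name": "Alice", "zip": "1010", "birth_year": 1980},
    {"name": "Bob",   "zip": "1030", "birth_year": 1990},
]

def reidentify(internal, public, keys=("zip", "birth_year")):
    """Attach identities wherever a quasi-identifier combination is unique."""
    disclosed = []
    for rec in internal:
        hits = [p for p in public if all(p[k] == rec[k] for k in keys)]
        if len(hits) == 1:  # unique combination -> individual disclosure
            disclosed.append({**rec, "name": hits[0]["name"]})
    return disclosed

leaked = reidentify(internal, public)
```

The paper's stronger point is that the estimated model itself can leak such information even when the combined dataset is never released, which is why it argues for explicit restrictions on individual disclosure risk at combination time.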

    Changes in mortality patterns and associated socioeconomic differentials in a rural South African setting: findings from population surveillance in Agincourt, 1993-2013

    A thesis submitted to the Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Doctor of Philosophy (by publications), 20th December 2017. Understanding a population’s mortality and disease patterns and their determinants is important for setting locally relevant health and development priorities, identifying critical elements for strengthening of health systems, and determining the focus of health services and programmes. This thesis investigates changes in socioeconomic status (SES), the cause composition of overall mortality, and the socioeconomic patterning of mortality that occurred in a rural population in Agincourt, northeast South Africa over the period 1993-2013 using Health and Demographic Surveillance Systems (HDSS) data. It also assesses the feasibility of applying record linkage techniques to integrate data from HDSS and health facilities in order to enhance the utility of HDSS data for studying mortality and disease patterns and their determinants and implications in populations in resource-poor settings where vital registration systems are often weak. Results show a steady increase in the proportion of households that own assets associated with greater modern wealth and convergence towards the middle of the SES distribution over the period 2001-2013. However, improvements in SES were slower for poorer households and persistently varied by ethnicity, with former Mozambican refugees being at a disadvantage. The population experienced a steady and substantial increase in overall and communicable disease-related mortality from the mid-1990s to the mid-2000s, peaking around 2005-07 due to the HIV/AIDS epidemic. Overall mortality steadily declined afterwards following the reduction in HIV/AIDS-related mortality due to the widespread introduction of free antiretroviral therapy (ART) available from public health facilities. 
By 2013, however, the cause of death distribution was yet to reach the levels it occupied in the early 1990s. Overall, the poorest individuals in the population experienced the highest mortality burden, and HIV/AIDS and tuberculosis mortality persistently showed an inverse relation with SES throughout the period 2001-13. Although mortality from non-communicable diseases (NCDs) increased over time in both sexes and injuries were a prominent cause of death in males, neither of these causes of death showed consistent significant associations with household SES. A hybrid approach of deterministic followed by probabilistic record linkage, and the use of an extended set of conventional identifiers that included another household member’s first name, yielded the best results for linking data from the Agincourt HDSS and health facilities, with a sensitivity of 83.6% and a positive predictive value (PPV) of 95.1% for the best fully automated approach. In general, the findings highlight the need to identify the chronically poorest individuals and target them with interventions that can improve their SES and take them out of the vicious circle of poverty. The results also highlight the need for integrated health-care planning and programme delivery strategies to increase access to and uptake of HIV testing, linkage to care and ART, and prevention and treatment of NCDs, especially among the poorest individuals, to reduce the inequalities in cause-specific and overall mortality. The findings also contribute to the evidence base to inform further refinement and advancement of the health and epidemiological transition theory. Furthermore, the findings demonstrate the feasibility of linking HDSS data with data from health facilities, which would facilitate population-based investigations on the effect of socioeconomic disparities in the utilisation of healthcare services on mortality risk. 
Keywords: Agincourt, Cause of death composition, Epidemiological Transition, Health and Demographic Surveillance System (HDSS), Household assets, HIV/AIDS, Index of Inequality, InterVA, Mortality, Non-communicable Diseases, Population Surveillance, Record linkage, Rural, Socioeconomic Status, South Africa, Verbal Autopsy, Wealth Index
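The sensitivity and PPV figures the thesis reports for its linkage evaluation follow directly from counts of true links found, false links made, and true links missed. The sketch below shows the definitions; the counts are illustrative values chosen only to reproduce figures of the same order as those reported (83.6% sensitivity, 95.1% PPV), not the thesis data.

```python
def linkage_quality(tp, fp, fn):
    """Sensitivity (recall) and positive predictive value (precision)
    of a record linkage run, from true/false positives and false negatives."""
    sensitivity = tp / (tp + fn)   # share of true links that were found
    ppv = tp / (tp + fp)           # share of made links that are correct
    return sensitivity, ppv

# Hypothetical counts from evaluating links against a gold standard.
sens, ppv = linkage_quality(tp=836, fp=43, fn=164)
```

A deterministic pass tends to push PPV up (exact identifier agreement rarely links wrong pairs), while the probabilistic follow-up recovers true links with typographical variation, lifting sensitivity, which is the rationale for the hybrid approach described above.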

    Synthetic sequence generator for recommender systems - memory biased random walk on sequence multilayer network

    Personalized recommender systems rely on each user's personal usage data in the system in order to assist in decision making. However, privacy policies protecting users' rights prevent these highly personal data from being publicly available to a wider researcher audience. In this work, we propose a memory biased random walk model on a multilayer sequence network as a generator of synthetic sequential data for recommender systems. We demonstrate the applicability of the synthetic data in training recommender system models for cases when privacy policies restrict clickstream publishing. Comment: The new updated version of the paper.
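The generator idea can be sketched as a random walk over transitions observed in real sequences, with the walk biased toward items it has already visited (the "memory" component). This is a deliberately simplified first-order, single-layer sketch, not the authors' multilayer model; the clickstream data, bias factor, and names are invented.

```python
import random
from collections import defaultdict

def build_transitions(sequences):
    """Count item-to-item transitions observed in real sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def memory_biased_walk(counts, start, length, memory_bias=2.0, seed=0):
    """Generate a synthetic sequence; previously visited items get
    their transition weight multiplied by memory_bias."""
    rng = random.Random(seed)
    walk, visited = [start], {start}
    for _ in range(length - 1):
        nbrs = counts.get(walk[-1])
        if not nbrs:               # dead end: no observed outgoing transition
            break
        items = list(nbrs)
        weights = [nbrs[i] * (memory_bias if i in visited else 1.0)
                   for i in items]
        nxt = rng.choices(items, weights=weights)[0]
        walk.append(nxt)
        visited.add(nxt)
    return walk

clicks = [["home", "search", "item1", "home"], ["home", "item1", "cart"]]
walk = memory_biased_walk(build_transitions(clicks), "home", 5)
```

Because every step follows a transition that occurs in the real data, aggregate statistics of the synthetic clickstreams can resemble the originals while no actual user's sequence is published.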