Data Integration: Techniques and Evaluation
Within the DIECOFIS framework, ec3, the Division of Business
Statistics from the Vienna University of Economics and Business
Administration and ISTAT worked together to find methods to create a
comprehensive database of enterprise data required for taxation microsimulations
via integration of existing disparate enterprise data sources. This
paper provides an overview of the broad spectrum of investigated
methodology (including exact and statistical matching as well as
imputation) and related statistical quality indicators, and emphasises the
relevance of data integration, especially for official statistics, as a means of
using available information more efficiently and improving the quality of a
statistical agency's products. Finally, an outlook on an empirical study
comparing different exact matching procedures in the maintenance of
Statistics Austria's Business Register is presented.
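The exact matching procedures compared in such a study can be illustrated with a minimal sketch. Everything here is hypothetical (the field names, records, and normalisation rule are not taken from Statistics Austria's register): two sources are joined wherever a normalised key, such as an enterprise identifier, agrees exactly.

```python
# Minimal sketch of exact matching on a shared key (all data hypothetical).
def normalise(key):
    """Canonicalise a key so trivial formatting differences do not block a match."""
    return key.strip().upper().replace("-", "")

def exact_match(source_a, source_b, key="enterprise_id"):
    """Return pairs of records from the two sources whose normalised keys agree."""
    index_b = {normalise(rec[key]): rec for rec in source_b}
    matches = []
    for rec in source_a:
        hit = index_b.get(normalise(rec[key]))
        if hit is not None:
            matches.append((rec, hit))
    return matches

register = [{"enterprise_id": "AT-001", "name": "Firm A"},
            {"enterprise_id": "AT-002", "name": "Firm B"}]
survey   = [{"enterprise_id": "at001", "turnover": 1200},
            {"enterprise_id": "AT-003", "turnover": 800}]

pairs = exact_match(register, survey)  # one pair: AT-001 matches at001
```

The choice of normalisation is what distinguishes one exact matching procedure from another; stricter rules trade recall for precision.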
Quality and complexity measures for data linkage and deduplication
Summary. Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures
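Why measures in the space of record pair comparisons can be deceptive is easy to see numerically: linking two files of n records each produces n² candidate pairs but at most n true matches, so true negatives dominate. A minimal sketch with hypothetical counts:

```python
def linkage_quality(tp, fp, fn, tn):
    """Standard quality measures computed over record pair comparisons."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Two files of 1,000 records each: 1,000,000 candidate pairs,
# at most ~1,000 of which are true matches (hypothetical counts).
acc, prec, rec, f1 = linkage_quality(tp=700, fp=100, fn=300, tn=998_900)
# accuracy exceeds 0.999 even though recall is only 0.7
```

Accuracy rewards the overwhelming mass of easy non-matches; precision, recall, and the F-measure ignore the true negatives and so reflect linkage quality directly.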
A Bayesian Approach to Graphical Record Linkage and De-duplication
We propose an unsupervised approach for linking records across arbitrarily
many files, while simultaneously detecting duplicate records within files. Our
key innovation involves the representation of the pattern of links between
records as a bipartite graph, in which records are directly linked to latent
true individuals, and only indirectly linked to other records. This flexible
representation of the linkage structure naturally allows us to estimate the
attributes of the unique observable people in the population, calculate
transitive linkage probabilities across records (and represent this visually),
and propagate the uncertainty of record linkage into later analyses. Our method
makes it particularly easy to integrate record linkage with post-processing
procedures such as logistic regression, capture-recapture, etc. Our linkage
structure lends itself to an efficient, linear-time, hybrid Markov chain Monte
Carlo algorithm, which overcomes many obstacles encountered by previous
record linkage approaches, despite the high-dimensional parameter space. We
illustrate our method using longitudinal data from the National Long Term Care
Survey and with data from the Italian Survey on Household and Wealth, where we
assess the accuracy of our method and show it to be better in terms of error
rates and empirical scalability than other approaches in the literature. Comment: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211. In press, Journal of the American Statistical Association: Theory and Methods (2015).
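The bipartite representation described above (records linked only to latent individuals, never directly to each other) can be sketched in a few lines. The latent-entity labels below are fixed by hand purely for illustration; in the Bayesian model they are sampled by MCMC:

```python
from collections import defaultdict

# Records from several files, each carrying a latent-entity label
# (hypothetical assignment; the model infers these labels).
linkage = {"file1.rec1": 0, "file1.rec2": 1,
           "file2.rec1": 0, "file2.rec2": 2,
           "file3.rec1": 0}

def coreference_clusters(linkage):
    """Group records by their latent individual. Records are only indirectly
    linked to one another, via the entity they share, so transitivity of
    co-reference holds by construction."""
    clusters = defaultdict(set)
    for rec, entity in linkage.items():
        clusters[entity].add(rec)
    return dict(clusters)

clusters = coreference_clusters(linkage)
# Entity 0 is observed in all three files; the transitive links
# file1.rec1 ~ file2.rec1 ~ file3.rec1 never need pairwise checks.
population_size = len(clusters)  # number of latent individuals observed
```

Because every record points at exactly one latent individual, cluster membership (and hence a population-size estimate in capture-recapture settings) falls out of the representation for free.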
Entity Resolution using Convolutional Neural Network
Entity resolution is an important application in the field of data cleaning. Standard approaches such as deterministic and probabilistic methods are generally used for this purpose. Many newer approaches, using single-layer perceptrons, crowdsourcing, etc., have been developed to improve efficiency and to reduce the time entity resolution takes. The choice of approach also depends on whether the dataset is labeled or unlabeled. This paper presents a new method for labeled data that uses a single-layer convolutional neural network to perform entity resolution. It also describes how crowdsourcing can be combined with the output of the convolutional neural network to further improve the accuracy of the approach while minimizing the cost of crowdsourcing. The paper also discusses the data pre-processing steps used for training the convolutional neural network. Finally, it describes the airplane sensor dataset used to demonstrate the approach and presents the experimental results achieved with the convolutional neural network.
A hierarchical Bayesian approach to record linkage and population size problems
We propose and illustrate a hierarchical Bayesian approach for matching
statistical records observed on different occasions. We show how this model can
be profitably adopted both in record linkage problems and in capture--recapture
setups, where the size of a finite population is the real object of interest.
There are at least two important differences between the proposed model-based
approach and the current practice in record linkage. First, the statistical
model is built up on the actually observed categorical variables and no
reduction (to 0--1 comparisons) of the available information takes place.
Second, the hierarchical structure of the model allows a two-way propagation of
the uncertainty between the parameter estimation step and the matching
procedure so that no plug-in estimates are used and the correct uncertainty is
accounted for both in estimating the population size and in performing the
record linkage. We illustrate and motivate our proposal through a real data
example and simulations. Comment: Published at http://dx.doi.org/10.1214/10-AOAS447 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
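The "reduction to 0-1 comparisons" that the proposal above avoids is the conventional practice of collapsing each record pair into a vector of per-field agreement indicators. A minimal sketch of that conventional reduction, with hypothetical fields and records, makes clear how much information it discards:

```python
def comparison_vector(rec_a, rec_b, fields):
    """Conventional reduction of a record pair to a 0-1 agreement pattern:
    everything beyond agree/disagree per field is thrown away."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)

a = {"surname": "ROSSI", "year": 1970, "region": "Lazio"}
b = {"surname": "ROSSI", "year": 1971, "region": "Lazio"}

gamma = comparison_vector(a, b, ["surname", "year", "region"])  # (1, 0, 1)
```

Note that a one-year discrepancy and a fifty-year discrepancy both reduce to the same 0; modelling the observed categorical values directly, as the abstract describes, retains that distinction.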
Identification, data combination and the risk of disclosure
Businesses routinely rely on econometric models to analyze and predict consumer behavior. Estimating such models may require combining a firm's internal data with external datasets to account for sample selection, missing observations, omitted variables and measurement error in the existing data source. In this paper we point out that these data problems can be addressed when estimating econometric models from combined data using data mining techniques, under mild assumptions about the data distribution. However, data combination poses serious threats to the security of consumer data: we demonstrate that point identification of an econometric model from combined data is incompatible with restrictions on the risk of individual disclosure. Consequently, if a consumer model is point identified, the firm would (implicitly or explicitly) reveal the identity of at least some of the consumers in its internal data. More importantly, we argue that unless the firm restricts the individual disclosure risk when combining data, an adversary or a competitor can gather confidential information about some individuals from the estimated model, even if the raw combined dataset is never shared with a third party.
Changes in mortality patterns and associated socioeconomic differentials in a rural South African setting: findings from population surveillance in Agincourt, 1993-2013
A thesis submitted to the Faculty of Health Sciences, University of the
Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of
Doctor of Philosophy (by publications)
20th December 2017.

Understanding a population’s mortality and disease patterns and their determinants is important for
setting locally-relevant health and development priorities, identifying critical elements for strengthening
of health systems, and determining the focus of health services and programmes. This thesis investigates
changes in socioeconomic status (SES), cause composition of overall mortality and the socioeconomic
patterning of mortality that occurred in a rural population in Agincourt, northeast South Africa over the
period 1993-2013 using Health and Demographic Surveillance Systems (HDSS) data. It also assesses the
feasibility of applying record linkage techniques to integrate data from HDSS and health facilities in order
to enhance the utility of HDSS data for studying mortality and disease patterns and their determinants
and implications in populations in resource-poor settings where vital registration systems are often weak.
Results show a steady increase in the proportion of households that own assets associated with greater
modern wealth and convergence towards the middle of the SES distribution over the period 2001-2013.
However, improvements in SES were slower for poorer households and persistently varied by ethnicity
with former Mozambican refugees being at a disadvantage. The population experienced a steady and
substantial increase in overall and communicable-disease-related mortality from the mid-1990s to the
mid-2000s, peaking around 2005-07 due to the HIV/AIDS epidemic. Overall mortality steadily declined
afterwards following reduction in HIV/AIDS-related mortality due to the widespread introduction of free
antiretroviral therapy (ART) available from public health facilities. By 2013, however, the cause of death
distribution was yet to reach the levels it occupied in the early 1990s. Overall, the poorest individuals
in the population experienced the highest mortality burden and HIV/AIDS and tuberculosis mortality
persistently showed an inverse relation with SES throughout the period 2001-13. Although mortality
from non-communicable diseases (NCDs) increased over time in both sexes and injuries were a prominent
cause of death in males, neither of these causes of death showed consistent significant associations with
household SES. A hybrid approach of deterministic followed by probabilistic record linkage, and the use of
an extended set of conventional identifiers that included another household member’s first name yielded
the best results for linking data from the Agincourt HDSS and health facilities with a sensitivity of
83.6% and a positive predictive value (PPV) of 95.1% for the best fully automated approach. In general,
the findings highlight the need to identify the chronically poorest individuals and target them with
interventions that can improve their SES and take them out of the vicious circle of poverty. The results
also highlight the need for integrated health-care planning and programme delivery strategies to increase
access to and uptake of HIV testing, linkage to care and ART, and prevention and treatment of NCDs
especially among the poorest individuals to reduce the inequalities in cause-specific and overall mortality.
The findings also contribute to the evidence base to inform further refinement and advancement of the
health and epidemiological transition theory. Furthermore, the findings demonstrate the feasibility of
linking HDSS data with data from health facilities which would facilitate population-based investigations
on the effect of socioeconomic disparities in the utilisation of healthcare services on mortality risk.
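The hybrid linkage strategy the thesis found best (a deterministic pass first, then probabilistic scoring of the residual pairs, using an extended identifier set including another household member's first name) can be sketched as follows. The field names, agreement weights, and threshold are all hypothetical, not those used for the Agincourt HDSS linkage:

```python
def deterministic_pass(rec_a, rec_b):
    """Accept immediately when all conventional identifiers agree exactly."""
    keys = ("first_name", "sex", "birth_year")
    return all(rec_a[k] == rec_b[k] for k in keys)

def probabilistic_score(rec_a, rec_b, weights):
    """Sum (hypothetical) agreement weights over fields, including the
    extended identifier: another household member's first name."""
    return sum(w for field, w in weights.items() if rec_a[field] == rec_b[field])

WEIGHTS = {"first_name": 4.0, "sex": 1.0, "birth_year": 3.0, "household_member": 5.0}
THRESHOLD = 8.0  # hypothetical acceptance threshold

def link(rec_a, rec_b):
    """Hybrid rule: deterministic first, probabilistic on the remainder."""
    if deterministic_pass(rec_a, rec_b):
        return True
    return probabilistic_score(rec_a, rec_b, WEIGHTS) >= THRESHOLD

hdss   = {"first_name": "Thabo", "sex": "M", "birth_year": 1980, "household_member": "Lindiwe"}
clinic = {"first_name": "Tabo",  "sex": "M", "birth_year": 1980, "household_member": "Lindiwe"}

# The misspelt first name fails the deterministic pass, but agreement on the
# remaining fields scores 1.0 + 3.0 + 5.0 = 9.0, which clears the threshold.
matched = link(hdss, clinic)
```

The extended identifier earns its keep exactly in cases like this one, where conventional identifiers are noisy but household context disambiguates the pair.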
Keywords
Agincourt
Cause of death composition
Epidemiological Transition
Health and Demographic Surveillance System (HDSS)
Household assets
HIV/AIDS
Index of Inequality
InterVA
Mortality
Non-communicable Diseases
Population Surveillance
Record linkage
Rural
Socioeconomic Status
South Africa
Verbal Autopsy
Wealth Index
Synthetic sequence generator for recommender systems - memory biased random walk on sequence multilayer network
Personalized recommender systems rely on each user's personal usage data in
the system, in order to assist in decision making. However, privacy policies
protecting users' rights prevent these highly personal data from being publicly
available to a wider researcher audience. In this work, we propose a memory
biased random walk model on multilayer sequence network, as a generator of
synthetic sequential data for recommender systems. We demonstrate the
applicability of the synthetic data in training recommender system models for
cases when privacy policies restrict clickstream publishing. Comment: The new updated version of the paper.
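A minimal sketch of a memory biased random walk as a synthetic-sequence generator follows. This is not the paper's multilayer model; it is a single-layer simplification under stated assumptions (first-order transition counts from observed sequences, and a multiplicative bias toward items the walker has already visited), with all data hypothetical:

```python
import random

def build_transitions(sequences):
    """First-order transition counts from observed clickstream sequences
    (one layer of a sequence network)."""
    trans = {}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts = trans.setdefault(a, {})
            counts[b] = counts.get(b, 0) + 1
    return trans

def memory_biased_walk(trans, start, length, memory_bias=2.0, rng=None):
    """Generate a synthetic sequence: candidate next items are weighted by
    their transition count, multiplied by `memory_bias` if already visited
    (the 'memory' that makes revisits more likely, as in real clickstreams)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    walk, visited = [start], {start}
    for _ in range(length - 1):
        options = trans.get(walk[-1])
        if not options:
            break  # dead end: no observed outgoing transition
        items = list(options)
        weights = [options[i] * (memory_bias if i in visited else 1.0)
                   for i in items]
        nxt = rng.choices(items, weights=weights)[0]
        walk.append(nxt)
        visited.add(nxt)
    return walk

clicks = [["a", "b", "c", "a"], ["a", "c", "b"], ["b", "a", "c"]]
synthetic = memory_biased_walk(build_transitions(clicks), "a", 6)
```

Generated walks reproduce the transition statistics and revisit tendency of the observed sequences without exposing any individual user's clickstream, which is the privacy point the abstract makes.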