25 research outputs found
Graph-based Household Matching for Linking Census Data
Historical censuses record facts about the individuals in a community and thus provide knowledge about a nation's population. These data allow researchers to reconstruct features of a specific period and to trace ancestors and changes in families over time. Linking census data is a difficult task because of common names, poor data quality, and household changes over time: across decades, a household may split into multiple households due to marriage, or members may move to another household. This paper proposes a graph-based approach to linking households that takes the relationships between household members into account. Using individual record-linkage results, the proposed method builds household graphs, so that matches are determined by both attribute similarity and record-relationship similarity. Experimental results show that the proposed method reaches an F-score of 0.974 on Ireland census data, outperforming all alternative methods compared.
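The core idea of combining member-level attribute similarity with household-level relationship similarity can be sketched as follows. This is an illustrative simplification, not the paper's actual algorithm: the name-similarity measure, the 0.8 matching threshold, and the equal weights are all assumptions made here for demonstration.

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Attribute similarity between two person names, in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def household_sim(h1, h2, w_attr=0.5, w_rel=0.5):
    """Score a pair of households from two census snapshots.

    h1, h2: dicts mapping person name -> set of (relation, other_name)
    edges, e.g. {"Mary": {("spouse", "John")}, ...}.
    """
    # Greedily match members one-to-one by name similarity.
    pairs, used = [], set()
    for p1 in h1:
        best = max((q for q in h2 if q not in used),
                   key=lambda q: name_sim(p1, q), default=None)
        if best is not None and name_sim(p1, best) > 0.8:
            pairs.append((p1, best))
            used.add(best)
    if not pairs:
        return 0.0
    attr = sum(name_sim(a, b) for a, b in pairs) / len(pairs)
    # Relationship similarity: how well the relation edges of matched
    # members agree once names are translated through the matching.
    mapping = dict(pairs)
    agree = 0.0
    for a, b in pairs:
        edges1 = {(rel, mapping.get(n, n)) for rel, n in h1[a]}
        agree += len(edges1 & h2[b]) / max(len(edges1 | h2[b]), 1)
    rel = agree / len(pairs)
    return w_attr * attr + w_rel * rel
```

Two identical households score 1.0, while households with no plausible member match score 0.0; candidate pairs above a decision threshold would then be declared household links.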
Query-Aware Determinization of Uncertain Objects
This paper considers the problem of determinizing probabilistic data so that such data can be stored in legacy systems that accept only deterministic input. Probabilistic data may be produced by automated data analysis and enrichment methods such as entity resolution, information extraction, and speech processing, while the legacy system may correspond to pre-existing web applications such as Flickr, Picasa, etc. The goal is to generate a deterministic representation of probabilistic data that maximizes the quality of the end-application built on deterministic data. We explore this determinization problem in the context of two different data processing tasks: triggers and selection queries. We show that approaches typically used for determinization, such as thresholding or top-1 selection, lead to suboptimal performance for such applications. A better approach is to design customized determinization techniques that choose a determinized representation which maximizes the quality of the end-application. We develop such a query-aware strategy and demonstrate its advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic datasets.
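The two baseline determinization strategies the abstract criticizes are easy to state concretely. The sketch below is illustrative only (the tag data and the 0.3 threshold are assumptions made here, not values from the paper):

```python
def determinize_top1(options):
    """Keep only the single most probable value (top-1 selection)."""
    best_value, _ = max(options, key=lambda vp: vp[1])
    return [best_value]

def determinize_threshold(options, tau=0.3):
    """Keep every value whose probability reaches the threshold tau."""
    return [v for v, p in options if p >= tau]

# A probabilistic annotation, e.g. tags inferred for a photo:
tags = [("sky", 0.50), ("sea", 0.35), ("car", 0.15)]
```

Both strategies fix their answer without looking at how the data will be used; the paper's query-aware approach instead selects the deterministic representation that maximizes the expected quality (e.g. expected F-measure) of the anticipated triggers or selection queries.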
Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts are based on
batch-oriented inference, which inhibits a real-time workflow. One important task is
entity resolution (ER). ER is the process of determining records (mentions) in
a database that correspond to the same real-world entity. Traditional pairwise
ER methods can lead to inconsistencies and low accuracy due to localized
decisions. Leading ER systems solve this problem by collectively resolving all
records using a probabilistic graphical model and Markov chain Monte Carlo
(MCMC) inference. However, for large datasets this is an extremely expensive
process. One key observation is that such an exhaustive ER process incurs a huge
up-front cost, which is wasteful in practice because most users are interested
in only a small subset of entities. In this paper, we advocate pay-as-you-go
entity resolution by developing a number of query-driven collective ER
techniques. We introduce two classes of SQL queries that involve ER operators
--- selection-driven ER and join-driven ER. We implement novel variations of
the MCMC Metropolis Hastings algorithm to generate biased samples and
selectivity-based scheduling algorithms to support the two classes of ER
queries. Finally, we show that query-driven ER algorithms can converge and
return results within minutes over a database populated with the extraction
from a newswire dataset containing 71 million mentions.
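The biased-sampling idea behind query-driven collective ER can be illustrated with a toy Metropolis-Hastings sampler over entity assignments. This is a minimal sketch under assumptions made here (the pairwise affinity model, the proposal distribution, and returning the best-seen state as a cheap MAP estimate), not the paper's actual sampler:

```python
import math
import random

def mh_entity_sampling(n, sim, steps=2000, query=None, seed=1):
    """Toy Metropolis-Hastings over entity assignments.

    n: number of mentions; sim(i, j): log-space affinity between
    mentions i and j (positive = likely coreferent).
    State: an entity label per mention; the target density is
    proportional to exp(sum of sim over co-assigned pairs).
    Proposal: re-assign one mention, drawn from the query-relevant
    subset if given (the "biased sampling" idea), to another
    mention's entity. Returns the best-scoring state seen.
    """
    rng = random.Random(seed)
    ent = list(range(n))                  # start: all singletons

    def score(assign):
        return sum(sim(i, j) for i in range(n) for j in range(i + 1, n)
                   if assign[i] == assign[j])

    cur = score(ent)
    best, best_score = ent[:], cur
    focus = list(query) if query else list(range(n))
    for _ in range(steps):
        i = rng.choice(focus)             # query-driven bias
        cand = ent[:]
        cand[i] = ent[rng.randrange(n)]   # move i into another entity
        new = score(cand)
        # Metropolis-Hastings accept rule: always accept uphill moves,
        # accept downhill moves with probability exp(new - cur).
        if new >= cur or rng.random() < math.exp(new - cur):
            ent, cur = cand, new
            if cur > best_score:
                best, best_score = ent[:], cur
    return best
```

Restricting the proposal to the query-relevant subset concentrates sampling effort on the entities a selection or join actually touches, which is the "pay-as-you-go" intuition: mentions irrelevant to the query are rarely re-examined.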
A hierarchical Bayesian approach to record linkage and population size problems
We propose and illustrate a hierarchical Bayesian approach for matching
statistical records observed on different occasions. We show how this model can
be profitably adopted both in record linkage problems and in capture--recapture
setups, where the size of a finite population is the real object of interest.
There are at least two important differences between the proposed model-based
approach and the current practice in record linkage. First, the statistical
model is built up on the actually observed categorical variables and no
reduction (to 0--1 comparisons) of the available information takes place.
Second, the hierarchical structure of the model allows a two-way propagation of
the uncertainty between the parameter estimation step and the matching
procedure so that no plug-in estimates are used and the correct uncertainty is
accounted for both in estimating the population size and in performing the
record linkage. We illustrate and motivate our proposal through a real data
example and simulations.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS447 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
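As a point of contrast (not part of the paper's hierarchical model), the classical plug-in treatment that the abstract argues against can be written down directly. With $n_1$ and $n_2$ records captured on the two occasions and $m$ pairs declared as links by a separate record-linkage step, the Lincoln-Petersen estimator of the population size is

```latex
\hat{N} = \frac{n_1 \, n_2}{m}
```

Plugging in a point estimate of $m$ ignores the uncertainty of the linkage step; the hierarchical Bayesian model instead treats the number of matches as an unknown quantity, so linkage uncertainty propagates into the posterior for the population size.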
Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach
Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim of achieving large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, however based on a different type of document: structured database records that, for example, contain personal information such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower compared to when standard blocking is used, and thus more work is required.
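A similarity-aware inverted index of this kind can be sketched as follows. This is an illustrative simplification under assumptions made here (indexing bigrams of a single name field and ranking candidates by Jaccard similarity), not one of the paper's two actual index variations:

```python
from collections import defaultdict

def qgrams(s, q=2):
    s = f"#{s.lower()}#"        # pad so string ends contribute grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

class InvertedIndexER:
    """Inverted index over name q-grams for (near) real-time matching.

    Insert database records once; at query time, collect candidates
    that share at least one q-gram with the query record and rank
    them by Jaccard similarity computed from the stored gram sets.
    """
    def __init__(self):
        self.index = defaultdict(set)    # q-gram -> record ids
        self.grams = {}                  # record id -> its q-gram set

    def insert(self, rec_id, name):
        g = qgrams(name)
        self.grams[rec_id] = g
        for gram in g:
            self.index[gram].add(rec_id)

    def query(self, name, top_k=5):
        g = qgrams(name)
        cands = set()
        for gram in g:                   # only buckets sharing a gram
            cands |= self.index.get(gram, set())
        scored = [(len(g & self.grams[c]) / len(g | self.grams[c]), c)
                  for c in cands]
        return [c for _, c in sorted(scored, reverse=True)[:top_k]]
```

Because a query record only touches the index buckets for its own q-grams, matching time depends on bucket sizes rather than on the full database size, which is what makes the approach attractive for a stream of query records.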
An efficient record linkage scheme using graphical analysis for identifier error detection
Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.
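One way to apply graphical analysis to identifier errors is to build a graph whose edges record which identifier field links each pair of records, so that conflicting links stand out. The sketch below is a hypothetical illustration of that idea (the field names `nhs` and `dob` and the conflict rule are assumptions made here, not the scheme from the paper):

```python
from collections import defaultdict

def identifier_graph(records):
    """Link records that share any identifier value.

    records: dict rec_id -> dict of identifier fields, e.g.
    {"r1": {"nhs": "123", "dob": "1970-01-01"}, ...}.
    Returns adjacency sets of (neighbour, field) pairs, so each edge
    remembers which identifier produced it.
    """
    by_value = defaultdict(set)
    for rid, fields in records.items():
        for field, value in fields.items():
            if value:                        # skip missing identifiers
                by_value[(field, value)].add(rid)
    adj = defaultdict(set)
    for (field, _), rids in by_value.items():
        for a in rids:
            for b in rids:
                if a != b:
                    adj[a].add((b, field))
    return adj

def suspicious(adj, records, key="nhs", check="dob"):
    # Pairs linked on `key` but disagreeing on `check`: candidate
    # identifier errors to flag for review.
    return sorted({tuple(sorted((a, b)))
                   for a, edges in adj.items() for b, f in edges
                   if f == key and records[a][check] != records[b][check]})
```

Two records sharing an NHS number but disagreeing on date of birth then surface as a suspicious pair, while records agreeing on both fields link cleanly.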
Management of Inconsistencies in Data Integration
Data integration aims at providing a unified view over data coming from various sources. One of the most challenging tasks for data integration is handling the inconsistencies that appear in the integrated data in an efficient and effective manner. In this chapter, we provide a survey of techniques introduced for handling inconsistencies in data integration, focusing on two groups. The first group contains techniques for computing consistent query answers, and includes mechanisms for the compact representation of repairs, query rewriting, and logic programs. The second group contains techniques focusing on the resolution of inconsistencies. This includes methodologies for computing similarity between atomic values as well as similarity between groups of data, collective techniques, scaling to large datasets, and dealing with uncertainty that is related to inconsistencies.
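The notion of consistent query answers over repairs can be made concrete with a small sketch: a tuple is a consistent answer if the query returns it in every repair of the inconsistent table. This brute-force enumeration is for illustration only (real systems use query rewriting or compact repair representations, as the survey discusses), and it assumes the simplest setting of key violations repaired by tuple deletion:

```python
from itertools import product

def repairs(rows, key):
    """Enumerate all repairs of a key-violating table: keep exactly
    one row per key value (minimal repairs under tuple deletion)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return [list(choice) for choice in product(*groups.values())]

def consistent_answers(rows, key, query):
    """A tuple is a consistent answer iff the query returns it
    in every repair of the inconsistent table."""
    results = [set(map(tuple, query(r))) for r in repairs(rows, key)]
    return set.intersection(*results) if results else set()
```

For example, if "alice" appears with two conflicting cities, any answer depending on alice's city is dropped, while answers supported by every repair survive.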