25 research outputs found
Graph-based Household Matching for Linking Census Data
Historical censuses record facts about the individuals in a community and thus provide knowledge about a nation's population. These data allow researchers to reconstruct features of a specific period and to trace ancestors and changes in families over time. Linking census data is a difficult task because of common names, poor data quality, and household changes over time: across decades, a household may split into multiple households due to marriage, or members may move to another household. This paper proposes a graph-based approach to linking households that takes the relationships between household members into account. Using individual record-linkage results, the proposed method builds household graphs, so that matches are determined by both attribute similarity and record-relationship similarity. Experimental results show that the proposed method reaches an F-score of 0.974 on Ireland census data, outperforming all alternative methods compared.
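The core idea of combining member-level attribute similarity with household-level relationship similarity can be sketched as follows. This is an illustrative simplification, not the paper's actual algorithm: the name-similarity measure, the 0.8 matching threshold, and the equal weights are all assumptions made here for demonstration.

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Attribute similarity between two person names, in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def household_sim(h1, h2, w_attr=0.5, w_rel=0.5):
    """Score a pair of households from two census snapshots.

    h1, h2: dicts mapping person name -> set of (relation, other_name)
    edges, e.g. {"Mary": {("spouse", "John")}, ...}.
    """
    # Greedily match members one-to-one by name similarity.
    pairs, used = [], set()
    for p1 in h1:
        best = max((q for q in h2 if q not in used),
                   key=lambda q: name_sim(p1, q), default=None)
        if best is not None and name_sim(p1, best) > 0.8:
            pairs.append((p1, best))
            used.add(best)
    if not pairs:
        return 0.0
    attr = sum(name_sim(a, b) for a, b in pairs) / len(pairs)
    # Relationship similarity: how well the relation edges of matched
    # members agree once names are translated through the matching.
    mapping = dict(pairs)
    agree = 0.0
    for a, b in pairs:
        edges1 = {(rel, mapping.get(n, n)) for rel, n in h1[a]}
        agree += len(edges1 & h2[b]) / max(len(edges1 | h2[b]), 1)
    rel = agree / len(pairs)
    return w_attr * attr + w_rel * rel
```

Two identical households score 1.0, while households with no plausible member match score 0.0; candidate pairs above a decision threshold would then be declared household links.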
Query-Aware Determinization of Uncertain Objects
This paper considers the problem of determinizing probabilistic data so that such data can be stored in legacy systems that accept only deterministic input. Probabilistic data may be produced by automated data analysis and enrichment methods such as entity resolution, information extraction, and speech processing, while the legacy system may correspond to pre-existing web applications such as Flickr, Picasa, etc. The goal is to generate a deterministic representation of probabilistic data that maximizes the quality of the end-application built on deterministic data. We explore this determinization problem in the context of two different data processing tasks: triggers and selection queries. We show that approaches typically used for determinization, such as thresholding or top-1 selection, lead to suboptimal performance for such applications. A better approach is to design customized determinization techniques that choose a determinized representation which maximizes the quality of the end-application. We develop such a query-aware strategy and demonstrate its advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic datasets.
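The two baseline determinization strategies the abstract criticizes are easy to state concretely. The sketch below is illustrative only (the tag data and the 0.3 threshold are assumptions made here, not values from the paper):

```python
def determinize_top1(options):
    """Keep only the single most probable value (top-1 selection)."""
    best_value, _ = max(options, key=lambda vp: vp[1])
    return [best_value]

def determinize_threshold(options, tau=0.3):
    """Keep every value whose probability reaches the threshold tau."""
    return [v for v, p in options if p >= tau]

# A probabilistic annotation, e.g. tags inferred for a photo:
tags = [("sky", 0.50), ("sea", 0.35), ("car", 0.15)]
```

Both strategies fix their answer without looking at how the data will be used; the paper's query-aware approach instead selects the deterministic representation that maximizes the expected quality (e.g. expected F-measure) of the anticipated triggers or selection queries.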
Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts are based on
batch-oriented inference, which inhibits a real-time workflow. One important task is
entity resolution (ER). ER is the process of determining records (mentions) in
a database that correspond to the same real-world entity. Traditional pairwise
ER methods can lead to inconsistencies and low accuracy due to localized
decisions. Leading ER systems solve this problem by collectively resolving all
records using a probabilistic graphical model and Markov chain Monte Carlo
(MCMC) inference. However, for large datasets this is an extremely expensive
process. One key observation is that such an exhaustive ER process incurs a huge
up-front cost, which is wasteful in practice because most users are interested
in only a small subset of entities. In this paper, we advocate pay-as-you-go
entity resolution by developing a number of query-driven collective ER
techniques. We introduce two classes of SQL queries that involve ER operators
--- selection-driven ER and join-driven ER. We implement novel variations of
the MCMC Metropolis Hastings algorithm to generate biased samples and
selectivity-based scheduling algorithms to support the two classes of ER
queries. Finally, we show that query-driven ER algorithms can converge and
return results within minutes over a database populated with the extraction
from a newswire dataset containing 71 million mentions.
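The biased-sampling idea behind query-driven collective ER can be illustrated with a toy Metropolis-Hastings sampler over entity assignments. This is a minimal sketch under assumptions made here (the pairwise affinity model, the proposal distribution, and returning the best-seen state as a cheap MAP estimate), not the paper's actual sampler:

```python
import math
import random

def mh_entity_sampling(n, sim, steps=2000, query=None, seed=1):
    """Toy Metropolis-Hastings over entity assignments.

    n: number of mentions; sim(i, j): log-space affinity between
    mentions i and j (positive = likely coreferent).
    State: an entity label per mention; the target density is
    proportional to exp(sum of sim over co-assigned pairs).
    Proposal: re-assign one mention, drawn from the query-relevant
    subset if given (the "biased sampling" idea), to another
    mention's entity. Returns the best-scoring state seen.
    """
    rng = random.Random(seed)
    ent = list(range(n))                  # start: all singletons

    def score(assign):
        return sum(sim(i, j) for i in range(n) for j in range(i + 1, n)
                   if assign[i] == assign[j])

    cur = score(ent)
    best, best_score = ent[:], cur
    focus = list(query) if query else list(range(n))
    for _ in range(steps):
        i = rng.choice(focus)             # query-driven bias
        cand = ent[:]
        cand[i] = ent[rng.randrange(n)]   # move i into another entity
        new = score(cand)
        # Metropolis-Hastings accept rule: always accept uphill moves,
        # accept downhill moves with probability exp(new - cur).
        if new >= cur or rng.random() < math.exp(new - cur):
            ent, cur = cand, new
            if cur > best_score:
                best, best_score = ent[:], cur
    return best
```

Restricting the proposal to the query-relevant subset concentrates sampling effort on the entities a selection or join actually touches, which is the "pay-as-you-go" intuition: mentions irrelevant to the query are rarely re-examined.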
A hierarchical Bayesian approach to record linkage and population size problems
We propose and illustrate a hierarchical Bayesian approach for matching
statistical records observed on different occasions. We show how this model can
be profitably adopted both in record linkage problems and in capture--recapture
setups, where the size of a finite population is the real object of interest.
There are at least two important differences between the proposed model-based
approach and the current practice in record linkage. First, the statistical
model is built up on the actually observed categorical variables and no
reduction (to 0--1 comparisons) of the available information takes place.
Second, the hierarchical structure of the model allows a two-way propagation of
the uncertainty between the parameter estimation step and the matching
procedure so that no plug-in estimates are used and the correct uncertainty is
accounted for both in estimating the population size and in performing the
record linkage. We illustrate and motivate our proposal through a real data
example and simulations.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS447 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
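As a point of contrast (not part of the paper's hierarchical model), the classical plug-in treatment that the abstract argues against can be written down directly. With $n_1$ and $n_2$ records captured on the two occasions and $m$ pairs declared as links by a separate record-linkage step, the Lincoln-Petersen estimator of the population size is

```latex
\hat{N} = \frac{n_1 \, n_2}{m}
```

Plugging in a point estimate of $m$ ignores the uncertainty of the linkage step; the hierarchical Bayesian model instead treats the number of matches as an unknown quantity, so linkage uncertainty propagates into the posterior for the population size.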
Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach
Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim of achieving large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, however based on a different type of document: structured database records that, for example, contain personal information such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower compared to when standard blocking is used, and thus more work is required.
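A similarity-aware inverted index of this kind can be sketched as follows. This is an illustrative simplification under assumptions made here (indexing bigrams of a single name field and ranking candidates by Jaccard similarity), not one of the paper's two actual index variations:

```python
from collections import defaultdict

def qgrams(s, q=2):
    s = f"#{s.lower()}#"        # pad so string ends contribute grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

class InvertedIndexER:
    """Inverted index over name q-grams for (near) real-time matching.

    Insert database records once; at query time, collect candidates
    that share at least one q-gram with the query record and rank
    them by Jaccard similarity computed from the stored gram sets.
    """
    def __init__(self):
        self.index = defaultdict(set)    # q-gram -> record ids
        self.grams = {}                  # record id -> its q-gram set

    def insert(self, rec_id, name):
        g = qgrams(name)
        self.grams[rec_id] = g
        for gram in g:
            self.index[gram].add(rec_id)

    def query(self, name, top_k=5):
        g = qgrams(name)
        cands = set()
        for gram in g:                   # only buckets sharing a gram
            cands |= self.index.get(gram, set())
        scored = [(len(g & self.grams[c]) / len(g | self.grams[c]), c)
                  for c in cands]
        return [c for _, c in sorted(scored, reverse=True)[:top_k]]
```

Because a query record only touches the index buckets for its own q-grams, matching time depends on bucket sizes rather than on the full database size, which is what makes the approach attractive for a stream of query records.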
An efficient record linkage scheme using graphical analysis for identifier error detection
Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.
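One way to apply graphical analysis to identifier errors is to build a graph whose edges record which identifier field links each pair of records, so that conflicting links stand out. The sketch below is a hypothetical illustration of that idea (the field names `nhs` and `dob` and the conflict rule are assumptions made here, not the scheme from the paper):

```python
from collections import defaultdict

def identifier_graph(records):
    """Link records that share any identifier value.

    records: dict rec_id -> dict of identifier fields, e.g.
    {"r1": {"nhs": "123", "dob": "1970-01-01"}, ...}.
    Returns adjacency sets of (neighbour, field) pairs, so each edge
    remembers which identifier produced it.
    """
    by_value = defaultdict(set)
    for rid, fields in records.items():
        for field, value in fields.items():
            if value:                        # skip missing identifiers
                by_value[(field, value)].add(rid)
    adj = defaultdict(set)
    for (field, _), rids in by_value.items():
        for a in rids:
            for b in rids:
                if a != b:
                    adj[a].add((b, field))
    return adj

def suspicious(adj, records, key="nhs", check="dob"):
    # Pairs linked on `key` but disagreeing on `check`: candidate
    # identifier errors to flag for review.
    return sorted({tuple(sorted((a, b)))
                   for a, edges in adj.items() for b, f in edges
                   if f == key and records[a][check] != records[b][check]})
```

Two records sharing an NHS number but disagreeing on date of birth then surface as a suspicious pair, while records agreeing on both fields link cleanly.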
Management of Inconsistencies in Data Integration
Data integration aims at providing a unified view over data coming from various sources. One of the most challenging tasks for data integration is handling the inconsistencies that appear in the integrated data in an efficient and effective manner. In this chapter, we provide a survey of techniques introduced for handling inconsistencies in data integration, focusing on two groups. The first group contains techniques for computing consistent query answers, and includes mechanisms for the compact representation of repairs, query rewriting, and logic programs. The second group contains techniques focusing on the resolution of inconsistencies. This includes methodologies for computing similarity between atomic values as well as similarity between groups of data, collective techniques, scaling to large datasets, and dealing with uncertainty that is related to inconsistencies.
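The notion of consistent query answers over repairs can be made concrete with a small sketch: a tuple is a consistent answer if the query returns it in every repair of the inconsistent table. This brute-force enumeration is for illustration only (real systems use query rewriting or compact repair representations, as the survey discusses), and it assumes the simplest setting of key violations repaired by tuple deletion:

```python
from itertools import product

def repairs(rows, key):
    """Enumerate all repairs of a key-violating table: keep exactly
    one row per key value (minimal repairs under tuple deletion)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return [list(choice) for choice in product(*groups.values())]

def consistent_answers(rows, key, query):
    """A tuple is a consistent answer iff the query returns it
    in every repair of the inconsistent table."""
    results = [set(map(tuple, query(r))) for r in repairs(rows, key)]
    return set.intersection(*results) if results else set()
```

For example, if "alice" appears with two conflicting cities, any answer depending on alice's city is dropped, while answers supported by every repair survive.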