25 research outputs found

    Graph-based Household Matching for Linking Census Data

    Get PDF
    Historical censuses record individual facts about a community and provide knowledge about a nation's population. These data allow the characteristics of a specific period to be reconstructed, so that people can trace their ancestors and how families changed over time. Linking census data is a difficult task because of common names, poor data quality, and household changes over time: across decades, a household may split into multiple households due to marriage, or members may move to another household. This paper proposes a graph-based approach to linking households that takes the relationships between household members into account. Using individual record-linkage results, the proposed method builds household graphs, so that matches are determined by both attribute similarity and relationship similarity between records. According to the experimental results, the proposed method reaches an F-score of 0.974 on Ireland census data, outperforming all alternative methods compared
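    A minimal sketch of how attribute similarity and relationship similarity between household members might be combined into a household match score, in the spirit of the approach described above. The data layout, the Jaccard comparison of role labels, and the weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
import itertools

def household_similarity(hh_a, hh_b, record_sim, alpha=0.5):
    """Toy household match score combining attribute and relationship similarity.

    hh_a, hh_b : dicts mapping member id -> set of relationship labels
                 (e.g. {"head"}, {"spouse"}, {"child"})
    record_sim : function (id_a, id_b) -> attribute similarity in [0, 1],
                 e.g. taken from an individual record-linkage step
    alpha      : illustrative weight between the attribute and relationship parts
    """
    if not hh_a or not hh_b:
        return 0.0
    best = {}
    for a, b in itertools.product(hh_a, hh_b):
        attr = record_sim(a, b)
        # relationship similarity: Jaccard overlap of role labels
        rel = len(hh_a[a] & hh_b[b]) / max(len(hh_a[a] | hh_b[b]), 1)
        score = alpha * attr + (1 - alpha) * rel
        best[a] = max(best.get(a, 0.0), score)
    # average each member's best match, tolerating households that split or shrink
    return sum(best.values()) / len(best)
```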

    Query-Aware Determinization of Uncertain Objects

    Get PDF
    Probabilistic data may be produced by automated data analysis and enrichment techniques such as entity resolution, information extraction, and speech processing. This paper considers the problem of determinizing probabilistic data so that such data can be stored in legacy systems that accept only deterministic input; the legacy system might be a pre-existing web application such as Flickr or Picasa. The goal is to generate a deterministic representation of probabilistic data that maximizes the quality of the end application built on deterministic data. We explore this determinization problem for two different data-processing tasks: triggers and selection queries. We show that approaches commonly used for determinization, such as thresholding or top-1 selection, typically lead to suboptimal performance for such applications. A better approach is to design customized determinization techniques that choose a determinized representation which maximizes end-application quality. We therefore develop a query-aware strategy and demonstrate its advantages over existing solutions through a comprehensive empirical evaluation on real and synthetic datasets
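    The abstract contrasts standard determinization heuristics (thresholding and top-1 selection) with a query-aware strategy. Below is a minimal sketch of the two baseline heuristics over a hypothetical probabilistic tag set; the tag names, probabilities, and threshold value are illustrative and not taken from the paper.

```python
def determinize_top1(tag_probs):
    """Keep only the single most probable tag (the top-1 baseline)."""
    return {max(tag_probs, key=tag_probs.get)} if tag_probs else set()

def determinize_threshold(tag_probs, tau=0.5):
    """Keep every tag whose probability reaches the fixed threshold tau."""
    return {t for t, p in tag_probs.items() if p >= tau}

# A probabilistic annotation as it might come out of an extraction/enrichment step.
tags = {"beach": 0.7, "sunset": 0.4, "party": 0.1}
print(determinize_top1(tags))                # {'beach'}
print(determinize_threshold(tags, tau=0.3))  # {'beach', 'sunset'}
```

    A query-aware method would instead pick the deterministic representation that scores best for the triggers or selection queries actually asked of the legacy system, which is the gap these baselines leave open.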

    Query-Driven Sampling for Collective Entity Resolution

    Full text link
    Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeling. Many of these efforts are based on batch-oriented inference, which inhibits a real-time workflow. One important task is entity resolution (ER). ER is the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy due to localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. However, for large datasets this is an extremely expensive process. One key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven ER and join-driven ER. We implement novel variations of the MCMC Metropolis-Hastings algorithm to generate biased samples, and selectivity-based scheduling algorithms to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with the extraction from a newswire dataset containing 71 million mentions
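    A toy sketch of a single Metropolis-Hastings move over a mention-to-entity assignment, with proposals biased toward mentions that matter to the current query, loosely mirroring the biased-sampling idea in the abstract. The scoring function, the bias parameter, and the omission of the Hastings correction for the asymmetric proposal are all simplifying assumptions made here, not the paper's algorithm.

```python
import math
import random

def mh_step(assignment, score, query_mentions, bias=0.8):
    """One simplified Metropolis-Hastings move over a mention-to-entity assignment.

    assignment     : dict mention -> integer entity id
    score          : function(assignment) -> log-probability / model score
    query_mentions : mentions relevant to the current ER query; proposals favour them
    Note: the Hastings correction for the biased proposal is omitted in this sketch.
    """
    mentions = list(assignment)
    pool = list(query_mentions) if query_mentions and random.random() < bias else mentions
    m = random.choice(pool)
    entities = sorted(set(assignment.values()))
    proposal = dict(assignment)
    # move the chosen mention to an existing entity or start a new one
    proposal[m] = random.choice(entities + [max(entities) + 1])
    # accept with probability min(1, exp(new - old)) on the log scores
    if math.log(random.random() + 1e-300) < score(proposal) - score(assignment):
        return proposal
    return assignment
```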

    A hierarchical Bayesian approach to record linkage and population size problems

    Full text link
    We propose and illustrate a hierarchical Bayesian approach for matching statistical records observed on different occasions. We show how this model can be profitably adopted both in record linkage problems and in capture--recapture setups, where the size of a finite population is the real object of interest. There are at least two important differences between the proposed model-based approach and the current practice in record linkage. First, the statistical model is built up on the actually observed categorical variables and no reduction (to 0--1 comparisons) of the available information takes place. Second, the hierarchical structure of the model allows a two-way propagation of the uncertainty between the parameter estimation step and the matching procedure so that no plug-in estimates are used and the correct uncertainty is accounted for both in estimating the population size and in performing the record linkage. We illustrate and motivate our proposal through a real data example and simulations. Comment: Published at http://dx.doi.org/10.1214/10-AOAS447 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
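    To fix ideas, one common way to write the measurement model in Bayesian record linkage is a hit-miss form, in which each observed categorical field is either the latent true value or a random draw from the field's category distribution. The notation below is a generic sketch of that idea and may differ from the exact specification in the paper.

```latex
% Generic hit-miss sketch: y_{ij} is the observed value of categorical field j for
% record i, \tilde{y}_{ij} the latent true value, \beta_j the distortion probability,
% and \theta_j the field's category distribution.
y_{ij} \mid \tilde{y}_{ij}, \beta_j \;\sim\;
    (1 - \beta_j)\,\delta_{\tilde{y}_{ij}} \;+\; \beta_j\,\mathrm{Categorical}(\theta_j)
```

    In models of this kind, records that refer to the same population unit share the same latent values, which is how the matching step and the estimation of the population size end up informing each other.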

    Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

    Get PDF
    Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, however based on a different type of documents: structured database records that, for example, contain personal information, such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower compared to when standard blocking is used, and thus more work is required
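    A minimal sketch of an inverted index built over character q-grams of attribute values, which supports the kind of fast approximate candidate retrieval described above. The q-gram encoding, the Dice-style scoring, and all identifiers below are illustrative assumptions; the paper's two index variations are not reproduced here.

```python
from collections import defaultdict

def qgrams(value, q=2):
    """Character q-grams (bigrams by default) of a normalised string."""
    s = value.lower()
    return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}

class QGramIndex:
    """Toy inverted index from q-grams to record ids for real-time approximate matching."""

    def __init__(self):
        self.postings = defaultdict(set)  # q-gram -> record ids containing it
        self.grams = {}                   # record id -> its q-gram set

    def add(self, rec_id, value):
        g = qgrams(value)
        self.grams[rec_id] = g
        for gram in g:
            self.postings[gram].add(rec_id)

    def query(self, value, top_k=10):
        g = qgrams(value)
        counts = defaultdict(int)
        for gram in g:
            for rec_id in self.postings.get(gram, ()):
                counts[rec_id] += 1
        # Dice coefficient from shared q-gram counts as the ranking similarity
        scored = [(2.0 * c / (len(g) + len(self.grams[r])), r) for r, c in counts.items()]
        return sorted(scored, reverse=True)[:top_k]

# Example: index two name values and retrieve a ranked candidate list for a query record.
idx = QGramIndex()
idx.add(1, "christen")
idx.add(2, "christensen")
print(idx.query("kristen"))
```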

    An efficient record linkage scheme using graphical analysis for identifier error detection

    Get PDF
    Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone
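    A small sketch of the graph idea hinted at in the abstract: treat records as nodes, connect records that share an identifier value, and flag connected components that grow implausibly large as likely identifier errors. The field names, the union-find implementation, and the size threshold are assumptions made for illustration, not the scheme from the paper.

```python
from collections import defaultdict

def link_by_identifiers(records, id_fields=("nhs_number", "name_dob_key"), max_size=10):
    """Group records sharing any identifier value and flag suspiciously large groups.

    records   : list of dicts; identifiers may be missing (None) or erroneous
    id_fields : identifier attributes used for linkage (names are illustrative)
    max_size  : components larger than this are flagged for identifier-error review
    """
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}
    for i, rec in enumerate(records):
        for field in id_fields:
            value = rec.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    components = defaultdict(list)
    for i in range(len(records)):
        components[find(i)].append(i)
    flagged = [c for c in components.values() if len(c) > max_size]
    return list(components.values()), flagged
```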

    Personalized Biomedical Data Integration

    Get PDF

    Management of Inconsistencies in Data Integration

    Get PDF
    Data integration aims at providing a unified view over data coming from various sources. One of the most challenging tasks for data integration is handling the inconsistencies that appear in the integrated data in an efficient and effective manner. In this chapter, we provide a survey on techniques introduced for handling inconsistencies in data integration, focusing on two groups. The first group contains techniques for computing consistent query answers, and includes mechanisms for the compact representation of repairs, query rewriting, and logic programs. The second group contains techniques focusing on the resolution of inconsistencies. This includes methodologies for computing similarity between atomic values as well as similarity between groups of data, collective techniques, scaling to large datasets, and dealing with uncertainty that is related to inconsistencies
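    As a concrete illustration of the consistent-query-answer idea mentioned above: under a primary-key constraint, a repair keeps exactly one tuple per key value, and a tuple is a consistent answer only if it appears in every repair. The relation, predicate, and helper names below are invented for the example.

```python
from itertools import product

def repairs(tuples, key_index=0):
    """All repairs of a relation violating a primary-key constraint on column key_index:
    each repair keeps exactly one tuple per key value."""
    groups = {}
    for t in tuples:
        groups.setdefault(t[key_index], []).append(t)
    return [set(choice) for choice in product(*groups.values())]

def consistent_answers(tuples, predicate, key_index=0):
    """Tuples that satisfy the query predicate in every repair."""
    answer_sets = [{t for t in r if predicate(t)} for r in repairs(tuples, key_index)]
    return set.intersection(*answer_sets)

# Two conflicting salary tuples for 'anna' violate the key on the first column.
emp = [("anna", 50000), ("anna", 55000), ("bob", 40000)]
# Only ('bob', 40000) appears in every repair; anna's tuple differs between repairs
# even though both candidate values satisfy the predicate.
print(consistent_answers(emp, lambda t: t[1] >= 40000))  # {('bob', 40000)}
```

    Enumerating all repairs is exponential in general, which is exactly why the techniques surveyed in the first group, such as compact repair representations, query rewriting, and logic programs, avoid materialising them explicitly.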