
    Coreference detection of low quality objects

    Record linkage is a widely studied problem that aims to identify coreferent (i.e. duplicate) records in a structured data source. As indicated by Winkler, a solution to the record linkage problem is only possible if the error rate is sufficiently low. In other words, in order to successfully deduplicate a database, the objects in the database must be of sufficient quality. However, this assumption does not always hold. This paper investigates how merging several low quality objects into one high quality object can improve the record linkage process. The general idea is illustrated in the context of string comparison, where strings of low quality (i.e. with a high typographical error rate) are merged into a single high quality string by building an n-dimensional Levenshtein distance matrix and computing the optimal alignment between the dirty strings. Results are presented and possible refinements are proposed.
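
    The merging step lends itself to a compact sketch. The following Python snippet (an illustrative assumption, not the paper's implementation) aligns three dirty strings with a 3-dimensional Levenshtein-style dynamic program using sum-of-pairs column costs, then takes a per-column majority vote to form the merged string:

        from itertools import product

        GAP = "-"

        def sp_cost(col):
            # sum-of-pairs cost of one alignment column: every pair of
            # differing symbols (or symbol vs. gap) contributes 1
            return sum(col[x] != col[y]
                       for x in range(len(col))
                       for y in range(x + 1, len(col)))

        def merge3(s1, s2, s3):
            strs = (s1, s2, s3)
            dims = tuple(len(s) + 1 for s in strs)
            dist = {idx: float("inf") for idx in product(*map(range, dims))}
            back = {}
            dist[(0, 0, 0)] = 0
            # the 7 possible moves: consume the next symbol from any
            # non-empty subset of the strings, a gap everywhere else
            moves = [m for m in product((0, 1), repeat=3) if any(m)]
            for idx in product(*map(range, dims)):
                for m in moves:
                    prev = tuple(i - d for i, d in zip(idx, m))
                    if any(p < 0 for p in prev):
                        continue
                    col = tuple(s[p] if d else GAP
                                for s, p, d in zip(strs, prev, m))
                    c = dist[prev] + sp_cost(col)
                    if c < dist[idx]:
                        dist[idx], back[idx] = c, (prev, col)
            # trace back the optimal alignment, then majority-vote each
            # column (gaps ignored, ties broken arbitrarily)
            cols, idx = [], tuple(len(s) for s in strs)
            while idx != (0, 0, 0):
                prev, col = back[idx]
                cols.append(col)
                idx = prev
            merged = []
            for col in reversed(cols):
                chars = [c for c in col if c != GAP]
                merged.append(max(set(chars), key=chars.count))
            return "".join(merged)

        print(merge3("levenshtein", "levenshtain", "lavenshtein"))
        # -> "levenshtein": each single typo is outvoted by the two
        #    copies that are clean at that position

    The full n-dimensional table costs O(prod of string lengths), so this is only practical for merging small groups of short strings; the paper's refinements presumably address exactly that.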

    Data Deduplication with Random Substitutions

    Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time efficient and are thus widely used in large-scale storage systems. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model in which probabilistic substitutions are considered: each symbol in a repeated string is substituted with a given edit probability. Deduplication algorithms in both the fixed-length scheme and the variable-length scheme are studied. The fixed-length deduplication algorithm is shown to be unsuitable for the proposed source model, as it does not take the edit probability into account. Two modifications are proposed and shown to perform within a constant factor of the optimum, given knowledge of the source model parameters. We also study the conventional variable-length deduplication algorithm and show that as the source entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string, leading to high compression ratios.
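
    To see why the fixed-length scheme suffers under probabilistic substitutions, consider a toy Python experiment (chunk length, alphabet, and edit probability are illustrative assumptions, not the paper's parameters): chunk matching is hash-exact, so even a single substituted symbol makes an otherwise repeated chunk unmatchable.

        import hashlib
        import random

        CHUNK = 8  # fixed chunk length in bytes (illustrative)

        def dedup_fixed(data: bytes):
            # fixed-length scheme: cut the stream into equal chunks and
            # store each distinct chunk once, keyed by its hash
            store, refs = {}, []
            for i in range(0, len(data), CHUNK):
                chunk = data[i:i + CHUNK]
                h = hashlib.sha256(chunk).hexdigest()
                store.setdefault(h, chunk)
                refs.append(h)
            return store, refs

        def substitute(data: bytes, edit_prob: float, seed: int) -> bytes:
            # source model: every symbol of a repeat is independently
            # replaced (possibly by itself) with probability edit_prob
            rng = random.Random(seed)
            out = bytearray(data)
            for i in range(len(out)):
                if rng.random() < edit_prob:
                    out[i] = rng.choice(b"abcd")
            return bytes(out)

        block = bytes(random.Random(1).choices(b"abcd", k=64))
        exact = block * 4
        noisy = b"".join(substitute(block, 0.05, seed=s) for s in range(4))

        for name, data in [("exact repeats", exact), ("noisy repeats", noisy)]:
            store, refs = dedup_fixed(data)
            print(f"{name}: {len(store)} of {len(refs)} chunks stored")
        # the exact stream dedups to 8 stored chunks; the noisy stream
        # needs noticeably more, since edited chunks no longer hash-match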

    Quality and complexity measures for data linkage and deduplication

    Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.
    Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures
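
    The chapter's point that pair-space measures can deceive is easy to demonstrate numerically. In the Python sketch below (all counts are invented for illustration), the huge number of true non-match pairs makes accuracy look near-perfect for a classifier whose precision and recall are only 0.8:

        # deduplicating n records means classifying n*(n-1)/2 record
        # pairs, of which only a tiny fraction are true matches
        n = 10_000
        pairs = n * (n - 1) // 2           # 49,995,000 candidate pairs
        true_matches = 3_000               # assumed duplicate pairs

        tp, fp = 2_400, 600                # assumed classifier output
        fn = true_matches - tp
        tn = pairs - tp - fp - fn          # true negatives dominate

        accuracy = (tp + tn) / pairs
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_measure = 2 * precision * recall / (precision + recall)

        print(f"accuracy:  {accuracy:.6f}")   # 0.999976 -- looks perfect
        print(f"precision: {precision:.2f}")  # 0.80
        print(f"recall:    {recall:.2f}")     # 0.80
        print(f"f-measure: {f_measure:.2f}")  # 0.80

    Because the true negatives swamp everything else, any measure that counts them (accuracy, specificity) is nearly useless in the record pair comparison space; precision, recall, and the f-measure expose the actual linkage quality.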

    Record-Linkage from a Technical Point of View

    Record linkage is used for preparing sampling frames, deduplicating lists, and combining information on the same object from two different databases. If the same objects in two different databases carry error-free unique common identifiers, such as personal identification numbers (PIDs), record linkage is a simple file merge operation. If the identifiers contain errors, record linkage is a challenging task. In many applications the files have widely different numbers of observations, for example a few thousand records of a sample survey and a few million records of an administrative database of social security numbers. Available software, privacy issues, and future research topics are discussed.
    Keywords: record linkage, data mining, privacy-preserving protocols
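
    With error-free PIDs, the "simple file merge operation" is just a key join, as the hypothetical Python example below shows (all record values are invented); once identifiers carry errors, the exact join silently drops matches and approximate comparison of names, dates of birth, and so on becomes necessary.

        # two files keyed by an error-free personal identification number
        survey = {
            "P001": {"name": "Ann Smith", "income": 42_000},
            "P002": {"name": "Bob Jones", "income": 35_000},
        }
        admin = {
            "P001": {"name": "Ann Smith", "benefits": True},
            "P003": {"name": "Cara Diaz", "benefits": False},
        }

        # record linkage degenerates to an exact merge on the shared key
        linked = {pid: {**survey[pid], **admin[pid]}
                  for pid in survey.keys() & admin.keys()}
        print(linked)   # only P001 appears in both files

        # with a corrupted identifier ("P001" mistyped as "PO01") the
        # same join would find nothing, even though the records match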

    Low-redundancy codes for correcting multiple short-duplication and edit errors

    Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most $p$ edits, where a short duplication generates a copy of a substring with length $\le 3$ and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to $p$ edits (in addition to duplications) at the additional cost of roughly $8p(\log_q n)(1+o(1))$ symbols of redundancy, thus achieving the same asymptotic rate, where $q \ge 4$ is the alphabet size and $p$ is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when $p$ is a constant with respect to the code length.
    Comment: 21 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at ISIT2021 and ISIT202
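
    The channel these codes are built for can be illustrated with a short Python simulator (an illustrative assumption only; the code construction itself is beyond a sketch). It applies a few short tandem duplications of length at most 3 and up to $p$ edits to a quaternary string:

        import random

        def short_duplication(s, rng, max_len=3):
            # tandem duplication: copy a substring of length <= max_len
            # and insert the copy right after the original occurrence
            k = rng.randint(1, max_len)
            i = rng.randint(0, len(s) - k)
            return s[:i + k] + s[i:i + k] + s[i + k:]

        def random_edit(s, rng, alphabet="ACGT"):
            # an edit is a substitution, deletion, or insertion
            op = rng.choice(("sub", "del", "ins"))
            i = rng.randint(0, len(s) - 1)
            if op == "sub":
                return s[:i] + rng.choice(alphabet) + s[i + 1:]
            if op == "del":
                return s[:i] + s[i + 1:]
            return s[:i] + rng.choice(alphabet) + s[i:]

        rng = random.Random(7)
        codeword = "ACGTACGGTTAC"       # stands in for an encoded string
        received = codeword
        for _ in range(3):              # a few short duplications
            received = short_duplication(received, rng)
        for _ in range(2):              # at most p = 2 edits
            received = random_edit(received, rng)
        print(codeword, "->", received)

    A decoder for the proposed codes must recover the original codeword from any such received string, provided the number of edits does not exceed $p$.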

    A comparison of personal name matching: Techniques and practical issues

    Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, information retrieval and extraction, search engines, to deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques based on phonetic encoding or pattern matching have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered. In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We give an overview of a comprehensive number of commonly used, as well as some recently developed, name matching techniques. Experimental comparisons on four large name data sets indicate that there is no clear best technique. We provide a series of recommendations that will help researchers and practitioners to select a name matching technique suitable for a given data set.
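
    One of the classic phonetic encodings evaluated in such comparisons is Soundex, which maps variant spellings of a name to a short code. The Python implementation below follows the standard algorithm (it is not code from the paper):

        def soundex(name: str) -> str:
            # classic Soundex: keep the first letter, turn the remaining
            # consonants into digits, collapse adjacent equal digits,
            # skip vowels, and pad the code to four characters
            digits = {**dict.fromkeys("BFPV", "1"),
                      **dict.fromkeys("CGJKQSXZ", "2"),
                      **dict.fromkeys("DT", "3"), "L": "4",
                      **dict.fromkeys("MN", "5"), "R": "6"}
            name = "".join(c for c in name.upper() if c.isalpha())
            if not name:
                return ""
            code, prev = name[0], digits.get(name[0], "")
            for c in name[1:]:
                d = digits.get(c, "")
                if d and d != prev:
                    code += d
                if c not in "HW":   # H and W do not break a digit run
                    prev = d
            return (code + "000")[:4]

        # variant spellings of one surname collapse to the same code
        for n in ("Meyer", "Meier", "Mayr", "Miller"):
            print(n, soundex(n))    # M600, M600, M600, M460

    Phonetic codes like this are typically used to generate candidate pairs cheaply, with a finer-grained pattern matching technique applied afterwards to rank the candidates.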