
    Unsupervised record matching with noisy and incomplete data

    We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real-world entity. This task is complicated by noise (such as misspellings) and missing data, which can cause records to differ despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records into "unique entities", and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency (TF-IDF) methods, and n-gram techniques. In particular, we introduce a vectorized soft TF-IDF method with an optional refinement step. We also discuss two methods for handling missing data when computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that methods using words as the basic units are preferable to those using 3-grams. Moreover, in some (but certainly not all) parameter ranges, soft TF-IDF methods can outperform the standard TF-IDF method. The results also confirm that our method for automatically determining the number of groups typically works well and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
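
    To make the scoring step concrete, the following is a minimal sketch using off-the-shelf word-level TF-IDF with cosine similarity from scikit-learn, not the paper's vectorized soft TF-IDF; the toy records and the 0.3 grouping threshold are illustrative assumptions, not values from the paper.

```python
# Sketch: pairwise record similarity via word-level TF-IDF + cosine similarity.
# The records and the 0.3 threshold below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "Fontaine Restaurant 1234 Main St Los Angeles",
    "Fontaine Restaraunt 1234 Main Street LA",   # misspelling + abbreviations
    "Blue Cafe 98 Ocean Ave Santa Monica",
]

# Word tokens as the basic units; the paper found these preferable to 3-grams.
vectorizer = TfidfVectorizer(analyzer="word")
tfidf = vectorizer.fit_transform(records)

# Dense matrix of pairwise cosine similarities between record vectors.
scores = cosine_similarity(tfidf)

# Flag pairs whose similarity exceeds an (assumed) grouping threshold.
# With these toy records, only the pair (0, 1) exceeds it.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if scores[i, j] > 0.3:
            print(f"records {i} and {j} may refer to the same entity "
                  f"(score {scores[i, j]:.2f})")
```

    Plain TF-IDF treats "Restaurant" and "Restaraunt" as unrelated tokens; the soft TF-IDF variant discussed in the paper additionally credits near-matching tokens, which is exactly what makes it attractive for misspelled data.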

    A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage

    Record linkage, also referred to as entity resolution, is the process of identifying records that represent the same real-world entity (e.g. a person) across varied data sources. To reduce the computational complexity associated with record comparisons, a task referred to as blocking is commonly performed prior to the linkage process. The blocking task involves partitioning records into blocks and treating records from different blocks as unrelated to the same entity; record linkage methods are then applied within each block, significantly reducing the number of record comparisons (a minimal sketch follows below). Most existing blocking techniques require some degree of parameter selection to optimise performance for a particular dataset (e.g. the attributes and blocking functions used for splitting records into blocks). Optimal parameters can be selected manually, but this is expensive in terms of time and cost and assumes that a domain expert is available. Automatic supervised blocking techniques have been proposed; however, they require a set of labelled data in which the matching status of each record is known. In the majority of real-world scenarios, we have no information regarding the matching status of records obtained from multiple sources. There is therefore a demand for blocking techniques that sufficiently reduce the number of record comparisons with little to no human input or labelled data. Given the importance of the problem, recent research efforts have produced novel unsupervised and semi-supervised blocking techniques. In this chapter, we review existing blocking techniques and discuss their advantages and disadvantages. We also outline research areas that have recently arisen and discuss unresolved issues that remain to be addressed.
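
    The sketch below illustrates the basic blocking idea described above: records sharing a blocking key land in the same block, and candidate pairs are generated only within blocks. The records, fields, and first-letter-of-surname key function are illustrative assumptions, not a technique prescribed by the chapter.

```python
# Sketch: standard blocking with an assumed blocking key.
# Records, field names, and the key function are illustrative only.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith", "city": "London"},
    {"id": 2, "surname": "Smyth", "city": "London"},
    {"id": 3, "surname": "Jones", "city": "Leeds"},
]

def blocking_key(record):
    # Illustrative blocking function; choosing it well is exactly the
    # parameter-selection problem the chapter discusses.
    return record["surname"][0].upper()

# Partition records into blocks by their key.
blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Generate candidate pairs only within each block, instead of comparing
# all O(n^2) record pairs.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)]: Smith/Smyth share a block; Jones does not
```

    Note the trade-off this example exposes: a coarser key yields larger blocks and more comparisons, while a finer key risks splitting true matches (e.g. a surname typo in its first letter) into different blocks.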