Preparation of name and address data for record linkage using hidden Markov models
BACKGROUND: Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs). METHODS: HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems. RESULTS: Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse when used with simpler name data. Possible reasons for this poorer performance are discussed. CONCLUSION: Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex, variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.
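To make the approach concrete, here is a minimal sketch of the lexicon-tokenisation-plus-HMM idea: raw tokens are mapped to coarse observation classes by a toy lexicon, and Viterbi decoding then labels each token with the address field it most likely belongs to. The states, lexicon entries, and hand-set probabilities below are illustrative assumptions only; the paper's models are trained on annotated Australian data and distributed as open-source software.

```python
import math

# Toy state space and lexicon: illustrative assumptions, not the paper's
# trained Australian models.
STATES = ["number", "wayfare_name", "wayfare_type", "locality"]
WAYFARE_WORDS = {"st", "street", "rd", "road", "ave", "avenue", "cres", "crescent"}

def token_class(tok):
    """Lexicon-based tokenisation: map a raw token to a coarse observation symbol."""
    if tok.isdigit():
        return "NUM"
    if tok.lower() in WAYFARE_WORDS:
        return "WTYPE"
    return "WORD"

# Hand-set toy parameters; the paper estimates these from annotated training records.
START = {"number": 0.70, "wayfare_name": 0.20, "wayfare_type": 0.05, "locality": 0.05}
TRANS = {
    "number":       {"number": 0.05, "wayfare_name": 0.85, "wayfare_type": 0.05, "locality": 0.05},
    "wayfare_name": {"number": 0.02, "wayfare_name": 0.33, "wayfare_type": 0.55, "locality": 0.10},
    "wayfare_type": {"number": 0.02, "wayfare_name": 0.03, "wayfare_type": 0.05, "locality": 0.90},
    "locality":     {"number": 0.02, "wayfare_name": 0.03, "wayfare_type": 0.05, "locality": 0.90},
}
EMIT = {
    "number":       {"NUM": 0.90, "WTYPE": 0.02, "WORD": 0.08},
    "wayfare_name": {"NUM": 0.05, "WTYPE": 0.05, "WORD": 0.90},
    "wayfare_type": {"NUM": 0.02, "WTYPE": 0.90, "WORD": 0.08},
    "locality":     {"NUM": 0.05, "WTYPE": 0.05, "WORD": 0.90},
}

def standardise(tokens):
    """Viterbi decoding: return the most likely field label for each token."""
    obs = [token_class(t) for t in tokens]
    # best[s] = (log-probability, labels) of the best path ending in state s
    best = {s: (math.log(START[s] * EMIT[s][obs[0]]), [s]) for s in STATES}
    for sym in obs[1:]:
        best = {
            s: max((lp + math.log(TRANS[prev][s] * EMIT[s][sym]), labels + [s])
                   for prev, (lp, labels) in best.items())
            for s in STATES
        }
    return max(best.values())[1]

tokens = "12 main st perth".split()
print(list(zip(tokens, standardise(tokens))))
# [('12', 'number'), ('main', 'wayfare_name'), ('st', 'wayfare_type'), ('perth', 'locality')]
```

The appeal of the probabilistic formulation is visible even in this toy: extending coverage means adding lexicon entries and re-estimating probabilities from examples, rather than writing and ordering new hand-crafted rules.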
Benchmarking natural-language parsers for biological applications using dependency graphs
BACKGROUND: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria. RESULTS: Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, then test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding those of native dependency parsers on similar tasks in previous biological evaluations. CONCLUSION: Evaluating with dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and absorbing many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques.
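As a concrete illustration of the evaluation scheme, the sketch below reduces each parse to a set of (head, dependent, relation) dependency triples and scores a parser against the gold standard, optionally restricted to the relation types a particular application cares about. The triple format, relation names, and example dependencies are assumptions for illustration; the paper's graphs are derived from GENIA annotations and from converted parser output.

```python
# Dependency-graph evaluation sketch: compare parses as sets of
# (head, dependent, relation) triples rather than as phrase-structure trees.

def to_triples(graph):
    """Normalise an iterable of (head, dependent, relation) into a set."""
    return {(h.lower(), d.lower(), r) for h, d, r in graph}

def score(gold, predicted, relations=None):
    """Precision/recall/F1 over dependency triples, optionally restricted to
    the relation types a target application needs (e.g. subjects and objects
    for extracting gene expression relationships)."""
    g, p = to_triples(gold), to_triples(predicted)
    if relations is not None:
        g = {t for t in g if t[2] in relations}
        p = {t for t in p if t[2] in relations}
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented triples for "the cell expresses gene ..." style sentences.
gold = [("expresses", "gene", "dobj"), ("expresses", "cell", "nsubj"),
        ("gene", "the", "det")]
pred = [("expresses", "gene", "dobj"), ("expresses", "line", "nsubj"),
        ("gene", "the", "det")]
print(score(gold, pred))                               # all relations
print(score(gold, pred, relations={"nsubj", "dobj"}))  # task-tailored subset
```

Filtering by relation type is what lets the same gold standard serve many applications: a gene-expression extractor can be scored only on the subject and object attachments it actually consumes, so determiner or punctuation attachments never count as errors.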
Improved Arabic-to-English statistical machine translation by reordering post-verbal subjects for word alignment
We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.
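A minimal sketch of the alignment-time reordering, assuming the verb and subject token spans have already been identified (in the paper, by an Arabic dependency parser): the post-verbal subject span is moved in front of the verb before word alignment only, while the original order is retained elsewhere. The function name, span format, and transliterated example are illustrative assumptions.

```python
# VS -> SV reordering for word alignment: move the post-verbal subject span
# in front of the verb span. Spans are (start, end) token index pairs with
# end exclusive; detecting them accurately is the hard part in practice.

def reorder_vs_to_sv(tokens, verb_span, subject_span):
    """Return the token list with the subject span moved before the verb span."""
    v_start, v_end = verb_span
    s_start, s_end = subject_span
    assert v_end <= s_start, "expected a post-verbal subject (VS construction)"
    return (tokens[:v_start]              # prefix before the verb
            + tokens[s_start:s_end]       # subject, moved forward
            + tokens[v_start:v_end]       # verb
            + tokens[v_end:s_start]       # any material between verb and subject
            + tokens[s_end:])             # suffix after the subject

# "ktb AlrjlrsAlp" is shown transliterated: "wrote the-man a-letter".
print(reorder_vs_to_sv(["ktb", "Alrjl", "rsAlp"], (0, 1), (1, 2)))
# -> ['Alrjl', 'ktb', 'rsAlp']  (SV order, matching English "the man wrote ...")
```

Because the reordered sentences are used only to run the word aligner, any span-detection noise affects alignment quality but never surfaces directly in the translation output, which is consistent with the reported robustness to noisy parses.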