How to make the most of NE dictionaries in statistical NER
Authors
A McCallum
B Settles
D Okanohara
EF Tjong Kim Sang
GD Zhou
J Aoe
J Finkel
J Kazama
J Lafferty
J-D Kim
JD Kim
John McNaught
K Franzen
K Fukuda
K Yamamoto
K-M Park
KJ Lee
L Tanabe
LE Baum
M Rössler
N Collier
S Kim
Sophia Ananiadou
T Kudo
TH Tsai
Y Song
Yoshimasa Tsuruoka
Yutaka Sasaki
Publication date
1 January 2008
Publisher
BioMed Central
Abstract
Background: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution, even when large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straightforward to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary leads to better performance. In reality, however, NER models must be retrained to achieve this. We chose protein name recognition as a case study because it suffers most from the problems of heavy term variation and ambiguity.

Methods: We have established a novel way to improve NER performance by adding NEs to an NE dictionary without retraining. In our approach, known NEs are first identified in parallel with Part-of-Speech (POS) tagging, based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached.

Results: We evaluated the performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary, without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names.

Conclusion: Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.

© 2008 Sasaki et al; licensee BioMed Central Ltd
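The first stage described under Methods, tagging known NEs alongside ordinary POS tags from two dictionaries, can be illustrated with a minimal sketch. This is not the authors' tagger: it swaps in a simple greedy longest-match lookup, and the function name `tag_tokens`, the toy dictionaries, the example sentence, and the default tag `NN` for unknown words are all assumptions made for illustration. The point it shows is that a downstream statistical model consuming POS/PROTEIN tags sees different input once an entry is added to the NE dictionary, without the model itself being retrained.

```python
from typing import Dict, List, Set, Tuple

def tag_tokens(
    tokens: List[str],
    word_pos: Dict[str, str],
    ne_dict: Set[Tuple[str, ...]],
    max_len: int = 4,
) -> List[Tuple[str, str]]:
    """Tag each token as PROTEIN (NE dictionary hit) or with a POS tag.

    Hypothetical simplification: greedy longest match over lowercased
    token n-grams, falling back to a word-level POS dictionary.
    """
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in ne_dict:
                match_len = n
                break
        if match_len:
            tagged.extend((tok, "PROTEIN") for tok in tokens[i:i + match_len])
            i += match_len
        else:
            # Assumed default: unknown words are tagged as nouns.
            tagged.append((tokens[i], word_pos.get(tokens[i].lower(), "NN")))
            i += 1
    return tagged

# Toy dictionaries and sentence (illustrative only).
word_pos = {"receptor": "NN", "activates": "VBZ", "b": "NN"}
ne_dict = {("il-2", "receptor")}
tokens = ["IL-2", "receptor", "activates", "NF-kappa", "B"]

before = tag_tokens(tokens, word_pos, ne_dict)

# Add a new protein name to the NE dictionary; no model is retrained.
ne_dict.add(("nf-kappa", "b"))
after = tag_tokens(tokens, word_pos, ne_dict)

print(before)
print(after)
```

In this sketch, `before` tags "NF-kappa B" as ordinary nouns while `after` tags both tokens as PROTEIN. A statistical NER model trained on such tag sequences would receive the new PROTEIN tags as features at prediction time, which is the mechanism the abstract relies on for improving performance without retraining.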