KnowNER: Incremental Multilingual Knowledge in Named Entity Recognition
KnowNER is a multilingual Named Entity Recognition (NER) system that
leverages different degrees of external knowledge. A novel modular framework
divides the knowledge into four categories according to the depth of knowledge
they convey. Each category consists of a set of features automatically
generated from different information sources (such as a knowledge-base, a list
of names or document-specific semantic annotations) and is used to train a
conditional random field (CRF). Since those information sources are usually
multilingual, KnowNER can be easily trained for a wide range of languages. In
this paper, we show that the incorporation of deeper knowledge systematically
boosts accuracy and compare KnowNER with state-of-the-art NER approaches across
three languages (English, German, and Spanish), performing among
state-of-the-art systems in all of them.
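A minimal sketch of the kind of feature-based CRF training the abstract describes, using sklearn-crfsuite rather than the authors' implementation; the gazetteer, feature names, and toy data are illustrative assumptions, not KnowNER's actual feature set.

```python
# Sketch: a CRF trained on features drawn from an external knowledge source
# (here a tiny name list standing in for a gazetteer). Not the authors' code.
import sklearn_crfsuite

GAZETTEER = {"berlin", "germany", "madrid"}  # hypothetical name list

def word_features(sent, i):
    word = sent[i]
    feats = {
        "word.lower": word.lower(),                 # surface-level feature
        "word.istitle": word.istitle(),
        "in_gazetteer": word.lower() in GAZETTEER,  # "name list" knowledge
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True
    return feats

def sent_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Toy training data; real systems train on annotated corpora.
X_train = [sent_features(["Angela", "visited", "Berlin"])]
y_train = [["B-PER", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([sent_features(["She", "left", "Madrid"])]))
```

Since the information sources feeding such features are themselves multilingual, the same pipeline can be retrained for other languages by swapping in language-appropriate resources.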
Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads
text as bytes and outputs span annotations of the form [start, length, label]
where start positions, lengths, and labels are separate entries in our
vocabulary. Because we operate directly on unicode bytes rather than
language-specific words or characters, we can analyze text in many languages
with a single model. Due to the small vocabulary size, these multilingual
models are very compact, but produce results similar to or better than the
state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that
use only the provided training datasets (no external data sources). Our models
are learning "from scratch" in that they do not rely on any elements of the
standard pipeline in Natural Language Processing (including tokenization), and
thus can run in standalone fashion on raw text.
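To make the output format concrete, here is an illustrative decoder for the span vocabulary the abstract describes, where start positions, lengths, and labels are separate entries in one flat token sequence. This sketches only the annotation format, not the LSTM model itself, and the label set is assumed.

```python
# Sketch: decoding a flat [start, length, label, ...] token sequence into
# span annotations over the raw byte string the model consumed.

LABELS = {"PER", "LOC", "ORG"}  # assumed label inventory

def decode_spans(tokens):
    """Group a flat sequence into (start, length, label) spans."""
    spans = []
    for i in range(0, len(tokens) - 2, 3):
        start, length, label = tokens[i], tokens[i + 1], tokens[i + 2]
        if label in LABELS:
            spans.append((start, length, label))
    return spans

text = "Jim lives in Oslo"
data = text.encode("utf-8")  # the model reads raw unicode bytes
print(decode_spans([0, 3, "PER", 13, 4, "LOC"]))
# -> [(0, 3, 'PER'), (13, 4, 'LOC')]; data[13:17] == b'Oslo'
```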
Neural Named Entity Recognition from Subword Units
Named entity recognition (NER) is a vital task in spoken language
understanding, which aims to identify mentions of named entities in text, e.g.,
from transcribed speech. Existing neural models for NER rely mostly on
dedicated word-level representations, which suffer from two main shortcomings.
First, the vocabulary size is large, yielding large memory requirements and
training time. Second, these models are not able to learn morphological or
phonological representations. To remedy the above shortcomings, we adopt a
neural solution based on bidirectional LSTMs and conditional random fields,
where we rely on subword units, namely characters, phonemes, and bytes. For
each word in an utterance, our model learns a representation from each of the
subword units. We conducted experiments in a real-world large-scale setting for
the use case of a voice-controlled device covering four languages with up to
5.5M utterances per language. Our experiments show that (1) with increasing
training data, performance of models trained solely on subword units becomes
closer to that of models with dedicated word-level embeddings (91.35 vs 93.92
F1 for English), while using a much smaller vocabulary size (332 vs 74K), (2)
subword units enhance models with dedicated word-level embeddings, and (3)
combining different subword units improves performance.
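A minimal PyTorch sketch of the core idea: each word's subword symbols (characters here; phonemes and bytes work the same way) are run through a bidirectional LSTM whose final states serve as the word representation. Dimensions and the byte-as-character encoding are illustrative assumptions.

```python
# Sketch: building a word representation from subword units with a BiLSTM.
# Not the authors' model; dimensions are illustrative.
import torch
import torch.nn as nn

class SubwordEncoder(nn.Module):
    def __init__(self, n_symbols=332, sym_dim=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, sym_dim)
        self.lstm = nn.LSTM(sym_dim, hidden, bidirectional=True,
                            batch_first=True)

    def forward(self, symbol_ids):  # (1, word_len) subword symbol indices
        _, (h, _) = self.lstm(self.embed(symbol_ids))
        # concatenate final forward and backward states -> word vector
        return torch.cat([h[0], h[1]], dim=-1)

enc = SubwordEncoder()
word_ids = torch.tensor([[ord(c) for c in "hello"]])  # toy symbol ids
print(enc(word_ids).shape)  # torch.Size([1, 64])
```

In a full tagger, these word vectors would feed a sentence-level BLSTM-CRF of the kind the abstract describes; note how small the subword vocabulary stays (332 symbols vs. 74K words in the reported English setup).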
Chinese named entity recognition using lexicalized HMMs
This paper presents a lexicalized HMM-based approach to Chinese named entity recognition (NER). To tackle the problem of unknown words, we unify unknown word identification and NER as a single tagging task on a sequence of known words. To do this, we first employ a known-word bigram-based model to segment a sentence into a sequence of known words, and then apply the uniformly lexicalized HMMs to assign each known word a proper hybrid tag that indicates its pattern in forming an entity and the category of the formed entity. Our system is able to integrate both the internal formation patterns and the surrounding contextual clues for NER under the framework of HMMs. As a result, the performance of the system can be improved without losing its efficiency in training and tagging. We have tested our system using different public corpora. The results show that lexicalized HMMs can substantially improve NER performance over standard HMMs. The results also indicate that character-based tagging (viz. the tagging based on pure single-character words) is comparable to and can even outperform the relevant known-word based tagging when a lexicalization technique is applied.
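The hybrid-tag idea can be sketched as a standard Viterbi search over tags that combine an entity-formation pattern with an entity category. The tag set, toy log-probabilities, and example words below are assumptions for illustration, not trained parameters.

```python
# Sketch: Viterbi decoding over hybrid tags (pattern + entity category),
# as in HMM-based NER. Probabilities are toy values.
TAGS = ["O", "B-PER", "E-PER"]
# toy log-probabilities: EMIT[tag][word], TRANS[prev_tag][tag]
EMIT = {"O": {"met": -0.1}, "B-PER": {"zhang": -0.2}, "E-PER": {"wei": -0.2}}
TRANS = {"<s>": {"O": -0.5, "B-PER": -1.0},
         "O": {"O": -0.7, "B-PER": -0.9},
         "B-PER": {"E-PER": -0.1},
         "E-PER": {"O": -0.3}}

def viterbi(words):
    best = {"<s>": (0.0, [])}          # prev_tag -> (score, tag path)
    for w in words:
        nxt = {}
        for prev, (score, path) in best.items():
            for tag, tp in TRANS.get(prev, {}).items():
                ep = EMIT.get(tag, {}).get(w, -5.0)  # unseen-word penalty
                cand = score + tp + ep
                if tag not in nxt or cand > nxt[tag][0]:
                    nxt[tag] = (cand, path + [tag])
        best = nxt
    return max(best.values())[1]

print(viterbi(["zhang", "wei"]))  # -> ['B-PER', 'E-PER']
```

Lexicalization, in this view, amounts to conditioning the emission and transition scores on the known words themselves rather than on tags alone.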
A System for Identifying Named Entities in Biomedical Text: how Results From two Evaluations Reflect on Both the System and the Evaluations
We present a maximum entropy-based system for identifying named entities (NEs) in
biomedical abstracts and present its performance in the only two biomedical named
entity recognition (NER) comparative evaluations that have been held to date, namely
BioCreative and Coling BioNLP. Our system obtained an exact match F-score of
83.2% in the BioCreative evaluation and 70.1% in the BioNLP evaluation. We discuss
our system in detail, including its rich use of local features, attention to correct
boundary identification, innovative use of external knowledge resources, including
parsing and web searches, and rapid adaptation to new NE sets. We also discuss
in depth problems with data annotation in the evaluations which caused the final
performance to be lower than optimal.
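A maximum-entropy token classifier of the kind described is, in practice, multinomial logistic regression over rich local feature dictionaries. The sketch below uses scikit-learn as a stand-in; feature names and the toy sentence are assumptions, and the real system additionally drew on external resources such as parsing and web searches.

```python
# Sketch: maximum-entropy (multinomial logistic regression) NE token
# classifier with local features. Illustrative, not the authors' system.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    t = tokens[i]
    return {
        "lower": t.lower(),
        "suffix3": t[-3:],                          # morphology cue
        "has_digit": any(c.isdigit() for c in t),   # e.g. "p53"
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

sent = ["The", "p53", "protein", "binds", "DNA"]
X = [token_features(sent, i) for i in range(len(sent))]
y = ["O", "B-protein", "I-protein", "O", "B-DNA"]  # toy labels

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=200))
model.fit(X, y)
print(model.predict([token_features(["p53"], 0)]))
```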
Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media
Recognizing named entities in a document is a key task in many NLP
applications. Although current state-of-the-art approaches to this task reach a
high performance on clean text (e.g. newswire genres), those algorithms
dramatically degrade when they are moved to noisy environments such as social
media domains. We present two systems that address the challenges of processing
social media data using character-level phonetics and phonology, word
embeddings, and Part-of-Speech tags as features. The first model is a multitask
end-to-end Bidirectional Long Short-Term Memory (BLSTM)-Conditional Random
Field (CRF) network whose output layer contains two CRF classifiers. The second
model uses a multitask BLSTM network as feature extractor that transfers the
learning to a CRF classifier for the final prediction. Our systems outperform
the current F1 scores of the state of the art on the Workshop on Noisy
User-generated Text 2017 dataset by 2.45% and 3.69%, establishing a more
suitable approach for social media environments.
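The multitask architecture can be sketched as a shared BLSTM encoder with two task-specific output heads. This sketch uses plain linear heads to stay short, whereas the paper's first model uses CRF classifiers in the output layer; dimensions, vocabulary size, and the auxiliary task are assumptions.

```python
# Sketch: shared BLSTM encoder with two task heads (multitask learning).
# The paper's heads are CRFs; linear classifiers keep this example short.
import torch
import torch.nn as nn

class MultitaskBLSTM(nn.Module):
    def __init__(self, vocab=5000, emb=64, hidden=64, n_ner=9, n_seg=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.blstm = nn.LSTM(emb, hidden, bidirectional=True,
                             batch_first=True)
        self.ner_head = nn.Linear(2 * hidden, n_ner)  # entity-type task
        self.seg_head = nn.Linear(2 * hidden, n_seg)  # auxiliary task

    def forward(self, token_ids):                 # (batch, seq_len)
        h, _ = self.blstm(self.embed(token_ids))  # shared representation
        return self.ner_head(h), self.seg_head(h)

model = MultitaskBLSTM()
ner_logits, seg_logits = model(torch.randint(0, 5000, (2, 10)))
print(ner_logits.shape, seg_logits.shape)  # (2, 10, 9) (2, 10, 3)
```

The paper's second model corresponds to taking the shared BLSTM states as features and handing them to a separate CRF classifier for the final prediction.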