35 research outputs found

    KnowNER: Incremental Multilingual Knowledge in Named Entity Recognition

    KnowNER is a multilingual Named Entity Recognition (NER) system that leverages different degrees of external knowledge. A novel modular framework divides the knowledge into four categories according to the depth of knowledge they convey. Each category consists of a set of features automatically generated from different information sources (such as a knowledge base, a list of names, or document-specific semantic annotations) and is used to train a conditional random field (CRF). Since those information sources are usually multilingual, KnowNER can be easily trained for a wide range of languages. In this paper, we show that the incorporation of deeper knowledge systematically boosts accuracy, and we compare KnowNER with state-of-the-art NER approaches across three languages (English, German, and Spanish), where it performs among state-of-the-art systems in all of them.

    Multilingual Language Processing From Bytes

    We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label], where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
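The [start, length, label] output format described in this abstract can be illustrated with a minimal sketch. It only shows how byte-offset spans map back to text; it is not the BTS model itself. The function name `decode_spans`, the example sentence, and the labels `PER` and `LOC` are assumptions made for illustration.

```python
from typing import List, Tuple

# (start byte offset, length in bytes, label); the label names are hypothetical
Span = Tuple[int, int, str]


def decode_spans(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Map byte-level [start, length, label] spans back to surface strings."""
    data = text.encode("utf-8")  # the model reads bytes, not characters
    decoded = []
    for start, length, label in spans:
        chunk = data[start:start + length]
        decoded.append((chunk.decode("utf-8"), label))
    return decoded


# Non-ASCII letters occupy two bytes each in UTF-8, so byte offsets
# differ from character offsets in this example.
text = "Óscar lives in München"
spans = [(0, 6, "PER"), (16, 8, "LOC")]
print(decode_spans(text, spans))
```

Because offsets are counted in bytes, the same span format works for any language without a tokenizer, which is the property the abstract emphasizes.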

    Neural Named Entity Recognition from Subword Units

    Named entity recognition (NER) is a vital task in spoken language understanding, which aims to identify mentions of named entities in text, e.g., from transcribed speech. Existing neural models for NER rely mostly on dedicated word-level representations, which suffer from two main shortcomings. First, the vocabulary size is large, yielding large memory requirements and training time. Second, these models are not able to learn morphological or phonological representations. To remedy the above shortcomings, we adopt a neural solution based on bidirectional LSTMs and conditional random fields, where we rely on subword units, namely characters, phonemes, and bytes. For each word in an utterance, our model learns a representation from each of the subword units. We conducted experiments in a real-world large-scale setting for the use case of a voice-controlled device covering four languages with up to 5.5M utterances per language. Our experiments show that (1) with increasing training data, performance of models trained solely on subword units becomes closer to that of models with dedicated word-level embeddings (91.35 vs 93.92 F1 for English), while using a much smaller vocabulary size (332 vs 74K), (2) subword units enhance models with dedicated word-level embeddings, and (3) combining different subword units improves performance. Comment: 5 pages, INTERSPEECH 201
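As a rough illustration of the subword units named in this abstract, the sketch below splits a word into characters and UTF-8 bytes. Phonemes are omitted because they would require a pronunciation lexicon that the abstract does not describe. The helper name `subword_units` is hypothetical, and this shows only the input segmentation, not the BiLSTM-CRF model.

```python
from typing import Dict, List, Union


def subword_units(word: str) -> Dict[str, List[Union[str, int]]]:
    """Return the character and UTF-8 byte sequences for one word."""
    return {
        "chars": list(word),                  # one symbol per Unicode character
        "bytes": list(word.encode("utf-8")),  # integers in the range 0..255
    }


units = subword_units("für")
print(units["chars"])
print(units["bytes"])
```

A byte inventory can never exceed 256 symbols, which is one reason subword vocabularies stay far smaller than word-level ones.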

    Chinese named entity recognition using lexicalized HMMs

    This paper presents a lexicalized HMM-based approach to Chinese named entity recognition (NER). To tackle the problem of unknown words, we unify unknown word identification and NER as a single tagging task on a sequence of known words. To do this, we first employ a known-word bigram-based model to segment a sentence into a sequence of known words, and then apply the uniformly lexicalized HMMs to assign each known word a proper hybrid tag that indicates its pattern in forming an entity and the category of the formed entity. Our system is able to integrate both the internal formation patterns and the surrounding contextual clues for NER under the framework of HMMs. As a result, the performance of the system can be improved without losing its efficiency in training and tagging. We have tested our system using different public corpora. The results show that lexicalized HMMs can substantially improve NER performance over standard HMMs. The results also indicate that character-based tagging (viz. tagging based on pure single-character words) is comparable to and can even outperform the relevant known-word based tagging when a lexicalization technique is applied.

    A System for Identifying Named Entities in Biomedical Text: how Results From two Evaluations Reflect on Both the System and the Evaluations

    We present a maximum entropy-based system for identifying named entities (NEs) in biomedical abstracts and present its performance in the only two biomedical named entity recognition (NER) comparative evaluations that have been held to date, namely BioCreative and Coling BioNLP. Our system obtained an exact match F-score of 83.2% in the BioCreative evaluation and 70.1% in the BioNLP evaluation. We discuss our system in detail, including its rich use of local features, attention to correct boundary identification, innovative use of external knowledge resources, including parsing and web searches, and rapid adaptation to new NE sets. We also discuss in depth the problems with data annotation in the evaluations, which caused the final performance to be lower than optimal.

    Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media

    Recognizing named entities in a document is a key task in many NLP applications. Although current state-of-the-art approaches to this task reach high performance on clean text (e.g., newswire genres), those algorithms degrade dramatically when they are moved to noisy environments such as social media domains. We present two systems that address the challenges of processing social media data using character-level phonetics and phonology, word embeddings, and Part-of-Speech tags as features. The first model is a multitask end-to-end Bidirectional Long Short-Term Memory (BLSTM)-Conditional Random Field (CRF) network whose output layer contains two CRF classifiers. The second model uses a multitask BLSTM network as a feature extractor that transfers the learning to a CRF classifier for the final prediction. Our systems outperform the current F1 scores of the state of the art on the Workshop on Noisy User-generated Text 2017 dataset by 2.45% and 3.69%, establishing a more suitable approach for social media environments. Comment: NAACL 201