
    Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

    We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task, and discuss their performance.

    KnowNER: Incremental Multilingual Knowledge in Named Entity Recognition

    KnowNER is a multilingual Named Entity Recognition (NER) system that leverages different degrees of external knowledge. A novel modular framework divides the knowledge into four categories according to the depth of knowledge they convey. Each category consists of a set of features automatically generated from different information sources (such as a knowledge base, a list of names, or document-specific semantic annotations) and is used to train a conditional random field (CRF). Since those information sources are usually multilingual, KnowNER can be easily trained for a wide range of languages. In this paper, we show that incorporating deeper knowledge systematically boosts accuracy, and we compare KnowNER with state-of-the-art NER approaches across three languages (English, German, and Spanish), where it performs among the state-of-the-art systems in all of them.

    Lexicon Infused Phrase Embeddings for Named Entity Resolution

    Most state-of-the-art approaches for named-entity recognition (NER) use semi-supervised information in the form of word clusters and lexicons. Recently, neural network-based language models have been explored because, as a byproduct, they generate highly informative vector representations for words, known as word embeddings. In this paper we present two contributions: a new form of learning word embeddings that can leverage information from relevant lexicons to improve the representations, and the first system to use neural word embeddings to achieve state-of-the-art results on named-entity recognition in both CoNLL and Ontonotes NER. Our system achieves an F1 score of 90.90 on the test set for CoNLL 2003, significantly better than any previous system trained on public data, and matching a system employing massive private industrial query-log data. Comment: Accepted in CoNLL 201

    Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks

    Selecting optimal parameters for a neural network architecture can often make the difference between mediocre and state-of-the-art performance. However, little has been published on which parameters and design choices should be evaluated or selected, making correct hyperparameter optimization often a "black art that requires expert experiences" (Snoek et al., 2012). In this paper, we evaluate the importance of different network design choices and hyperparameters for five common linguistic sequence tagging tasks (POS, Chunking, NER, Entity Recognition, and Event Detection). We evaluated over 50,000 different setups and found that some parameters, like the pre-trained word embeddings or the last layer of the network, have a large impact on performance, while other parameters, for example the number of LSTM layers or the number of recurrent units, are of minor importance. We give a recommendation on a configuration that performs well across different tasks. Comment: 34 pages. 9 page version of this paper published at EMNLP 201

    The Application of Text Mining and Data Visualization Techniques to Textual Corpus Exploration

    Unstructured data in the digital universe is growing rapidly and shows no evidence of slowing anytime soon. With the accelerating growth of digital data being generated and stored on the World Wide Web, the prospect of information overload is much more prevalent now than it has been in the past. As a preemptive analytic measure, organizations across many industries have begun implementing text mining techniques to analyze such large sources of unstructured data. Utilizing various text mining techniques such as n-gram analysis, document and term frequency analysis, correlation analysis, and topic modeling methodologies, this research seeks to develop a tool that allows analysts to maneuver effectively and efficiently through large corpora of potentially unknown textual data. Additionally, this research explores two notional data exploration scenarios through a large corpus of text data, each exhibiting unique navigation methods analysts may elect to take. The research concludes with the validation of inferential results obtained through each corpus's exploration scenario.
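As a rough illustration of the n-gram and document/term frequency analyses named in the abstract above, the following minimal sketch counts terms and n-grams over a small list of documents. It is not the tool described in the thesis; the whitespace tokenizer and the function names (`tokenize`, `ngrams`, `term_frequency`, `document_frequency`, `ngram_counts`) are assumptions made for this example only.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency(docs):
    """Total number of occurrences of each term across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def document_frequency(docs):
    """Number of documents in which each term occurs at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def ngram_counts(docs, n):
    """Total number of occurrences of each n-gram across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(tokenize(doc), n))
    return counts
```

Calling `ngram_counts(docs, 2).most_common(10)` on a list of strings would list the ten most frequent bigrams, which is one simple way to get a first view of an unfamiliar corpus.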

    Chinese named entity recognition using lexicalized HMMs

    This paper presents a lexicalized HMM-based approach to Chinese named entity recognition (NER). To tackle the problem of unknown words, we unify unknown word identification and NER as a single tagging task on a sequence of known words. To do this, we first employ a known-word bigram-based model to segment a sentence into a sequence of known words, and then apply the uniformly lexicalized HMMs to assign each known word a proper hybrid tag that indicates its pattern in forming an entity and the category of the formed entity. Our system is able to integrate both the internal formation patterns and the surrounding contextual clues for NER under the framework of HMMs. As a result, the performance of the system can be improved without losing its efficiency in training and tagging. We have tested our system using different public corpora. The results show that lexicalized HMMs can substantially improve NER performance over standard HMMs. The results also indicate that character-based tagging (viz. the tagging based on pure single-character words) is comparable to and can even outperform the relevant known-word based tagging when a lexicalization technique is applied.

    Attribute Identification and Predictive Customisation Using Fuzzy Clustering and Genetic Search for Industry 4.0 Environments

    Today's factory involves more services and customisation. A paradigm shift towards “Industry 4.0” (i4) aims at realising mass customisation at a mass production cost. However, there is a lack of tools for customer informatics. This paper addresses this issue and develops a predictive analytics framework integrating big data analysis and business informatics, using Computational Intelligence (CI). In particular, fuzzy c-means is used for pattern recognition, as well as for managing relevant big data to feed potential customer needs and wants into the design stage, improving productivity for customised mass production. The selection of patterns from big data is performed using a genetic algorithm with fuzzy c-means, which helps with clustering and selection of optimal attributes. The case study shows that fuzzy c-means is able to assign new clusters with growing knowledge of customer needs and wants. The dataset has three types of entities: specification of various characteristics, assigned insurance risk rating, and normalised losses in use compared with other cars. The fuzzy c-means tool offers a number of features suitable for smart designs for an i4 environment.
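For readers unfamiliar with the fuzzy c-means clustering mentioned in the abstract above, the standard algorithm can be sketched in pure Python as follows. This is a generic textbook version and not the authors' framework: it omits the genetic search for attribute selection entirely, and the function names, the fuzzifier default `m=2.0`, and the random initialisation from data points are assumptions made for this example.

```python
import math
import random


def _memberships(point, centers, m):
    """Degree of membership of one point in each cluster (values sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate membership and center updates until the centers stop moving."""
    rng = random.Random(seed)
    # Assumed initialisation: pick distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    for _ in range(n_iter):
        u = [_memberships(p, centers, m) for p in points]
        new_centers = []
        for i in range(n_clusters):
            # Each center is the mean of all points weighted by membership^m.
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            new_centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, [_memberships(p, centers, m) for p in points]
```

Unlike hard k-means, every point receives a graded membership in every cluster, so a new record that sits between two groups is reported as partly belonging to both rather than being forced into one.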