94,426 research outputs found

    A Multi-task Approach for Named Entity Recognition in Social Media Data

    Named Entity Recognition for social media data is challenging because of its inherent noisiness: in addition to improper grammatical structures, it contains spelling inconsistencies and numerous informal abbreviations. We propose a novel multi-task approach that pairs a more general secondary task of Named Entity (NE) segmentation with the primary task of fine-grained NE categorization. The multi-task neural network architecture learns higher-order feature representations from word and character sequences along with basic Part-of-Speech tags and gazetteer information. This neural network acts as a feature extractor that feeds a Conditional Random Fields classifier. We obtained first position in the 3rd Workshop on Noisy User-generated Text (WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score. Comment: EMNLP 2017 (W-NUT)
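
    As a minimal sketch of the shape of such a multi-task setup, the PyTorch snippet below shares one BiLSTM encoder between a segmentation head and a categorization head. All dimensions, tag counts, and the plain per-token cross-entropy heads are illustrative assumptions; the paper's system additionally uses character, POS, and gazetteer features and feeds a Conditional Random Fields classifier.

```python
# Multi-task tagger sketch: one shared encoder, two task-specific heads.
import torch
import torch.nn as nn

class MultiTaskNER(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128,
                 n_seg_tags=3, n_cat_tags=13):   # hypothetical sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.seg_head = nn.Linear(2 * hidden_dim, n_seg_tags)  # NE segmentation
        self.cat_head = nn.Linear(2 * hidden_dim, n_cat_tags)  # fine-grained types

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.seg_head(states), self.cat_head(states)

model = MultiTaskNER(vocab_size=20000)
tokens = torch.randint(0, 20000, (4, 25))        # 4 sentences, 25 tokens each
seg_logits, cat_logits = model(tokens)
loss_fn = nn.CrossEntropyLoss()
seg_gold = torch.randint(0, 3, (4, 25))          # placeholder gold labels
cat_gold = torch.randint(0, 13, (4, 25))
# Joint objective: the secondary segmentation loss regularizes the encoder.
loss = (loss_fn(seg_logits.transpose(1, 2), seg_gold)
        + loss_fn(cat_logits.transpose(1, 2), cat_gold))
loss.backward()
```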

    Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning

    Named entity recognition and other information extraction tasks frequently use linguistic features such as part-of-speech tags or chunking. For languages where word boundaries are not readily identified in text, word segmentation is a key first step to generating features for an NER system. While using word boundary tags as features is helpful, the signals that aid in identifying these boundaries may provide richer information for an NER system. New state-of-the-art word segmentation systems use neural models to learn representations for predicting word boundaries. We show that these same representations, jointly trained with an NER system, yield significant improvements in NER for Chinese social media. In our experiments, jointly training NER and word segmentation with an LSTM-CRF model yields nearly 5% absolute improvement over previously published results. Comment: This is the camera-ready version of our ACL'16 paper. We also added supplementary material containing the results of our systems on a cleaner dataset (much higher F1 scores). For more information, please refer to the repo https://github.com/hltcoe/golden-horse
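
    The sketch below illustrates one way such representation sharing can look: a boundary-tagging encoder produces segmentation representations that the NER tagger consumes, and both losses update the shared parameters. Layer sizes and tagsets are assumed for illustration; this is a simplification, not the paper's LSTM-CRF model.

```python
# Joint CWS + NER sketch: the NER tagger reads the segmentation encoder's
# states, so the NER loss also trains the segmentation representations.
import torch
import torch.nn as nn

class SegAwareNER(nn.Module):
    def __init__(self, n_chars, emb=64, hid=100, n_seg=4, n_ner=9):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.seg_encoder = nn.LSTM(emb, hid, batch_first=True,
                                   bidirectional=True)
        self.seg_out = nn.Linear(2 * hid, n_seg)    # e.g. BMES boundary tags
        self.ner_encoder = nn.LSTM(emb + 2 * hid, hid, batch_first=True,
                                   bidirectional=True)
        self.ner_out = nn.Linear(2 * hid, n_ner)

    def forward(self, chars):
        e = self.embed(chars)
        seg_repr, _ = self.seg_encoder(e)           # boundary-aware states
        ner_repr, _ = self.ner_encoder(torch.cat([e, seg_repr], dim=-1))
        return self.seg_out(seg_repr), self.ner_out(ner_repr)
```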

    Cross Script Hindi English NER Corpus from Wikipedia

    The text generated on social media platforms is essentially mixed-lingual. The mixing of languages in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend upon the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems is hindered by the unavailability of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a cross-script Hindi-English corpus from Wikipedia category pages. The corpus is annotated using the standard CoNLL-2003 categories of PER, LOC, ORG, and MISC. Its evaluation is carried out with a variety of machine learning algorithms, and favorable results are achieved. Comment: International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI-2018)
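
    To make the annotation format concrete, here is a small reader sketch for CoNLL-style files (one token and tag per line, blank line between sentences); the exact column layout of the released corpus is an assumption here, as is the example token in the comment.

```python
# Read CoNLL-style NER annotations into (tokens, tags) pairs per sentence.
def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line = sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split()[:2]     # e.g. "Dilli B-LOC" (hypothetical)
            tokens.append(token)
            tags.append(tag)
    if tokens:                                # flush a trailing sentence
        sentences.append((tokens, tags))
    return sentences
```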

    Neural Adaptation Layers for Cross-domain Named Entity Recognition

    Recent research efforts have shown that neural architectures can be effective in conventional information extraction tasks such as named entity recognition, yielding state-of-the-art results on standard newswire datasets. However, despite the significant resources required to train such models, the performance of a model trained on one domain typically degrades dramatically when applied to a different domain, even though extracting entities from new, emerging domains such as social media can be of significant interest. In this paper, we empirically investigate effective methods for conveniently adapting an existing, well-trained neural NER model to a new domain. Unlike existing approaches, we propose lightweight yet effective methods for performing domain adaptation for neural models. Specifically, we introduce adaptation layers on top of existing neural architectures, where no re-training using the source-domain data is required. We conduct extensive empirical studies and show that our approach significantly outperforms state-of-the-art methods. Comment: 11 pages, accepted as a long paper at EMNLP 2018
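
    A minimal sketch of the general idea, with an assumed stand-in encoder: freeze the pretrained parameters and optimize only a lightweight adaptation module and a new output projection, so no source-domain retraining is needed. The paper's actual adaptation layers are more specific than this generic version.

```python
# Domain adaptation sketch: train only the adaptation layer and new head.
import torch
import torch.nn as nn

pretrained_encoder = nn.LSTM(100, 128, batch_first=True, bidirectional=True)
for p in pretrained_encoder.parameters():
    p.requires_grad = False                  # source model stays untouched

adaptation = nn.Sequential(nn.Linear(256, 256), nn.Tanh())
target_head = nn.Linear(256, 9)              # tagset of the new domain

optimizer = torch.optim.Adam(
    list(adaptation.parameters()) + list(target_head.parameters()), lr=1e-3)

x = torch.randn(4, 25, 100)                  # placeholder word features
states, _ = pretrained_encoder(x)
logits = target_head(adaptation(states))     # only these layers get gradients
```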

    A survey of methods to ease the development of highly multilingual text mining applications

    Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Self-training tools obviously alleviate the problem, but even the effort of providing training data and of manually tuning the results is usually considerable. In this paper, we gather insights from various multilingual system developers on how to minimise the effort of developing natural language processing applications for many languages. We also explain the main guidelines underlying our own effort to develop complex text mining software for tens of languages. While these guidelines, above all extreme simplicity, can be very restrictive and limiting, we believe we have shown the feasibility of the approach through the development of the Europe Media Monitor (EMM) family of applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex media monitoring tools that process and analyse up to 100,000 online news articles per day in between twenty and fifty languages. We also touch upon the kinds of language resources that would make it easier to develop highly multilingual text mining applications. We argue that the resources most needed to achieve this are freely available, simple, parallel and uniform multilingual dictionaries, corpora and software tools. Comment: 22 pages. Published online on 12 October 2011

    CAN-NER: Convolutional Attention Network for Chinese Named Entity Recognition

    Named entity recognition (NER) in Chinese is essential but difficult because of the lack of natural delimiters. Therefore, Chinese Word Segmentation (CWS) is usually considered the first step for Chinese NER. However, models based on word-level embeddings and lexicon features often suffer from segmentation errors and out-of-vocabulary (OOV) words. In this paper, we investigate a Convolutional Attention Network, called CAN, for Chinese NER. It consists of a character-based convolutional neural network (CNN) with a local-attention layer and a gated recurrent unit (GRU) with a global self-attention layer, capturing information from adjacent characters and from sentence contexts. Moreover, unlike other models, ours does not depend on any external resources such as lexicons, and its small character-embedding size makes it more practical. Extensive experimental results show that our approach outperforms state-of-the-art methods without word embeddings or external lexicon resources on datasets from different domains, including the Weibo, MSRA, and Chinese Resume NER datasets. Comment: This paper is accepted by NAACL-HLT 2019. The code is available at https://github.com/microsoft/vert-papers/tree/master/papers/CAN-NER
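
    The snippet below is a much-simplified sketch of the two ingredients named above: a character convolution for local context and a GRU whose states are summarized with a global attention step. Layer sizes and the plain scaled dot-product attention are assumptions, not the paper's exact layers.

```python
# CAN-style sketch: local CNN over characters + GRU with global self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCAN(nn.Module):
    def __init__(self, n_chars, emb=64, hid=100, n_tags=9):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, hid, kernel_size=3, padding=1)  # adjacent chars
        self.gru = nn.GRU(hid, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * hid, n_tags)

    def forward(self, chars):
        e = self.embed(chars).transpose(1, 2)        # (batch, emb, time)
        local = torch.relu(self.conv(e)).transpose(1, 2)
        states, _ = self.gru(local)                  # (batch, time, 2*hid)
        scores = states @ states.transpose(1, 2) / states.size(-1) ** 0.5
        context = F.softmax(scores, dim=-1) @ states # sentence-level context
        return self.out(torch.cat([states, context], dim=-1))
```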

    Deep Neural Networks Ensemble for Detecting Medication Mentions in Tweets

    Objective: After years of research, Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step toward incorporating Twitter data in pharmacoepidemiological research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names may fail due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them. Methods: We present Kusuri, an Ensemble Learning classifier able to identify tweets mentioning drug products and dietary supplements. Kusuri ("medication" in Japanese) is composed of two modules. First, four different classifiers (lexicon-based, spelling-variant-based, pattern-based, and one based on a weakly-trained neural network) are applied in parallel to discover tweets potentially containing medication names. Second, an ensemble of deep neural networks encoding morphological, semantic, and long-range dependencies of important words in the discovered tweets makes the final decision. Results: On a balanced (50-50) corpus of 15,005 tweets, Kusuri demonstrated performance close to that of human annotators, with a 93.7% F1-score, the best score achieved thus far on this corpus. On a corpus made of all tweets posted by 113 Twitter users (98,959 tweets, with only 0.26% mentioning medications), Kusuri obtained a 76.3% F1-score. No prior drug extraction system has been evaluated on such an extremely unbalanced dataset. Conclusion: The system identifies tweets mentioning drug names with performance high enough to ensure its usefulness, and it is ready to be integrated into larger natural language processing systems. Comment: This is a pre-copy-editing, author-produced PDF of an article accepted for publication in JAMIA following peer review. The definitive publisher-authenticated version is "D. Weissenbacher, A. Sarker, A. Klein, K. O'Connor, A. Magge, G. Gonzalez-Hernandez, Deep neural networks ensemble for detecting medication mentions in tweets, Journal of the American Medical Informatics Association, ocz156, 2019"
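
    To make the two-stage organization concrete, here is a toy sketch: cheap detectors run in parallel to propose candidate tweets, and a second stage makes the final call. The lexicon, the pattern, and the stand-in second stage are all invented for illustration; Kusuri's real modules include a weakly-trained neural network and an ensemble of deep networks.

```python
# Two-stage medication-mention sketch: parallel detectors, then a decider.
import re

LEXICON = {"ibuprofen", "metformin", "vitamin d"}    # toy drug lexicon

def lexicon_detector(tweet):
    return any(name in tweet.lower() for name in LEXICON)

def pattern_detector(tweet):
    # Toy cue: "took/taking <word>" often surrounds medication names.
    return re.search(r"\b(took|taking)\s+\w+", tweet.lower()) is not None

def candidates(tweets):
    # Stage 1: union of detectors, tuned for recall over precision.
    return [t for t in tweets if lexicon_detector(t) or pattern_detector(t)]

def final_decision(tweet):
    # Stage 2 placeholder; the paper uses an ensemble of deep networks here.
    return lexicon_detector(tweet)

tweets = ["just took ibuprofen for this headache", "great game last night"]
flagged = [t for t in candidates(tweets) if final_decision(t)]
print(flagged)
```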

    Location reference identification from tweets during emergencies: A deep learning approach

    Twitter has recently been used during crises to communicate with officials and to provide rescue and relief operations in real time. The geographical location of the event, as well as of the users, is vitally important in such scenarios. Identifying the geographic location is challenging because the location information fields, such as the user's location and the tweet's place name, are not reliable. Extracting location information from tweet text is difficult because the text contains a lot of non-standard English, grammatical errors, spelling mistakes, non-standard abbreviations, and so on. This research aims to extract the location words used in a tweet with a Convolutional Neural Network (CNN) based model. We achieved an exact matching score of 0.929, a Hamming loss of 0.002, and an F1-score of 0.96 on tweets related to the earthquake. Our model was able to extract even three- to four-word location references, as evidenced by the exact matching score of over 92%. The findings of this paper can help in early event localization, emergency situations, real-time road traffic management, localized advertisement, and various location-based services.
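
    The two headline metrics are easy to state compactly; the sketch below computes exact match and Hamming loss over token-level location labels as those metrics are commonly defined, which may differ in detail from the paper's evaluation script. The labels and predictions are invented.

```python
# Exact match: fraction of tweets whose token labels are all correct.
# Hamming loss: fraction of individual token labels that are wrong.
def exact_match(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def hamming_loss(gold, pred):
    pairs = [(gt, pt) for g, p in zip(gold, pred) for gt, pt in zip(g, p)]
    return sum(gt != pt for gt, pt in pairs) / len(pairs)

# 1 marks a location token, 0 anything else; one list per tweet.
gold = [[0, 0, 1, 1], [1, 0, 0]]
pred = [[0, 0, 1, 1], [0, 0, 0]]
print(exact_match(gold, pred))    # 0.5: one of two tweets fully correct
print(hamming_loss(gold, pred))   # ~0.143: one wrong label out of seven
```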

    Stance Detection on Tweets: An SVM-based Approach

    Stance detection is a subproblem of sentiment analysis in which the stance of the author of a piece of natural language text toward a particular target (either explicitly stated in the text or not) is explored. The stance output is usually given as Favor, Against, or Neither. In this paper, we target stance detection on sports-related tweets and present the performance results of our SVM-based stance classifiers on such tweets. First, we describe three versions of our proprietary tweet data set annotated with stance information, all of which are made publicly available for research purposes. Next, we evaluate SVM classifiers using different feature sets for stance detection on this data set. The employed features are based on unigrams, bigrams, hashtags, external links, emoticons, and lastly, named entities. The results indicate that the joint use of features based on unigrams, hashtags, and named entities by SVM classifiers is a plausible approach for the stance detection problem on sports-related tweets. Comment: 13 pages
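
    A minimal scikit-learn sketch of this kind of pipeline: word n-gram features feeding a linear SVM over the three stance labels. The tweets are invented, and the paper's additional hashtag, link, emoticon, and named-entity features are omitted here.

```python
# Unigram/bigram TF-IDF features + linear SVM for three-way stance labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["our team will crush them tonight",        # toy training data
          "that transfer was a terrible decision",
          "kickoff is at 8pm"]
stances = ["Favor", "Against", "Neither"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(tweets, stances)
print(clf.predict(["what a terrible performance"]))
```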

    Understanding Scanned Receipts

    Tasking machines with understanding receipts can have important applications such as enabling detailed analytics on purchases, enforcing expense policies, and inferring patterns of purchase behavior on large collections of receipts. In this paper, we focus on the task of Named Entity Linking (NEL) of scanned receipt line items; specifically, the task entails associating shorthand text from OCR'd receipts with a knowledge base (KB) of grocery products. For example, the scanned item "STO BABY SPINACH" should be linked to the catalog item labeled "Simple Truth Organic Baby Spinach". Experiments that employ a variety of Information Retrieval techniques in combination with statistical phrase detection show promise for effective understanding of scanned receipt data. Comment: 8 pages, 3 figures, no conference submission
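
    One simple IR-style baseline for this linking task scores catalog entries against an OCR'd line item by character n-gram similarity, which tolerates shorthand such as "STO" for "Simple Truth Organic". The toy catalog and the TF-IDF-over-character-n-grams choice are illustrative assumptions, not the paper's method.

```python
# Link a receipt line item to the closest catalog entry by char n-gram TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = ["Simple Truth Organic Baby Spinach",      # toy knowledge base
           "Stouffer's Baby Back Ribs",
           "Organic Whole Milk"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
kb = vec.fit_transform(catalog)

def link(line_item, top_k=1):
    sims = cosine_similarity(vec.transform([line_item]), kb)[0]
    best = sims.argsort()[::-1][:top_k]
    return [(catalog[i], round(float(sims[i]), 3)) for i in best]

print(link("STO BABY SPINACH"))   # the spinach entry should rank first
```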