A Multi-task Approach for Named Entity Recognition in Social Media Data
Named Entity Recognition for social media data is challenging because of its
inherent noisiness. In addition to improper grammatical structures, it contains
spelling inconsistencies and numerous informal abbreviations. We propose a
novel multi-task approach by employing a more general secondary task of Named
Entity (NE) segmentation together with the primary task of fine-grained NE
categorization. The multi-task neural network architecture learns higher order
feature representations from word and character sequences along with basic
Part-of-Speech tags and gazetteer information. This neural network acts as a
feature extractor to feed a Conditional Random Fields classifier. We were able
to obtain the first position in the 3rd Workshop on Noisy User-generated Text
(WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.
Comment: EMNLP 2017 (W-NUT)
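As a concrete illustration of the multi-task setup, here is a minimal sketch assuming PyTorch; the encoder type, tag-set sizes, and head names are illustrative assumptions, and the POS, gazetteer, and CRF components the paper adds on top are omitted.

```python
# Minimal sketch of the multi-task setup: a shared encoder with one head
# for coarse NE segmentation and one for fine-grained categorization.
# Sizes and tag counts are illustrative; the paper additionally feeds
# POS tags and gazetteer features into a CRF on top of these states.
import torch
import torch.nn as nn

class MultiTaskNER(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=200,
                 n_seg_tags=3, n_cat_tags=13):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.seg_head = nn.Linear(2 * hidden, n_seg_tags)  # B/I/O segmentation
        self.cat_head = nn.Linear(2 * hidden, n_cat_tags)  # fine-grained types

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.seg_head(states), self.cat_head(states)

model = MultiTaskNER(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 12))       # two toy sentences
seg_logits, cat_logits = model(tokens)
loss = (nn.functional.cross_entropy(seg_logits.reshape(-1, 3),
                                    torch.randint(0, 3, (2 * 12,)))
        + nn.functional.cross_entropy(cat_logits.reshape(-1, 13),
                                      torch.randint(0, 13, (2 * 12,))))
loss.backward()                                 # both tasks update the shared encoder
```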
Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning
Named entity recognition, and other information extraction tasks, frequently
use linguistic features such as part-of-speech tags or chunking. For languages
where word boundaries are not readily identified in text, word segmentation is
a key first step to generating features for an NER system. While using word
boundary tags as features is helpful, the signals that aid in identifying
these boundaries may provide richer information for an NER system. New
state-of-the-art word segmentation systems use neural models to learn
representations for predicting word boundaries. We show that these same
representations, jointly trained with an NER system, yield significant
improvements in NER for Chinese social media. In our experiments, jointly
training NER and word segmentation with an LSTM-CRF model yields nearly 5%
absolute improvement over previously published results.
Comment: This is the camera-ready version of our ACL'16 paper. We also added
supplementary material containing the results of our systems on a cleaner
dataset (much higher F1 scores). For more information, please refer to the
repo https://github.com/hltcoe/golden-horse
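A minimal sketch of the core idea, assuming PyTorch: rather than discrete boundary tags, the hidden states of a neural segmenter are concatenated to the NER tagger's input and the two models are trained jointly. All dimensions are illustrative, and the CRF output layer used in the paper is omitted.

```python
# Sketch: hidden states of a neural word segmenter are concatenated to the
# character embeddings that feed the NER tagger; training both jointly lets
# segmentation signals shape the NER features. Dimensions are illustrative,
# and the CRF output layer used in the paper is omitted.
import torch
import torch.nn as nn

char_embed = nn.Embedding(5000, 64)
segmenter = nn.LSTM(64, 100, batch_first=True, bidirectional=True)
ner_tagger = nn.LSTM(64 + 200, 128, batch_first=True, bidirectional=True)
ner_out = nn.Linear(256, 7)                     # BIO tags for 3 entity types + O

chars = torch.randint(0, 5000, (4, 20))         # toy batch of character ids
emb = char_embed(chars)
seg_states, _ = segmenter(emb)                  # boundary-aware representations
ner_states, _ = ner_tagger(torch.cat([emb, seg_states], dim=-1))
logits = ner_out(ner_states)                    # per-character tag scores
```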
Cross Script Hindi English NER Corpus from Wikipedia
The text generated on social media platforms is essentially mixed lingual
text. The mixing of languages in any form creates considerable difficulty
for language processing systems. Moreover, advancements in language
processing research depend upon the availability of standard corpora. The
development of mixed lingual Indian Named Entity Recognition (NER) systems
is facing obstacles due to the unavailability of standard evaluation
corpora. Such corpora may be of a mixed lingual nature, in which text is
written using multiple languages but predominantly a single script. The
motivation of our work is to emphasize the automatic generation of such
corpora in order to encourage mixed lingual Indian NER. The paper presents
the preparation of a Cross Script Hindi-English Corpus from Wikipedia
category pages. The corpus is successfully annotated using the standard
CoNLL-2003 categories of PER, LOC, ORG, and MISC. Its evaluation is carried
out on a variety of machine learning algorithms, and favorable results are
achieved.
Comment: International Conference on Intelligent Data Communication
Technologies and Internet of Things (ICICI-2018)
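For concreteness, a CoNLL-2003-style corpus pairs each token with a BIO tag over PER, LOC, ORG, and MISC, with blank lines separating sentences; a minimal reader might look like the sketch below. The exact column layout of this particular corpus is an assumption.

```python
# Minimal reader for CoNLL-2003-style annotation: one token plus its BIO
# tag (e.g. "B-PER", "I-ORG", "O") per line, blank lines between sentences.
# The exact column layout of this corpus is an assumption.
def read_conll(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:                 # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append((parts[0], parts[-1]))  # (token, tag)
    if current:
        sentences.append(current)
    return sentences
```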
Neural Adaptation Layers for Cross-domain Named Entity Recognition
Recent research efforts have shown that neural architectures can be effective
in conventional information extraction tasks such as named entity recognition,
yielding state-of-the-art results on standard newswire datasets. However,
despite significant resources required for training such models, the
performance of a model trained on one domain typically degrades dramatically
when applied to a different domain, yet extracting entities from new emerging
domains such as social media can be of significant interest. In this paper, we
empirically investigate effective methods for conveniently adapting an
existing, well-trained neural NER model for a new domain. Unlike existing
approaches, we propose lightweight yet effective methods for performing domain
adaptation for neural models. Specifically, we introduce adaptation layers on
top of existing neural architectures, where no re-training using the source
domain data is required. We conduct extensive empirical studies and show that
our approach significantly outperforms state-of-the-art methods.
Comment: 11 pages, accepted as a long paper in EMNLP 2018
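One way to picture the adaptation-layer idea is the following sketch, assuming PyTorch: the source-domain encoder is frozen, a new recurrent layer and a target-domain output layer are stacked on top, and only those new parameters are trained, so source-domain data is never revisited. Layer types and sizes are assumptions, not the paper's exact design.

```python
# Sketch of an adaptation layer: freeze the source-domain encoder, stack a
# small new layer and a target-domain output layer on top, and train only
# the new parameters, so no source-domain data is needed. Layer types and
# sizes are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

source_encoder = nn.LSTM(100, 128, batch_first=True, bidirectional=True)
for p in source_encoder.parameters():
    p.requires_grad = False              # keep the well-trained weights fixed

adapt_layer = nn.LSTM(256, 128, batch_first=True, bidirectional=True)
target_out = nn.Linear(256, 9)           # target-domain tag set

x = torch.randn(2, 15, 100)              # toy batch of word representations
h, _ = source_encoder(x)
h, _ = adapt_layer(h)                    # only this part adapts to the new domain
logits = target_out(h)

trainable = list(adapt_layer.parameters()) + list(target_out.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```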
A survey of methods to ease the development of highly multilingual text mining applications
Multilingual text processing is useful because the information content found
in different languages is complementary, both regarding facts and opinions.
While Information Extraction and other text mining software can, in principle,
be developed for many languages, most text analysis tools have only been
applied to small sets of languages because the development effort per language
is large. Self-training tools obviously alleviate the problem, but even the
effort of providing training data and of manually tuning the results is usually
considerable. In this paper, we gather insights by various multilingual system
developers on how to minimise the effort of developing natural language
processing applications for many languages. We also explain the main guidelines
underlying our own effort to develop complex text mining software for tens of
languages. While these guidelines - most of all: extreme simplicity - can be
very restrictive and limiting, we believe we have shown the feasibility of the
approach through the development of the Europe Media Monitor (EMM) family of
applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex
media monitoring tools that process and analyse up to 100,000 online news
articles per day in between twenty and fifty languages. We will also touch upon
the kind of language resources that would make it easier for all to develop
highly multilingual text mining applications. We will argue that - to achieve
this - the most needed resources would be freely available, simple, parallel
and uniform multilingual dictionaries, corpora and software tools.
Comment: 22 pages. Published online on 12 October 2011
CAN-NER: Convolutional Attention Network for Chinese Named Entity Recognition
Named entity recognition (NER) in Chinese is essential but difficult because
of the lack of natural delimiters. Therefore, Chinese Word Segmentation (CWS)
is usually considered as the first step for Chinese NER. However, models based
on word-level embeddings and lexicon features often suffer from segmentation
errors and out-of-vocabulary (OOV) words. In this paper, we investigate a
Convolutional Attention Network called CAN for Chinese NER, which consists of a
character-based convolutional neural network (CNN) with local-attention layer
and a gated recurrent unit (GRU) with global self-attention layer to capture
the information from adjacent characters and sentence contexts. Moreover,
unlike other models, ours depends on no external resources such as lexicons
and uses small character embeddings, which makes it more practical.
Extensive experimental results show that our approach outperforms
state-of-the-art methods without word embedding and external lexicon resources
on different domain datasets including Weibo, MSRA and Chinese Resume NER
dataset.
Comment: This paper is accepted by NAACL-HLT 2019. The code is available at
https://github.com/microsoft/vert-papers/tree/master/papers/CAN-NER
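A rough sketch of the CAN shape, assuming PyTorch: a character CNN for local context, a bidirectional GRU over the sentence, and a global self-attention layer on top. The paper's local attention inside the convolution window is simplified away here, and all sizes are illustrative.

```python
# Rough sketch of the CAN shape: a character CNN for local context, a
# bidirectional GRU over the sentence, and global self-attention on top.
# The paper's local attention inside the CNN window is simplified away,
# and all sizes here are illustrative.
import torch
import torch.nn as nn

embed = nn.Embedding(6000, 64)
cnn = nn.Conv1d(64, 128, kernel_size=3, padding=1)     # adjacent-character context
gru = nn.GRU(128, 128, batch_first=True, bidirectional=True)
attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
out = nn.Linear(256, 9)                                # per-character tag scores

chars = torch.randint(0, 6000, (2, 30))                # toy character-id batch
h = cnn(embed(chars).transpose(1, 2)).transpose(1, 2)
h, _ = gru(h)
h, _ = attn(h, h, h)                                   # sentence-level self-attention
logits = out(h)
```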
Deep Neural Networks Ensemble for Detecting Medication Mentions in Tweets
Objective: After years of research, Twitter posts are now recognized as an
important source of patient-generated data, providing unique insights into
population health. A fundamental step to incorporating Twitter data in
pharmacoepidemiological research is to automatically recognize medication
mentions in tweets. Given that lexical searches for medication names may fail
due to misspellings or ambiguity with common words, we propose a more advanced
method to recognize them. Methods: We present Kusuri, an Ensemble Learning
classifier, able to identify tweets mentioning drug products and dietary
supplements. Kusuri ("medication" in Japanese) is composed of two modules.
First, four different classifiers (lexicon-based, spelling-variant-based,
pattern-based and one based on a weakly-trained neural network) are applied in
parallel to discover tweets potentially containing medication names. Second, an
ensemble of deep neural networks encoding morphological, semantic, and
long-range dependencies of important words in the discovered tweets is used to
make the final decision. Results: On a balanced (50-50) corpus of 15,005
tweets, Kusuri demonstrated performance close to that of human annotators, with a 93.7%
F1-score, the best score achieved thus far on this corpus. On a corpus made of
all tweets posted by 113 Twitter users (98,959 tweets, with only 0.26%
mentioning medications), Kusuri obtained a 76.3% F1-score. No prior drug
extraction system has been evaluated on such an extremely unbalanced dataset.
Conclusion: The system identifies tweets mentioning drug names with
performance high enough to ensure its usefulness, and it is ready to be
integrated into larger natural language processing systems.
Comment: This is a pre-copy-editing, author-produced PDF of an article
accepted for publication in JAMIA following peer review. The definitive
publisher-authenticated version is "D. Weissenbacher, A. Sarker, A. Klein, K.
O'Connor, A. Magge, G. Gonzalez-Hernandez, Deep neural networks ensemble for
detecting medication mentions in tweets, Journal of the American Medical
Informatics Association, ocz156, 2019".
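The two-module design can be sketched as a simple pipeline: several cheap, high-recall detectors run in parallel, and only tweets flagged by at least one of them reach the expensive neural ensemble for the final decision. The function names below are hypothetical placeholders, not Kusuri's actual interfaces.

```python
# Pipeline sketch of the two-module design: cheap, high-recall detectors
# run in parallel, and only tweets flagged by at least one of them reach
# the neural ensemble for the final decision. All function names here are
# hypothetical placeholders, not Kusuri's actual interfaces.
def flagged(tweet, detectors):
    """True if any first-stage detector suspects a medication mention."""
    return any(detect(tweet) for detect in detectors)

def classify(tweets, detectors, neural_ensemble):
    candidates = [t for t in tweets if flagged(t, detectors)]
    return {t: neural_ensemble(t) for t in candidates}

# Toy usage with trivial stand-ins for the detectors and the ensemble:
lexicon = lambda t: "aspirin" in t.lower()
print(classify(["Took aspirin today", "great weather"],
               [lexicon], neural_ensemble=lambda t: True))
```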
Location reference identification from tweets during emergencies: A deep learning approach
Twitter has recently been used during crises to communicate with officials
and to provide rescue and relief operations in real time. The geographical
location of the event, as well as of the users, is vitally important in such
scenarios. Identifying the geographic location is challenging because the
location fields of tweets, such as user location and place name, are not
reliable. The extraction of location information from tweet
text is difficult as it contains a lot of non-standard English, grammatical
errors, spelling mistakes, non-standard abbreviations, and so on. This research
aims to extract location words used in the tweet using a Convolutional Neural
Network (CNN) based model. We achieved an exact matching score of 0.929, a
Hamming loss of 0.002, and an F1-score of 0.96 for the tweets related to the
earthquake. Our model was able to extract even three- to four-word-long
location references, which is also evident from the exact matching score of
over 92%. The findings of this paper can help in early event localization,
emergency situations, real-time road traffic management, localized
advertising, and various location-based services.
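The three reported metrics can be reproduced on toy per-token labels (1 for a location word, 0 otherwise) with scikit-learn, where the exact matching score corresponds to multilabel subset accuracy:

```python
# The reported metrics on toy per-token labels (1 = location word, 0 = not),
# using scikit-learn: exact matching is multilabel subset accuracy.
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

gold = [[0, 1, 1, 0], [0, 0, 0, 1]]      # gold labels for two 4-token tweets
pred = [[0, 1, 1, 0], [0, 0, 1, 1]]      # one token mislabeled in tweet 2
print(accuracy_score(gold, pred))        # exact matching score: 0.5
print(hamming_loss(gold, pred))          # fraction of wrong labels: 0.125
print(f1_score(gold, pred, average="micro"))
```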
Stance Detection on Tweets: An SVM-based Approach
Stance detection is a subproblem of sentiment analysis where the stance of
the author of a piece of natural language text for a particular target (either
explicitly stated in the text or not) is explored. The stance output is usually
given as Favor, Against, or Neither. In this paper, we target stance
detection on sports-related tweets and present the performance results of our
SVM-based stance classifiers on such tweets. First, we describe three versions
of our proprietary tweet data set annotated with stance information, all of
which are made publicly available for research purposes. Next, we evaluate SVM
classifiers using different feature sets for stance detection on this data set.
The employed features are based on unigrams, bigrams, hashtags, external links,
emoticons, and lastly, named entities. The results indicate that joint use of
the features based on unigrams, hashtags, and named entities by SVM classifiers
is a plausible approach for the stance detection problem on sports-related tweets.
Comment: 13 pages
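A minimal sketch of such an SVM classifier over unigram features, assuming scikit-learn; hashtag and named-entity features would be appended as extra columns of the same feature matrix, and the tweets below are toy data.

```python
# Minimal stance classifier over unigram features with a linear SVM,
# assuming scikit-learn; hashtag and named-entity features would be added
# as extra columns of the same matrix. The tweets below are toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["great win for the team", "terrible refereeing again",
          "kickoff is at 8pm tonight"]
stances = ["Favor", "Against", "Neither"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(tweets, stances)
print(clf.predict(["what a great team"]))   # likely "Favor" on this toy set
```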
Understanding Scanned Receipts
Tasking machines with understanding receipts can have important applications
such as enabling detailed analytics on purchases, enforcing expense policies,
and inferring patterns of purchase behavior on large collections of receipts.
In this paper, we focus on the task of Named Entity Linking (NEL) of scanned
receipt line items; specifically, the task entails associating shorthand text
from OCR'd receipts with a knowledge base (KB) of grocery products. For
example, the scanned item "STO BABY SPINACH" should be linked to the catalog
item labeled "Simple Truth Organic Baby Spinach". Experiments that employ a
variety of Information Retrieval techniques in combination with statistical
phrase detection shows promise for effective understanding of scanned receipt
data.Comment: 8 pages, 3 figures, no conference submissio
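The linking step can be approximated with IR-style similarity search, as in this sketch assuming scikit-learn; character n-grams give some robustness to OCR shorthand ("STO" vs "Simple Truth Organic"), and the catalog is toy data, not the paper's knowledge base.

```python
# Linking an OCR'd line item to a product catalog by IR-style similarity,
# assuming scikit-learn; character n-grams give some robustness to the
# shorthand ("STO" vs "Simple Truth Organic"). The catalog is toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = ["Simple Truth Organic Baby Spinach",
           "Kroger Whole Milk",
           "Honeycrisp Apples"]
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
catalog_matrix = vectorizer.fit_transform(catalog)

def link(item):
    sims = cosine_similarity(vectorizer.transform([item]), catalog_matrix)[0]
    return catalog[sims.argmax()]

print(link("STO BABY SPINACH"))   # -> "Simple Truth Organic Baby Spinach"
```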