Lexicon Infused Phrase Embeddings for Named Entity Resolution
Most state-of-the-art approaches for named-entity recognition (NER) use
semi-supervised information in the form of word clusters and lexicons. Recently,
neural network-based language models have been explored because they generate, as
a byproduct, highly informative vector representations for words, known as word
embeddings. In this paper we present two contributions: a new form of learning
word embeddings that can leverage information from relevant lexicons to improve
the representations, and the first system to use neural word embeddings to
achieve state-of-the-art results on named-entity recognition in both CoNLL and
OntoNotes NER. Our system achieves an F1 score of 90.90 on the test set for
CoNLL 2003, significantly better than any previous system trained on public
data, and matching a system employing massive private industrial query-log
data.
Comment: Accepted in CoNLL 201
Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter
Social spam produces a great amount of noise on social media services such as
Twitter, which reduces the signal-to-noise ratio that both end users and data
mining applications observe. Existing techniques for social spam detection have
focused primarily on the identification of spam accounts by using extensive
historical and network-based data. In this paper we focus on the detection of
spam tweets, which optimises the amount of data that needs to be gathered by
relying only on tweet-inherent features. This enables the application of the
spam detection system to a large set of tweets in a timely fashion, potentially
applicable in a real-time or near real-time setting. Using two large
hand-labelled datasets of tweets containing spam, we study the suitability of
five classification algorithms and four different feature sets for the social
spam detection task. Our results show that, by using the limited set of
features readily available in a tweet, we can achieve encouraging results which
are competitive when compared against existing spammer detection systems that
make use of additional, costly user features. Our study is the first to
attempt to generalise conclusions about the optimal classifiers and feature
sets for social spam detection across different datasets.