Data Sets: Word Embeddings Learned from Tweets and General Data
A word embedding is a low-dimensional, dense and real-valued vector
representation of a word. Word embeddings have been used in many NLP tasks.
They are usually generated from a large text corpus. The embedding of a word
captures both its syntactic and semantic aspects. Tweets are short, noisy and
have unique lexical and semantic features that are different from other types
of text. Therefore, it is necessary to have word embeddings learned
specifically from tweets. In this paper, we present ten word embedding data
sets. In addition to the data sets learned from just tweet data, we also built
embedding sets from the general data and the combination of tweets with the
general data. The general data consist of news articles, Wikipedia data and
other web data. These ten embedding models were learned from about 400 million
tweets and 7 billion words from the general text. In this paper, we also
present two experiments demonstrating how to use the data sets in NLP tasks
such as tweet sentiment analysis and tweet topic classification.
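One common way to use word embeddings in a tweet classification task is to average the vectors of a tweet's tokens into a single fixed-length feature vector. The sketch below illustrates that idea only; it is not taken from the paper. The `TOY_EMBEDDINGS` table, its 3-dimensional values, the whitespace tokenizer, and the `tweet_vector` helper are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embeddings; the values are made up for illustration
# and do not come from the data sets described above.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 0.0],
    "movie": [0.0, 1.0, 0.0],
    "bad": [-1.0, 0.0, 0.0],
}


def tweet_vector(
    tweet: str, embeddings: Dict[str, List[float]], dim: int = 3
) -> List[float]:
    """Average the embeddings of known tokens; zero vector if none are known."""
    # Naive lowercase + whitespace tokenization (an assumption; real tweets
    # usually need a tweet-aware tokenizer).
    tokens = tweet.lower().split()
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    print(tweet_vector("Good movie", TOY_EMBEDDINGS))
```

The resulting vector could then be passed to any standard classifier; out-of-vocabulary tokens are simply skipped in this sketch.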
Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
NLP tasks are often limited by scarcity of manually annotated data. In social
media sentiment analysis and related tasks, researchers have therefore used
binarized emoticons and specific hashtags as forms of distant supervision. Our
paper shows that by extending the distant supervision to a more diverse set of
noisy labels, the models can learn richer representations. Through emoji
prediction on a dataset of 1246 million tweets containing one of 64 common
emojis we obtain state-of-the-art performance on 8 benchmark datasets within
sentiment, emotion and sarcasm detection using a single pretrained model. Our
analyses confirm that the diversity of our emotional labels yields a performance
improvement over previous distant supervision approaches.
Comment: Accepted at EMNLP 2017. Please include EMNLP in any citations. Minor
changes from the EMNLP camera-ready version. 9 pages + references and
supplementary material
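The distant supervision idea in this abstract, using an emoji in a tweet as a noisy label, can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the three-emoji `EMOJI_LABELS` list stands in for the 64 emojis mentioned above, the `make_example` helper is hypothetical, and discarding tweets that contain zero or several distinct label emojis is an assumption made to keep the example short.

```python
from typing import List, Optional, Tuple

# Hypothetical label set; the abstract mentions 64 common emojis.
EMOJI_LABELS: List[str] = ["😂", "😭", "😍"]


def make_example(
    tweet: str, labels: List[str] = EMOJI_LABELS
) -> Optional[Tuple[str, int]]:
    """Turn a tweet into (text without the emoji, label index), or None."""
    present = [i for i, emoji in enumerate(labels) if emoji in tweet]
    # Assumption: keep only tweets with exactly one distinct label emoji.
    if len(present) != 1:
        return None
    idx = present[0]
    # Remove the label emoji from the input so a model cannot read it directly,
    # then collapse the leftover whitespace.
    text = " ".join(tweet.replace(labels[idx], " ").split())
    return text, idx


if __name__ == "__main__":
    print(make_example("so funny 😂😂"))
```

A model trained to predict the label index from the stripped text learns from the emoji without any manual annotation.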
Transportation in Social Media: an automatic classifier for travel-related tweets
In recent years, researchers in the field of intelligent transportation
systems have made several efforts to extract valuable information from social
media streams. However, collecting domain-specific data from any social media
is a challenging task demanding appropriate and robust classification methods.
In this work we focus on exploring geo-located tweets in order to create a
travel-related tweet classifier using a combination of bag-of-words and word
embeddings. The resulting classification makes possible the identification of
interesting spatio-temporal relations in São Paulo and Rio de Janeiro
- …
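The abstract does not say how bag-of-words and word-embedding features are combined. One plausible approach, sketched here purely as an assumption, is to concatenate a bag-of-words count vector with the mean embedding of the tweet's tokens. The vocabulary `VOCAB`, the 2-dimensional `EMB` table, and the `combined_features` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical vocabulary and made-up 2-dimensional embeddings.
VOCAB: List[str] = ["bus", "traffic", "train"]
EMB: Dict[str, List[float]] = {
    "bus": [1.0, 0.0],
    "traffic": [0.0, 1.0],
    "train": [1.0, 1.0],
}


def combined_features(
    tweet: str,
    vocab: List[str] = VOCAB,
    emb: Dict[str, List[float]] = EMB,
    dim: int = 2,
) -> List[float]:
    """Concatenate bag-of-words counts with the mean token embedding."""
    tokens = tweet.lower().split()
    # Bag-of-words part: one count per vocabulary word.
    bow = [float(tokens.count(word)) for word in vocab]
    # Embedding part: mean of the vectors of known tokens (zeros if none).
    known = [emb[t] for t in tokens if t in emb]
    if known:
        mean = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    else:
        mean = [0.0] * dim
    return bow + mean


if __name__ == "__main__":
    print(combined_features("Bus stuck in traffic"))
```

The concatenated vector has length `len(vocab) + dim` and could be fed to a standard classifier to separate travel-related tweets from others.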