Unleashing the Power of Hashtags in Tweet Analytics with Distributed Framework on Apache Storm
Twitter is a popular social network platform where users can interact and
post texts of up to 280 characters called tweets. Hashtags, hyperlinked words
in tweets, have increasingly become crucial for tweet retrieval and search.
Using hashtags for tweet topic classification is a challenging problem because
of context dependence among words, slang, abbreviations, and emoticons in a short
tweet, along with the evolving use of hashtags. Since Twitter generates millions of
tweets daily, tweet analytics is a fundamental big data stream problem that
often requires real-time distributed processing. This paper proposes a
distributed online approach to tweet topic classification with hashtags. Being
implemented on Apache Storm, a distributed real-time framework, our approach
incrementally identifies and updates a set of strong predictors in the Naïve
Bayes model for classifying each incoming tweet instance. Preliminary
experiments show promising results, with up to 97% accuracy and a 37% increase in
throughput on eight processors.
Comment: IEEE International Conference on Big Data 201
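The incremental Naïve Bayes idea in the abstract above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: it omits Apache Storm and the selection of strong predictors, and the class name `OnlineNaiveBayes`, the Laplace smoothing parameter `alpha`, and the toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated one tweet at a time.

    Hypothetical illustration only; not the classifier from the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant (assumed)
        self.class_counts = Counter()           # tweets seen per topic
        self.word_counts = defaultdict(Counter) # token counts per topic
        self.total_words = Counter()            # total tokens per topic
        self.vocab = set()                      # all tokens seen so far

    def update(self, tokens, label):
        """Fold one labelled tweet into the running counts."""
        self.class_counts[label] += 1
        for token in tokens:
            self.word_counts[label][token] += 1
            self.total_words[label] += 1
            self.vocab.add(token)

    def predict(self, tokens):
        """Return the topic with the highest log posterior, or None if untrained."""
        n_tweets = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_tweets)  # log prior
            denom = self.total_words[label] + self.alpha * vocab_size
            for token in tokens:
                seen = self.word_counts[label][token]
                score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because `update` only increments counters, each incoming tweet can be classified with `predict` and then folded into the model without retraining from scratch.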
TK: The Twitter Top-K Keywords Benchmark
Information retrieval from textual data focuses on the construction of
vocabularies that contain weighted term tuples. Such vocabularies can then be
exploited by various text analysis algorithms to extract new knowledge, e.g.,
top-k keywords, top-k documents, etc. Top-k keywords are routinely used for
various purposes, are often computed on-the-fly, and thus must be efficiently
computed. To compare competing weighting schemes and database implementations,
benchmarking is customary. To the best of our knowledge, no benchmark currently
addresses these problems. Hence, in this paper, we present a top-k keywords
benchmark, TK, which features a real tweet dataset and queries with
various complexities and selectivities. TK helps evaluate weighting
schemes and database implementations in terms of computing performance. To
illustrate TK's relevance and genericity, we successfully performed
tests on the TF-IDF and Okapi BM25 weighting schemes, on one hand, and on
different relational (Oracle, PostgreSQL) and document-oriented (MongoDB)
database implementations, on the other hand.
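As a rough illustration of the two weighting schemes named in the abstract above, the sketch below computes top-k keywords over a small in-memory corpus. It is not the TK benchmark's code or query set. The function names, the choice to rank a term by the sum of its weights across documents, the non-negative BM25 IDF variant, and the defaults `k1=1.2` and `b=0.75` are all assumptions.

```python
import heapq
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def tfidf_weights(docs):
    """Sum TF-IDF over all documents, with tf = f/len(doc) and idf = log(N/df)."""
    totals = Counter()
    if not docs:
        return totals
    n, df = len(docs), document_frequencies(docs)
    for doc in docs:
        for term, f in Counter(doc).items():
            totals[term] += (f / len(doc)) * math.log(n / df[term])
    return totals


def bm25_weights(docs, k1=1.2, b=0.75):
    """Sum Okapi BM25 term weights over all documents (assumed IDF variant)."""
    totals = Counter()
    if not docs:
        return totals
    n, df = len(docs), document_frequencies(docs)
    avgdl = sum(len(doc) for doc in docs) / n  # average document length
    for doc in docs:
        # Length normalisation term for this document
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        for term, f in Counter(doc).items():
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            totals[term] += idf * f * (k1 + 1) / (f + norm)
    return totals


def top_k(weights, k):
    """Return the k (term, weight) pairs with the largest weights."""
    return heapq.nlargest(k, weights.items(), key=lambda kv: kv[1])
```

For example, `top_k(tfidf_weights(docs), 10)` and `top_k(bm25_weights(docs), 10)` return the ten highest-weighted terms under each scheme for a list of tokenized tweets `docs`.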
Multitask Learning for Fine-Grained Twitter Sentiment Analysis
Traditional sentiment analysis approaches tackle problems like ternary
(3-category) and fine-grained (5-category) classification by learning the tasks
separately. We argue that such classification tasks are correlated and we
propose a multitask approach based on a recurrent neural network that benefits
from learning them jointly. Our study demonstrates the potential of multitask
models on this type of problem and improves the state-of-the-art results in
the fine-grained sentiment classification problem.
Comment: International ACM SIGIR Conference on Research and Development in
Information Retrieval 201
Comparative Studies of Detecting Abusive Language on Twitter
The context-dependent nature of online aggression makes annotating large
collections of data extremely difficult. Previously studied datasets in abusive
language detection have been insufficient in size to efficiently train deep
learning models. Recently, Hate and Abusive Speech on Twitter, a dataset much
greater in size and reliability, has been released. However, this dataset has
not yet been studied to its full potential. In this paper, we conduct
the first comparative study of various learning models on Hate and Abusive
Speech on Twitter, and discuss the possibility of using additional features and
context data for improvements. Experimental results show that bidirectional GRU
networks trained on word-level features, with Latent Topic Clustering modules,
are the most accurate model, scoring 0.805 F1.
Comment: ALW2: 2nd Workshop on Abusive Language Online to be held at EMNLP
2018 (Brussels, Belgium), October 31st, 201