Towards Real-Time, Country-Level Location Classification of Worldwide Tweets
In contrast to much previous work that has focused on location classification
of tweets restricted to a specific country, here we undertake the task in a
broader context by classifying global tweets at the country level, which is so
far unexplored in a real-time scenario. We analyse the extent to which a
tweet's country of origin can be determined by making use of eight
tweet-inherent features for classification. Furthermore, we use two datasets,
collected a year apart from each other, to analyse the extent to which a model
trained from historical tweets can still be leveraged for classification of new
tweets. With classification experiments on all 217 countries in our datasets,
as well as on the top 25 countries, we offer some insights into the best use of
tweet-inherent features for an accurate country-level classification of tweets.
We find that the use of a single feature, such as tweet content alone (the
most widely used feature in previous work), leaves much to be desired.
Choosing an appropriate combination of both tweet content and metadata can
lead to substantial improvements of between 20% and 50%. We
observe that tweet content, the user's self-reported location and the user's
real name, all of which are inherent in a tweet and available in a real-time
scenario, are particularly useful to determine the country of origin. We also
experiment on the applicability of a model trained on historical tweets to
classify new tweets, finding that the choice of a particular combination of
features whose utility does not fade over time can actually lead to comparable
performance, avoiding the need to retrain. However, the difficulty of achieving
accurate classification increases slightly for countries with multiple
commonalities, especially for English- and Spanish-speaking countries.
Comment: Accepted for publication in IEEE Transactions on Knowledge and Data
Engineering (IEEE TKDE)
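One way to picture combining tweet content with metadata is a bag of field-prefixed tokens, so that the same word in the tweet text and in the self-reported location counts as two different features. The sketch below is only an illustration under that assumption: the field names (content, user_location, user_name), the function combined_features and the toy tweet are hypothetical, and this is not the authors' feature pipeline.

```python
from collections import Counter


def combined_features(tweet, fields=("content", "user_location", "user_name")):
    """Build a bag of field-prefixed tokens from the chosen tweet fields.

    The field names are hypothetical; a missing or empty field adds nothing.
    """
    features = Counter()
    for field in fields:
        text = tweet.get(field) or ""
        for token in text.lower().split():
            # Prefix with the field name so "madrid" in the location field
            # is a different feature from "madrid" in the tweet text.
            features[f"{field}={token}"] += 1
    return features


# Made-up example tweet, for illustration only.
toy_tweet = {
    "content": "Lovely morning in the city",
    "user_location": "Madrid",
    "user_name": "Ana Garcia",
}

content_only = combined_features(toy_tweet, fields=("content",))
content_plus_metadata = combined_features(toy_tweet)
```

The resulting counters could be passed to any classifier that accepts sparse token counts; comparing content_only with content_plus_metadata mirrors the single-feature versus combined-feature comparison described in the abstract.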
Exploring Social Media for Event Attendance
Large popular events are nowadays well reflected in social media fora (e.g. Twitter), where people discuss their interest in participating in the events. In this paper we propose to exploit the content of non-geotagged posts in social media to build machine-learned classifiers able to infer users' attendance of large events in three temporal periods: before, during and after an event. The categories of features used to train the classifier reflect four different dimensions of social media: textual, temporal, social, and multimedia content. We detail the approach followed to design the feature space and report on experiments conducted on two large music festivals in the UK, namely the VFestival and Creamfields events. Our attendance classifier attains very high accuracy; the highest result, observed on the Creamfields dataset, is ~87% accuracy in classifying users who will participate in the event.
You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information
Metadata are associated with most of the information we produce in our daily
interactions and communication in the digital world. Yet, surprisingly,
metadata are often still categorized as non-sensitive. Indeed, in the past,
researchers and practitioners have mainly focused on the problem of the
identification of a user from the content of a message.
In this paper, we use Twitter as a case study to quantify the uniqueness of
the association between metadata and user identity and to understand the
effectiveness of potential obfuscation strategies. More specifically, we
analyze atomic fields in the metadata and systematically combine them in an
effort to classify new tweets as belonging to an account using different
machine learning algorithms of increasing complexity. We demonstrate that
through the application of a supervised learning algorithm, we are able to
identify any user in a group of 10,000 with approximately 96.7% accuracy.
Moreover, if we broaden the scope of our search and consider the 10 most likely
candidates, we increase the accuracy of the model to 99.22%. We also found that
data obfuscation is hard and ineffective for this type of data: even after
perturbing 60% of the training data, it is still possible to classify users
with an accuracy higher than 95%. These results have strong implications in
terms of the design of metadata obfuscation strategies, for example for data
set release, not only for Twitter, but, more generally, for most social media
platforms.
Comment: 11 pages, 13 figures. Published in the Proceedings of the 12th
International AAAI Conference on Web and Social Media (ICWSM 2018). June
2018. Stanford, CA, US
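The "10 most likely candidates" figure above is a top-k accuracy: a prediction counts as correct when the true account appears among the k highest-scored candidates. A minimal sketch, assuming each tweet comes with a dictionary of per-account scores; the function name, data layout and toy values are hypothetical and not taken from the paper.

```python
def top_k_accuracy(scores, truths, k):
    """Fraction of samples whose true label is among the k highest-scored labels.

    scores: list of dicts mapping candidate label -> score (assumed layout)
    truths: list of true labels, same length as scores
    """
    if len(scores) != len(truths):
        raise ValueError("scores and truths must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    hits = 0
    for sample_scores, truth in zip(scores, truths):
        # Rank candidate labels from highest to lowest score.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Made-up scores for three tweets and three candidate accounts.
toy_scores = [
    {"alice": 0.7, "bob": 0.2, "carol": 0.1},
    {"alice": 0.4, "bob": 0.5, "carol": 0.1},
    {"alice": 0.3, "bob": 0.1, "carol": 0.6},
]
toy_truths = ["alice", "alice", "bob"]

top1 = top_k_accuracy(toy_scores, toy_truths, k=1)
top2 = top_k_accuracy(toy_scores, toy_truths, k=2)
```

Because a larger k can only add hits, top-k accuracy never decreases as k grows, which is consistent with the move from the top-1 to the top-10 figure reported above.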
Social Bots for Online Public Health Interventions
According to the Centers for Disease Control and Prevention, in the United
States hundreds of thousands of people start smoking each year, and millions live with
smoking-related diseases. Many tobacco users discuss their habits and
preferences on social media. This work conceptualizes a framework for targeted
health interventions to inform tobacco users about the consequences of tobacco
use. We designed a Twitter bot named Notobot (short for No-Tobacco Bot) that
leverages machine learning to identify users posting pro-tobacco tweets and
select individualized interventions to address their interest in tobacco use.
We searched the Twitter feed for tobacco-related keywords and phrases, and
trained a convolutional neural network using over 4,000 tweets manually
labeled as either pro-tobacco or not pro-tobacco. This model achieves
a 90% recall rate on the training set and 74% on test data. Users posting
pro-tobacco tweets are matched with former smokers with similar interests who
posted anti-tobacco tweets. Algorithmic matching, based on the power of peer
influence, allows for the systematic delivery of personalized interventions
based on real anti-tobacco tweets from former smokers. Experimental evaluation
suggests that our system would perform well if deployed. This research offers
opportunities for public health researchers to increase health awareness at
scale. Future work entails deploying the fully operational Notobot system in a
controlled experiment within a public health campaign.
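The recall rates quoted above measure the share of truly pro-tobacco tweets that the classifier flags, i.e. true positives divided by true positives plus false negatives. A minimal sketch with made-up labels (1 = pro-tobacco, 0 = not pro-tobacco); the function and data are illustrative assumptions, not the Notobot evaluation code.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the given positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if true_pos + false_neg == 0:
        raise ValueError("recall is undefined without positive examples")
    return true_pos / (true_pos + false_neg)


# Made-up labels: four pro-tobacco tweets, of which three are detected.
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0]
toy_recall = recall(toy_true, toy_pred)
```

Recall ignores false positives (the fifth toy tweet), so it is usually read alongside precision when judging how many interventions would reach the wrong users.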
Organized Behavior Classification of Tweet Sets using Supervised Learning Methods
During the 2016 US elections, Twitter experienced unprecedented levels of
propaganda and fake news through the collaboration of bots and hired persons,
the ramifications of which are still being debated. This work proposes an
approach to identify the presence of organized behavior in tweets. The Random
Forest, Support Vector Machine, and Logistic Regression algorithms are each
used to train a model with a data set of 850 records consisting of 299 features
extracted from tweets gathered during the 2016 US presidential election. The
features represent user and temporal synchronization characteristics to capture
coordinated behavior. These models are trained to classify tweet sets among the
categories: organic vs organized, political vs non-political, and pro-Trump vs
pro-Hillary vs neither. The random forest algorithm performs best, with
greater than 95% average accuracy and F-measure scores for each category. The
most valuable features for classification are identified as user-based
features, with media use and marking tweets as favorite being the most
dominant.
Comment: 51 pages, 5 figures