7,852 research outputs found

    Towards Real-Time, Country-Level Location Classification of Worldwide Tweets

    Get PDF
    In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyse the extent to which a tweet's country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyse the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as the use of tweet content alone -- the most widely used feature in previous work -- leaves much to be desired. Choosing an appropriate combination of both tweet content and metadata can actually lead to substantial improvements of between 20\% and 50\%. We observe that tweet content, the user's self-reported location and the user's real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful to determine the country of origin. We also experiment on the applicability of a model trained on historical tweets to classify new tweets, finding that the choice of a particular combination of features whose utility does not fade over time can actually lead to comparable performance, avoiding the need to retrain. However, the difficulty of achieving accurate classification increases slightly for countries with multiple commonalities, especially for English and Spanish speaking countries.Comment: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE

    Exploring Social Media for Event Attendance

    Get PDF
    Large popular events are nowadays well reflected in social media fora (e.g. Twitter), where people discuss their interest in participating in the events. In this paper we propose to exploit the content of non-geotagged posts in social media to build machine-learned classifiers able to infer users' attendance of large events in three temporal periods: before, during and after an event. The categories of features used to train the classifier reflect four different dimensions of social media: textual, temporal, social, and multimedia content. We detail the approach followed to design the feature space and report on experiments conducted on two large music festivals in the UK, namely the VFestival and Creamfields events. Our attendance classifier attains very high accuracy with the highest result observed for the Creamfields dataset ~87% accuracy to classify users that will participate in the event

    You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information

    Get PDF
    Metadata are associated to most of the information we produce in our daily interactions and communication in the digital world. Yet, surprisingly, metadata are often still catergorized as non-sensitive. Indeed, in the past, researchers and practitioners have mainly focused on the problem of the identification of a user from the content of a message. In this paper, we use Twitter as a case study to quantify the uniqueness of the association between metadata and user identity and to understand the effectiveness of potential obfuscation strategies. More specifically, we analyze atomic fields in the metadata and systematically combine them in an effort to classify new tweets as belonging to an account using different machine learning algorithms of increasing complexity. We demonstrate that through the application of a supervised learning algorithm, we are able to identify any user in a group of 10,000 with approximately 96.7% accuracy. Moreover, if we broaden the scope of our search and consider the 10 most likely candidates we increase the accuracy of the model to 99.22%. We also found that data obfuscation is hard and ineffective for this type of data: even after perturbing 60% of the training data, it is still possible to classify users with an accuracy higher than 95%. These results have strong implications in terms of the design of metadata obfuscation strategies, for example for data set release, not only for Twitter, but, more generally, for most social media platforms.Comment: 11 pages, 13 figures. Published in the Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM 2018). June 2018. Stanford, CA, US

    Social Bots for Online Public Health Interventions

    Full text link
    According to the Center for Disease Control and Prevention, in the United States hundreds of thousands initiate smoking each year, and millions live with smoking-related dis- eases. Many tobacco users discuss their habits and preferences on social media. This work conceptualizes a framework for targeted health interventions to inform tobacco users about the consequences of tobacco use. We designed a Twitter bot named Notobot (short for No-Tobacco Bot) that leverages machine learning to identify users posting pro-tobacco tweets and select individualized interventions to address their interest in tobacco use. We searched the Twitter feed for tobacco-related keywords and phrases, and trained a convolutional neural network using over 4,000 tweets dichotomously manually labeled as either pro- tobacco or not pro-tobacco. This model achieves a 90% recall rate on the training set and 74% on test data. Users posting pro- tobacco tweets are matched with former smokers with similar interests who posted anti-tobacco tweets. Algorithmic matching, based on the power of peer influence, allows for the systematic delivery of personalized interventions based on real anti-tobacco tweets from former smokers. Experimental evaluation suggests that our system would perform well if deployed. This research offers opportunities for public health researchers to increase health awareness at scale. Future work entails deploying the fully operational Notobot system in a controlled experiment within a public health campaign

    Organized Behavior Classification of Tweet Sets using Supervised Learning Methods

    Full text link
    During the 2016 US elections Twitter experienced unprecedented levels of propaganda and fake news through the collaboration of bots and hired persons, the ramifications of which are still being debated. This work proposes an approach to identify the presence of organized behavior in tweets. The Random Forest, Support Vector Machine, and Logistic Regression algorithms are each used to train a model with a data set of 850 records consisting of 299 features extracted from tweets gathered during the 2016 US presidential election. The features represent user and temporal synchronization characteristics to capture coordinated behavior. These models are trained to classify tweet sets among the categories: organic vs organized, political vs non-political, and pro-Trump vs pro-Hillary vs neither. The random forest algorithm performs better with greater than 95% average accuracy and f-measure scores for each category. The most valuable features for classification are identified as user based features, with media use and marking tweets as favorite to be the most dominant.Comment: 51 pages, 5 figure
    corecore