    Stance Prediction for Russian: Data and Analysis

    Stance detection is a critical component of rumour and fake news identification. It involves extracting the stance a particular author takes toward a given claim, where both the stance and the claim are expressed in text. This paper investigates stance classification for Russian. It introduces a new dataset, RuStance, of Russian tweets and news comments from multiple sources, covering multiple stories, together with text classification approaches to stance detection that serve as benchmarks over this data. As well as presenting this openly available dataset, the first of its kind for Russian, the paper presents a baseline for stance prediction in the language.

    Extracting News Events from Microblogs

    The Twitter stream has become a large source of information for many people, but the volume of tweets and the noisy nature of their content have long made harvesting knowledge from Twitter a challenging task for researchers. Aiming to overcome some of the main challenges of extracting the hidden information from tweet streams, this work proposes a new approach for real-time detection of news events from the Twitter stream. We divide our approach into three steps. The first step is to use a neural network or deep learning to detect news-relevant tweets from the stream. The second step is to apply a novel streaming data clustering algorithm to the detected news tweets to form news events. The third and final step is to rank the detected events based on the size of the event clusters and the growth speed of the tweet frequencies. We evaluate the proposed system on a large, publicly available corpus of annotated news events from Twitter. As part of the evaluation, we compare our approach with a related state-of-the-art solution. Overall, our experiments and user-based evaluation show that our approach to detecting current (real) news events delivers state-of-the-art performance.
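
The third step above (ranking events by cluster size and growth speed) can be illustrated with a minimal sketch. The abstract does not say how the two signals are measured or combined, so everything here is an assumption made for illustration: the `EventCluster` type, the two-window definition of growth speed, the weighted sum, and the `window` and `growth_weight` parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventCluster:
    """A detected news event: a label plus the posting times of its tweets."""
    label: str
    timestamps: List[float]  # seconds since an arbitrary reference point


def growth_speed(cluster: EventCluster, now: float, window: float = 600.0) -> float:
    """Change in tweet rate: latest window minus the window before it (assumed definition)."""
    recent = sum(1 for t in cluster.timestamps if now - window < t <= now)
    previous = sum(
        1 for t in cluster.timestamps if now - 2 * window < t <= now - window
    )
    return (recent - previous) / window


def rank_events(
    clusters: List[EventCluster],
    now: float,
    window: float = 600.0,
    growth_weight: float = 1000.0,
) -> List[EventCluster]:
    """Sort clusters by size plus weighted growth speed (hypothetical scoring rule)."""

    def score(cluster: EventCluster) -> float:
        size = len(cluster.timestamps)
        return size + growth_weight * growth_speed(cluster, now, window)

    return sorted(clusters, key=score, reverse=True)


# Usage: a small but fast-growing event can outrank a larger, stale one.
stale = EventCluster("stale", [0.0, 1.0, 2.0, 3.0, 4.0])
fresh = EventCluster("fresh", [1900.0, 1950.0, 1990.0])
ranking = [c.label for c in rank_events([stale, fresh], now=2000.0)]
```

With a weighted sum, the weight decides how much recent activity compensates for a small cluster; a real system would need to tune it or use a different combination altogether.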

    Real-Time Classification of Twitter Trends

    Social media users give rise to social trends as they share about common interests, which can be triggered by different reasons. In this work, we explore the types of triggers that spark trends on Twitter, introducing a typology with the following four types: 'news', 'ongoing events', 'memes', and 'commemoratives'. While previous research has analyzed trending topics over the long term, we look at the earliest tweets that produce a trend, with the aim of categorizing trends early on. This would make it possible to provide a filtered subset of trends to end users. We analyze and experiment with a set of straightforward language-independent features based on the social spread of trends to categorize them into the introduced typology. Our method provides an efficient way to accurately categorize trending topics without the need for external data, enabling news organizations to discover breaking news in real time, or to quickly identify viral memes that might enrich marketing decisions, among other uses. The analysis of social features also reveals patterns associated with each type of trend, such as tweets about ongoing events being shorter, as many were likely sent from mobile devices, or memes having more retweets originating from a few trend-setters. Comment: Pre-print of article accepted for publication in Journal of the American Society for Information Science and Technology copyright © 2013 (American Society for Information Science and Technology)
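
To make "language-independent features based on the social spread of trends" more concrete, here is a small sketch of three such features, loosely motivated by the patterns the abstract mentions (tweet length, retweets concentrated among a few trend-setters). The feature names, the input dictionary keys, and the `top_k` parameter are assumptions for illustration; the abstract does not list the actual feature set.

```python
from collections import Counter
from typing import Dict, List, Optional, TypedDict


class Tweet(TypedDict):
    text: str
    is_retweet: bool
    original_author: Optional[str]  # author of the retweeted tweet, if any


def trend_features(tweets: List[Tweet], top_k: int = 3) -> Dict[str, float]:
    """Compute simple spread features that need no language-specific processing."""
    n = len(tweets)
    if n == 0:
        return {"avg_length": 0.0, "retweet_ratio": 0.0, "top_author_share": 0.0}

    # Average tweet length in characters.
    avg_length = sum(len(t["text"]) for t in tweets) / n

    # Fraction of the trend's tweets that are retweets.
    retweets = [t for t in tweets if t["is_retweet"]]
    retweet_ratio = len(retweets) / n

    # Share of retweets whose original author is among the top_k most retweeted.
    counts = Counter(t["original_author"] for t in retweets)
    top_total = sum(count for _, count in counts.most_common(top_k))
    top_author_share = top_total / len(retweets) if retweets else 0.0

    return {
        "avg_length": avg_length,
        "retweet_ratio": retweet_ratio,
        "top_author_share": top_author_share,
    }


# Usage with a toy trend of four tweets.
sample: List[Tweet] = [
    {"text": "abcd", "is_retweet": True, "original_author": "u1"},
    {"text": "ab", "is_retweet": True, "original_author": "u1"},
    {"text": "abcdef", "is_retweet": True, "original_author": "u2"},
    {"text": "abcd", "is_retweet": False, "original_author": None},
]
features = trend_features(sample, top_k=1)
```

Feature vectors of this kind could then be passed to any standard classifier to assign one of the four trend types.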

    Arabic sentence-level sentiment analysis

    Sentiment analysis has recently become one of the growing areas of research related to text mining and natural language processing. The increasing availability of online resources and the popularity of rich and fast channels for opinion sharing, such as news, online review sites and personal blogs, have led several parties, including customers, companies, and governments, to start analyzing and exploring these opinions. The main task of sentiment classification is to classify a sentence (i.e. review, blog, comment, news, etc.) as holding an overall positive, negative or neutral sentiment. Most current studies on this topic focus mainly on English texts, with very limited resources available for other languages like Arabic, especially for the Egyptian dialect. In this research work, we aim to improve the performance measures of Egyptian dialect sentence-level sentiment analysis by proposing a hybrid approach which combines the machine learning approach, using support vector machines, with the semantic orientation approach. Two methodologies were proposed, one for each approach, which were then joined to create the proposed hybrid approach. The corpus used contains more than 20,000 Egyptian dialect tweets collected from Twitter, of which 4,800 manually annotated tweets are used (1,600 positive tweets, 1,600 negative tweets and 1,600 neutral tweets). We performed several experiments to: 1) compare the results of each approach individually on the Egyptian dialect before and after preprocessing; 2) compare the performance of the hybrid approach, which merges both approaches, against the performance of each approach separately; and 3) evaluate the effect of considering negation on the performance of the hybrid approach. The results obtained show significant improvements in terms of accuracy, precision, recall and F-measure, indicating that our proposed hybrid approach is effective in sentence-level sentiment classification. The results are also very promising, which encourages continuing in this line of research.
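
As a rough illustration of the semantic orientation side of such a hybrid, and of why negation handling can change the outcome, the sketch below scores a sentence against a polarity lexicon and optionally flips the polarity of the next sentiment word after a negator. It covers only the lexicon component, not the SVM or the way the two are joined. The tiny English lexicon, the negator list, and the flipping rule are all assumptions for illustration; the paper works with Egyptian dialect resources that the abstract does not describe.

```python
# Toy English lexicon and negators, purely hypothetical stand-ins.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}
NEGATORS = {"not", "never", "no"}


def semantic_orientation(sentence: str, handle_negation: bool = True) -> int:
    """Sum word polarities; a negator flips the next sentiment-bearing word."""
    score = 0
    negate = False
    for token in sentence.lower().split():
        if handle_negation and token in NEGATORS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = int(token in POSITIVE) - int(token in NEGATIVE)
        if polarity != 0:
            score += -polarity if negate else polarity
            negate = False  # the negation has been consumed
    return score


def label(score: int) -> str:
    """Map a numeric orientation score to one of the three sentiment classes."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Usage: the same sentence with and without negation handling.
with_negation = label(semantic_orientation("this is not very good"))
without_negation = label(semantic_orientation("this is not very good", handle_negation=False))
```

In a hybrid setup, a score like this could be used as an extra feature for the SVM or as a fallback when the classifier is uncertain; which of these the authors do is not stated in the abstract.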

    Organized Behavior Classification of Tweet Sets using Supervised Learning Methods

    During the 2016 US elections, Twitter experienced unprecedented levels of propaganda and fake news through the collaboration of bots and hired persons, the ramifications of which are still being debated. This work proposes an approach to identify the presence of organized behavior in tweets. The Random Forest, Support Vector Machine, and Logistic Regression algorithms are each used to train a model with a data set of 850 records consisting of 299 features extracted from tweets gathered during the 2016 US presidential election. The features represent user and temporal synchronization characteristics to capture coordinated behavior. These models are trained to classify tweet sets among the categories: organic vs organized, political vs non-political, and pro-Trump vs pro-Hillary vs neither. The Random Forest algorithm performs best, with greater than 95% average accuracy and F-measure scores for each category. The most valuable features for classification are identified as user-based features, with media use and marking tweets as favorites being the most dominant. Comment: 51 pages, 5 figures
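
For readers less familiar with the reported metrics, the following sketch shows how accuracy and F-measure can be computed for one of the classification tasks. The abstract does not say how scores were averaged, so the macro average over classes used here is an assumption, and the labels and predictions are made-up toy values rather than results from the paper.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_measure(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f_measure(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels: List[str] = sorted(set(y_true) | set(y_pred))
    return sum(f_measure(y_true, y_pred, lab) for lab in labels) / len(labels)


# Usage with toy labels for the organic vs organized task.
truth = ["organic", "organic", "organized", "organized"]
predicted = ["organic", "organized", "organized", "organized"]
acc = accuracy(truth, predicted)
macro_f1 = macro_f_measure(truth, predicted)
```

The same functions apply unchanged to the three-class task (pro-Trump vs pro-Hillary vs neither), since the macro average simply iterates over whatever labels occur.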

    Tracking Dengue Epidemics using Twitter Content Classification and Topic Modelling

    Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue and Zika in Brazil and other tropical regions has long been a priority for governments in affected areas. Streaming social media content, such as Twitter, is increasingly being used for health vigilance applications such as flu detection. However, previous work has not addressed the complexity of drastic seasonal changes in Twitter content across multiple epidemic outbreaks. To address this gap, this paper contrasts two complementary approaches to detecting Twitter content that is relevant for Dengue outbreak detection, namely supervised classification and unsupervised clustering using topic modelling. Each approach has benefits and shortcomings. Our classifier achieves a prediction accuracy of about 80% based on a small training set of about 1,000 instances, but the need for manual annotation makes it hard to track seasonal changes in the nature of the epidemics, such as the emergence of new types of virus in certain geographical locations. In contrast, LDA-based topic modelling scales well, generating cohesive and well-separated clusters from larger samples. However, while clusters can be easily regenerated following changes in epidemics, this approach makes it hard to clearly segregate relevant tweets into well-defined clusters. Comment: Procs. SoWeMine - co-located with ICWE 2016. 2016, Lugano, Switzerland