
    Event detection in location-based social networks

    With the advent of social networks and the rise of mobile technologies, users have become ubiquitous sensors capable of monitoring various real-world events in a crowd-sourced manner. Location-based social networks have proven to be faster than traditional media channels in reporting and geo-locating breaking news; e.g., Osama Bin Laden’s death was first confirmed on Twitter even before the announcement from the communication department at the White House. However, the deluge of user-generated data on these networks requires intelligent systems capable of identifying and characterizing such events in a comprehensive manner. The data mining community coined the term event detection to refer to the task of uncovering emerging patterns in data streams. Nonetheless, most data mining techniques do not reproduce the underlying data generation process, which hampers their ability to self-adapt in fast-changing scenarios. Because of this, we propose a probabilistic machine learning approach to event detection which explicitly models the data generation process and enables reasoning about the discovered events. With the aim of setting forth the differences between both approaches, we present two techniques for the problem of event detection in Twitter: a data mining technique called Tweet-SCAN and a machine learning technique called Warble. We assess and compare both techniques on a dataset of tweets geo-located in the city of Barcelona during its annual festivities.
Last but not least, we present the algorithmic changes and data processing frameworks to scale up the proposed techniques to big data workloads. This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract (TIN2015-65316), by the Severo Ochoa Program (SEV2015-0493), by SGR programs of the Catalan Government (2014-SGR-1051, 2014-SGR-118), Collectiveware (TIN2015-66863-C2-1-R) and BSC/UPC NVIDIA GPU Center of Excellence. We would also like to thank the reviewers for their constructive feedback. Peer Reviewed. Postprint (author's final draft

    Active learning in annotating micro-blogs dealing with e-reputation

    Elections unleash strong political views on Twitter, but what do people really think about politics? Opinion and trend mining on micro-blogs dealing with politics has recently attracted researchers in several fields, including Information Retrieval and Machine Learning (ML). Since the performance of ML and Natural Language Processing (NLP) approaches is limited by the amount and quality of data available, one promising alternative for some tasks is the automatic propagation of expert annotations. This paper intends to develop a so-called active learning process for automatically annotating French-language tweets that deal with the image (i.e., representation, web reputation) of politicians. Our main focus is on the methodology followed to build an original annotated dataset expressing opinions about two French politicians over time. We therefore review state-of-the-art NLP-based ML algorithms to automatically annotate tweets using a manual initiation step as bootstrap. This paper focuses on key issues in active learning while building a large annotated dataset from noisy data. This noise is introduced by human annotators, the abundance of data, and the label distribution across data and entities. In turn, we show that Twitter characteristics such as the author's name or hashtags can be considered as the bearing point not only to improve automatic systems for Opinion Mining (OM) and Topic Classification but also to reduce noise in human annotations. However, a later thorough analysis shows that reducing noise might induce the loss of crucial information. Comment: Journal of Interdisciplinary Methodologies and Issues in Science - Vol 3 - Contextualisation digitale - 201

    Illicit Activity Detection in Large-Scale Dark and Opaque Web Social Networks

    Many online chat applications live in a grey area between the legitimate web and the dark net. The Telegram network in particular can aid criminal activities. Telegram hosts “chats” which consist of varied conversations and advertisements. These chats take place among automated “bots” and human users. Distinguishing legitimate from illegitimate activity can aid law enforcement in finding criminals. Social network analysis of Telegram chats presents a difficult problem. Users can change their username or create new accounts. Users involved in criminal activity often do this to obscure their identity. This makes establishing the unique identity behind a given username challenging. Thus we explored classifying users from their language usage in their chat messages. The volume and velocity of Telegram chat data place it well within the domain of big data. Machine learning and natural language processing (NLP) tools are necessary to classify this chat data. We developed NLP tools for classifying users and the chat group to which their messages belong. We found that legitimate and illegitimate chat groups could be classified with high accuracy. We were also able to classify bots, humans, and advertisements within conversations.

    PresenceSense: Zero-training Algorithm for Individual Presence Detection based on Power Monitoring

    Non-intrusive presence detection of individuals in commercial buildings is much easier to implement than intrusive methods such as passive infrared sensors, acoustic sensors, and cameras. Individual power consumption, while providing useful feedback and motivation for energy saving, can be used as a valuable source for presence detection. We conduct pilot experiments in an office setting to collect individual presence data by ultrasonic sensors, acceleration sensors, and WiFi access points, in addition to the individual power monitoring data. PresenceSense (PS), a semi-supervised learning algorithm based on power measurement that trains itself with only unlabeled data, is proposed, analyzed and evaluated in the study. Without any labeling effort, which is usually tedious and time-consuming, PresenceSense outperforms popular models whose parameters are optimized over a large training set. The results are interpreted and potential applications of PresenceSense to other data sources are discussed. This study is relevant to space security, occupancy behavior modeling, and energy saving of plug loads. Comment: BuildSys 201

    Early Detection of Mass Disaster Events Using Social Media Data

    During a mass disaster, social media are a major source of information providing first-hand accounts of the unfolding situation. Automated ways to discover and collate this information in real time can be of critical value for humanitarian operations. Prior work on this task largely focused on developing message classifiers restricted to particular types of disasters, such as storms or wildfires. In this paper we investigate machine-learning methods to detect crisis-related messages where the type of the crisis is not known in advance. The methods are potentially of much greater practical value, as they can provide the means to deal with a wide range of crisis situations, including those that involve combinations of disaster types and types that were unknown at the training stage. The key challenge with this task is that events of potential relevance are extremely diverse and, correspondingly, both training and test data are highly heterogeneous. This data heterogeneity makes it significantly harder for machine learning algorithms to generalize and accurately label incoming data. Our main contributions are an investigation of the scope of this problem in the context of disaster management, and novel message classification methods to overcome data heterogeneity based on ensemble methods, semi-supervised learning and feature selection. We evaluate the proposed methods on an academic benchmark dataset comprising twenty-six different disaster events, as well as in a case study where we assess the performance of the methods on real-world data. The experimental evaluation shows that the methods achieve quality of classification superior to methods previously used for this task.