459 research outputs found

    Event Detection from Social Media Stream: Methods, Datasets and Opportunities

    Social media streams contain a large and diverse amount of information, ranging from daily-life stories to the latest global and local events and news. Twitter, especially, allows a fast spread of events happening in real time, and enables individuals and organizations to stay informed of the events happening now. Event detection from social media data poses different challenges from traditional text and is a research area that has attracted much attention in recent years. In this paper, we survey a wide range of event detection methods for Twitter data streams, helping readers understand the recent developments in this area. We present the datasets available to the public. Furthermore, a few research opportunities [...] Comment: 8 pages

    Real-time Event Detection on Social Data Streams

    Social networks are quickly becoming the primary medium for discussing what is happening around real-world events. The information that is generated on social platforms like Twitter can produce rich data streams for immediate insights into ongoing matters and the conversations around them. To tackle the problem of event detection, we model events as a list of clusters of trending entities over time. We describe a real-time system for discovering events that is modular in design and novel in scale and speed: it applies clustering on a large stream with millions of entities per minute and produces a dynamically updated set of events. In order to assess clustering methodologies, we build an evaluation dataset derived from a snapshot of the full Twitter Firehose and propose novel metrics for measuring clustering quality. Through experiments and system profiling, we highlight key results from the offline and online pipelines. Finally, we visualize a high profile event on Twitter to show the importance of modeling the evolution of events, especially those detected from social data streams. Comment: Accepted as a full paper at KDD 2019 on April 29, 2019

    Event detection and user interest discovering in social media data streams

    Social media plays an increasingly important role in people’s lives. Microblogging is a form of social media which allows people to share and disseminate real-life events. Broadcasting events in microblogging networks can be an effective method of creating awareness, divulging important information and so on. However, many existing approaches to dissecting the information content primarily discuss the event detection model and ignore the user interest which can be discovered during event evolution. This leads to difficulty in tracking the most important events as they evolve, including identifying the influential spreaders. A further complication is that the influential spreaders' interests also change during event evolution. The influential spreaders play a key role in event evolution, and this has been largely ignored in traditional event detection methods. To this end, we propose an event evolution model based on a user-interest model, named the HEE (Hot Event Evolution) model. This model not only considers the user interest distribution, but also uses the short text data in the social network to model the posts and uses recommendation methods to discover user interests. This can resolve the data sparsity problem that affects many existing event detection methods and improve the accuracy of event detection. A hot event automatic filtering algorithm is initially applied to remove the influence of general events, improving the quality and efficiency of mining the event. Then an automatic topic clustering algorithm is applied to arrange the short texts into clusters with similar topics. An improved user-interest model is proposed to combine the short texts of each cluster into a long text document, simplifying the determination of the overall topic in relation to the interest distribution of each user during the evolution of important events.
Finally, a novel cosine measure based event similarity detection method is used to assess correlation between events, thereby detecting the process of event evolution. The experimental results on a real Twitter dataset demonstrate the efficiency and accuracy of our proposed model for both event detection and user interest discovery during the evolution of hot events.
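
    The abstract above mentions merging the short texts of a cluster into one long document and comparing events with a cosine measure. The following is a minimal sketch of that general idea only, not the HEE model itself: the bag-of-words representation, the whitespace tokenizer, and the names `event_vector` and `cosine_similarity` are assumptions made for illustration.

```python
import math
from collections import Counter


def event_vector(posts: list[str]) -> Counter:
    """Merge the short posts of one cluster into a single bag of words."""
    return Counter(word for post in posts for word in post.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty cluster is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical clusters from two consecutive time windows.
earlier = event_vector(["flood hits city centre", "city flood warning issued"])
later = event_vector(["flood warning lifted in city"])
unrelated = event_vector(["new phone released today"])

# A cluster that shares vocabulary with an earlier one scores higher
# than a cluster about something else.
print(cosine_similarity(earlier, later) > cosine_similarity(earlier, unrelated))
```

    Linking clusters across windows whenever the score exceeds a chosen threshold would give one simple way to trace how an event evolves; the threshold itself would have to be tuned on real data.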

    Analyse et application de la diffusion d'information dans les microblogs

    Microblog services (such as Twitter and Sina Weibo) have become an important platform for Internet content sharing. As the information in microblogs is widely used in public opinion mining, viral marketing and political campaigns, understanding how information diffuses over microblogs, and explaining the process through which some tweets become popular, are important. The analysis of information diffusion in microblogs involves collecting data from microblogs, modeling information spreading, and using the resulting models. Dealing with the huge amount of data flowing through microblogs is by itself a challenge. Designing an efficient and unbiased sampling algorithm for microblogs is therefore essential. Besides, the retweeting process in microblogs is complex because of the ephemerality of information, the topology of the microblog network and the particular features (such as number of followers) of publishers and retweeters. Two traditional models have been used for information diffusion: the Independent Cascades and Linear Threshold models. However, neither of them can describe the retweeting process in microblogs completely and accurately. The analysis and design of new models to characterize information diffusion in microblogs is therefore necessary. Moreover, a comprehensive description of the correlation between information diffusion in microblogs and the search trends of keywords on search engines is lacking, although some work has found preliminary relationships. This work presents a complete analysis of information diffusion in microblogs. The contributions and innovations of this thesis are as follows: 1) There are two popular unbiased Online Social Network (OSN) sampling algorithms, Metropolis-Hastings Random Walk (MHRW) and the Unbiased Sampling for Directed Social Graph (USDSG) method. However, they are both likely to yield considerable self-sampling probabilities when applied to microblogs where there is local [...].
To solve this problem, I have modelled the process of OSN sampling as a Markov process and have deduced the necessary and sufficient conditions for unbiased sampling. Based on these conditions, I proposed an efficient and unbiased sampling algorithm, the Unbiased Sampling method with Dummy Edges (USDE), which strongly reduces the self-sampling probabilities of MHRW. The experimental evaluation demonstrates that the average node degree of samples from MHRW and USDSG is 2 to 4 times as high as the ground truth, while USDE approximates the ground truth when the sampling repetitions are removed. Moreover, the average sampling time per node in USDE is only half that of MHRW and USDSG. 2) A second contribution targets the shortcomings of the Independent Cascades (IC) and Linear Threshold (LT) models in characterizing the retweeting process in microblogs. I achieve this by introducing a Galton-Watson with Killing (GWK) model, which accurately considers all three important factors: the ephemerality of information, the topology of the network and the features of publishers and retweeters. We have validated the applicability of the GWK model over two datasets from Sina Weibo and Twitter and showed that the GWK model can fit 82% of information receivers and 90% of the maximum numbers of hops in the real retweeting process.
Besides, the GWK model is useful for revealing the endogenous and exogenous factors which affect the popularity of tweets. 3) Motivated by the correlation between the popularity and trendiness of topics in microblogs and search trends, I have developed an economic analysis of the market involving a third-party ad broker, which is a popular market in current SEM, and found that the adwords augmenting strategy with the trending and popular topics in Twitter enables the broker to achieve, on average, a fourfold larger return on investment than with a non-augmented strategy, while still maintaining the same level of risk.
Microblogging services (such as Twitter or Sina Weibo) have in recent years become very important platforms for sharing information on the Internet. Microblogs are frequently used for opinion analysis, viral marketing, and political campaigns. Understanding the mechanisms underlying information diffusion on microblogs, and how content becomes popular, is important. Analysing information diffusion in microblogs requires collecting data from microblogs, modelling information diffusion, and applying the resulting models. Processing the massive data produced by microblogs is a challenge in itself. Designing efficient and unbiased algorithms for sampling microblogs is therefore fundamental. This must take into account the complexity of the retweet phenomenon, which depends on the ephemeral value of information, the topology of the microblogging network, and the particular characteristics of publishers and retweeters. Two models have traditionally been applied to information diffusion: the independent cascades model and the linear threshold model. Neither of these models is able to describe the retweeting process correctly. It therefore becomes necessary to characterise information diffusion.
Moreover, a complete description of the relationship between information diffusion in microblogs and the popularity of terms searched on the Internet would be useful. This thesis presents a complete analysis of information diffusion in microblogs. Its contributions are as follows: 1) There are two unbiased sampling techniques for social networks: the Metropolis-Hastings random walk (MHRW) and the unbiased sampling method for directed graphs (USDSG). Nevertheless, both methods can lead to a high rate of self-sampling when applied to microblogs. To solve this problem, I modelled the sampling of an OSN as a Markov process and derived the necessary and sufficient conditions for unbiased sampling. These conditions allowed me to propose an efficient, unbiased sampling algorithm that I named unbiased sampling with dummy edges (USDE). This new sampling method strongly reduces the self-sampling of MHRW. The empirical evaluation shows that the average degree of the sampled nodes is close to the ground truth, whereas for MHRW and USDSG it is 2 to 4 times higher. 2) The second contribution of this thesis addresses the shortcomings of the independent cascades and linear threshold models. I developed a model based on Galton-Watson processes with killing (GWK) that takes into account all the important factors of the retweet process. This new model is validated by applying it to data from Twitter and Weibo. 3) The third contribution concerns the development of an economic model of the market of actors active in keyword marketing on search sites.
I developed keyword portfolio management methods and showed that these portfolios strongly improve returns without increasing the level of risk.
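
    The thesis abstract above takes the Metropolis-Hastings Random Walk (MHRW) as its baseline sampler and notes its self-sampling behaviour. The sketch below shows the textbook MHRW on a small undirected graph; it is not the USDE algorithm proposed in the thesis, and the adjacency-list input, the function name `mhrw_sample`, and the toy graph are assumptions made for illustration.

```python
import random


def mhrw_sample(graph: dict[str, list[str]], start: str, steps: int,
                seed: int = 0) -> list[str]:
    """Walk an undirected graph so that, in the long run, nodes are visited uniformly.

    From node v, a neighbour w is proposed uniformly at random and accepted
    with probability min(1, deg(v) / deg(w)). A rejected proposal repeats v,
    which is the self-sampling step the abstract refers to.
    """
    rng = random.Random(seed)
    current = start
    visited = [current]
    for _ in range(steps):
        candidate = rng.choice(graph[current])
        accept_prob = min(1.0, len(graph[current]) / len(graph[candidate]))
        if rng.random() < accept_prob:
            current = candidate
        visited.append(current)  # a rejected move records `current` again
    return visited


# Hypothetical toy graph: one high-degree hub and four low-degree accounts.
toy_graph = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
}

walk = mhrw_sample(toy_graph, start="hub", steps=5000, seed=1)
repeats = sum(1 for prev, cur in zip(walk, walk[1:]) if prev == cur)
print(f"self-sampling steps: {repeats} of {len(walk) - 1}")
```

    Low-degree nodes next to a hub reject most proposals, so the walk repeats them often. That is one way to see why a sampler with fewer repeated samples, as the abstract claims for USDE, could reduce sampling cost.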

    VIRAL TOPIC PREDICTION AND DESCRIPTION IN MICROBLOG SOCIAL NETWORKS

    Ph.D (DOCTOR OF PHILOSOPHY)

    Finding Bursty Topics From Microblogs

    Microblogs such as Twitter reflect the general public’s reactions to major events. Bursty topics from microblogs reveal what events have attracted the most online attention. Although bursty event detection from text streams has been studied before, previous work may not be suitable for microblogs because compared with other text streams such as news articles and scientific publications, microblog posts are particularly diverse and noisy. To find topics that have bursty patterns on microblogs, we propose a topic model that simultaneously captures two observations: (1) posts published around the same time are more likely to have the same topic, and (2) posts published by the same user are more likely to have the same topic. The former helps find event-driven posts while the latter helps identify and filter out “personal” posts. Our experiments on a large Twitter dataset show that there are more meaningful and unique bursty topics in the top-ranked results returned by our model than an LDA baseline and two degenerate variations of our model. We also show some case studies that demonstrate the importance of considering both the temporal information and users’ personal interests for bursty topic detection from microblogs.
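
    To make the notion of a "bursty pattern" concrete, the sketch below flags time windows in which a topic's post count rises far above its own history. It is a simple z-score heuristic swapped in for illustration and is not the topic model proposed in the paper; the function name `bursty_windows`, the threshold of 2.0, and the counts are assumptions.

```python
import statistics


def bursty_windows(counts: list[int], threshold: float = 2.0) -> list[int]:
    """Return indices of windows whose count exceeds the mean of all earlier
    windows by more than `threshold` population standard deviations."""
    bursts = []
    for i in range(2, len(counts)):  # need at least two earlier windows
        history = counts[:i]
        mean = statistics.mean(history)
        spread = statistics.pstdev(history)
        if spread == 0:
            spread = 1.0  # flat history: fall back to a unit scale
        if (counts[i] - mean) / spread > threshold:
            bursts.append(i)
    return bursts


# Hypothetical hourly post counts for one topic; window 5 is a spike.
hourly_counts = [4, 5, 3, 4, 5, 40, 6]
print(bursty_windows(hourly_counts))  # [5]
```

    A heuristic like this looks only at volume over time. The paper's second observation, that posts by the same user tend to share a topic, is what lets its model separate event-driven bursts from personal chatter, and nothing in this sketch captures that.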