231 research outputs found

    TweeProfiles4: a weighted multidimensional stream clustering algorithm

    Get PDF
    O aparecimento das redes sociais abriu aos utilizadores a possibilidade de facilmente partilharem as suas ideias a respeito de diferentes temas, o que constitui uma fonte de informação enriquecedora para diversos campos. As plataformas de microblogging sofreram um grande crescimento e de forma constante nos últimos anos. O Twitter é o site de microblogging mais popular, tornando-se uma fonte de dados interessante para extração de conhecimento. Um dos principais desafios na análise de dados provenientes de redes sociais é o seu fluxo, o que dificulta a aplicação de processos tradicionais de data mining. Neste sentido, a extração de conhecimento sobre fluxos de dados tem recebido um foco significativo recentemente. O TweeProfiles é a uma ferramenta de data mining para análise e visualização de dados do Twitter sobre quatro dimensões: espacial (a localização geográfica do tweet), temporal (a data de publicação do tweet), de conteúdo (o texto do tweet) e social (o grafo dos relacionamentos). Este é um projeto em desenvolvimento que ainda possui muitos aspetos que podem ser melhorados. Uma das recentes melhorias inclui a substituição do algoritmo de clustering original, o qual não suportava o fluxo contínuo dos dados, por um método de streaming. O objetivo desta dissertação passa pela continuação do desenvolvimento do TweeProfiles. Em primeiro lugar, será proposto um novo algoritmo de clustering para fluxos de dados com o objetivo de melhorar o existente. Para esse efeito será desenvolvido um algoritmo incremental com suporte para fluxos de dados multi-dimensionais. Esta abordagem deve permitir ao utilizador alterar dinamicamente a importância relativa de cada dimensão do processo de clustering. Adicionalmente, a avaliação empírica dos resultados será alvo de melhoramento através da identificação e implementação de medidas adequadas de avaliação dos padrões extraídos. O estudo empírico será realizado através de tweets georreferenciados obtidos pelo SocialBus.The emergence of social media made it possible for users to easily share their thoughts on different topics, which constitutes a rich source of information for many fields. Microblogging platforms experienced a large and steady growth over the last few years. Twitter is the most popular microblogging site, making it an interesting source of data for pattern extraction. One of the main challenges of analyzing social media data is its continuous nature, which makes it hard to use traditional data mining. Therefore, mining stream data has also received a lot of attention recently.TweeProfiles is a data mining tool for analyzing and visualizing Twitter data over four dimensions: spatial (the location of the tweet), temporal (the timestamp of the tweet), content (the text of the tweet) and social (relationship graph). This is an ongoing project which still has many aspects that can be improved. For instance, it was recently improved by replacing the original clustering algorithm which could not handle the continuous flow of data with a streaming method. The goal of this dissertation is to continue the development of TweeProfiles. First, the stream clustering process will be improved by proposing a new algorithm. This will be achieved by developing an incremental algorithm with support for multi-dimensional streaming data. Moreover, it should make it possible for the user to dynamically change the relative importance of each dimension in the clustering. Additionally, the empirical evaluation of the results will also be improved.Suitable measures to evaluate the extracted patterns will be identified and implemented. An empirical study will be done using data consisting of georeferenced tweets from SocialBus

    Traveling Trends: Social Butterflies or Frequent Fliers?

    Full text link
    Trending topics are the online conversations that grab collective attention on social media. They are continually changing and often reflect exogenous events that happen in the real world. Trends are localized in space and time as they are driven by activity in specific geographic areas that act as sources of traffic and information flow. Taken independently, trends and geography have been discussed in recent literature on online social media; although, so far, little has been done to characterize the relation between trends and geography. Here we investigate more than eleven thousand topics that trended on Twitter in 63 main US locations during a period of 50 days in 2013. This data allows us to study the origins and pathways of trends, how they compete for popularity at the local level to emerge as winners at the country level, and what dynamics underlie their production and consumption in different geographic areas. We identify two main classes of trending topics: those that surface locally, coinciding with three different geographic clusters (East coast, Midwest and Southwest); and those that emerge globally from several metropolitan areas, coinciding with the major air traffic hubs of the country. These hubs act as trendsetters, generating topics that eventually trend at the country level, and driving the conversation across the country. This poses an intriguing conjecture, drawing a parallel between the spread of information and diseases: Do trends travel faster by airplane than over the Internet?Comment: Proceedings of the first ACM conference on Online social networks, pp. 213-222, 201

    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    Get PDF
    A streaming data clustering algorithm is presented building upon the density-based selforganizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets

    Survival analysis on data streams: Analyzing temporal events in dynamically changing environments

    Get PDF
    In this paper, we introduce a method for survival analysis on data streams. Survival analysis (also known as event history analysis) is an established statistical method for the study of temporal “events” or, more specifically, questions regarding the temporal distribution of the occurrence of events and their dependence on covariates of the data sources. To make this method applicable in the setting of data streams, we propose an adaptive variant of a model that is closely related to the well-known Cox proportional hazard model. Adopting a sliding window approach, our method continuously updates its parameters based on the event data in the current time window. As a proof of concept, we present two case studies in which our method is used for different types of spatio-temporal data analysis, namely, the analysis of earthquake data and Twitter data. In an attempt to explain the frequency of events by the spatial location of the data source, both studies use the location as covariates of the sources

    Localized Events in Social Media Streams: Detection, Tracking, and Recommendation

    Get PDF
    From the recent proliferation of social media channels to the immense amount of user-generated content, an increasing interest in social media mining is currently being witnessed. Messages continuously posted via these channels report a broad range of topics from daily life to global and local events. As a consequence, this has opened new opportunities for mining event information crucial in many application domains, especially in increasing the situational awareness in critical scenarios. Interestingly, many of these messages are enriched with location information, due to the wide- spread of mobile devices and the recent advancements of today’s location acquisition techniques. This enables location-aware event mining, i.e., the detection and tracking of localized events. In this thesis, we propose novel frameworks and models that digest social media content for localized event detection, tracking, and recommendation. We first develop KeyPicker, a framework to extract and score event-related keywords in an online fashion, accounting for high levels of noise, temporal heterogeneity and outliers in the data. Then, LocEvent is proposed to incrementally detect and track events using a 4-stage procedure. That is, LocEvent receives the keywords extracted by KeyPicker, identifies local keywords, spatially clusters them, and finally scores the generated clusters. For each detected event, a set of descriptive keywords, a location, and a time interval are estimated at a fine-grained resolution. In addition to the sparsity of geo-tagged messages, people sometimes post about events far away from an event’s location. Such spatial problems are handled by novel spatial regularization techniques, namely, graph- and gazetteer-based regularization. To ensure scalability, we utilize a hierarchical spatial index in addition to a multi-stage filtering procedure that gradually suppresses noisy words and considers only event-related ones for complex spatial computations. As for recommendation applications, we propose an event recommender system built upon model-based collaborative filtering. Our model is able to suggest events to users, taking into account a number of contextual features including the social links between users, the topical similarities of events, and the spatio-temporal proximity between users and events. To realize this model, we employ and adapt matrix factorization, which allows for uncovering latent user-event patterns. Our proposed features contribute to directing the learning process towards recommendations that better suit the taste of users, in particular when new users have very sparse (or even no) event attendance history. To evaluate the effectiveness and efficiency of our proposed approaches, extensive comparative experiments are conducted using datasets collected from social media channels. Our analysis of the experimental results reveals the superiority and advantages of our frameworks over existing methods in terms of the relevancy and precision of the obtained results
    corecore