895 research outputs found

    Adaptive Representations for Tracking Breaking News on Twitter

    Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task as tweets are short, and standard retrieval methods often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive mechanisms for tracking and summarizing breaking news stories. We evaluate the effectiveness of these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on ROUGE metrics indicate that adaptive approaches are best suited for tracking evolving stories on Twitter.

    Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

    We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data

    Online indexing and clustering of social media data for emergency management

    Social media has become a vital part of our daily communication practice, creating a huge amount of data and covering different real-world situations. There is currently a growing tendency to make use of social media during emergency management and response. Most of this effort is performed by a large number of volunteers who browse through social media data and prepare maps that can be used by professional first responders. Automatic analysis approaches are needed to directly support the response teams in monitoring and understanding the evolution of facts in social media during an emergency situation. In this paper, we investigate the problem of real-time sub-event identification in social media data (i.e., Twitter, Flickr and YouTube) during emergencies. A processing framework is presented that generates situational reports/summaries from social media data. This framework relies in particular on online indexing and online clustering of media data streams. Online indexing aims at tracking the relevant vocabulary to capture the evolution of sub-events over time. Online clustering, on the other hand, is used to detect and update the set of sub-events using the indices built during online indexing. To evaluate the framework, social media data related to Hurricane Sandy (2012) was collected and used in a series of experiments. In particular, some online indexing methods have been tested against a proposed method to show their suitability. Moreover, the quality of online clustering has been studied using standard clustering indices. Overall, the framework provides a great opportunity for supporting emergency responders, as demonstrated in real-world emergency exercises.
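
    The online clustering step described in this abstract can be illustrated with a minimal single-pass sketch. This is not the framework from the paper: the class name `OnlineClusterer`, the whitespace tokenizer, the cosine measure, and the 0.3 threshold are all assumptions chosen for illustration. Each incoming post either joins the most similar existing cluster (whose term counts are then updated) or starts a new one.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class OnlineClusterer:
    """Hypothetical single-pass clusterer; one Counter of term counts per cluster."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value, would need tuning on real data
        self.clusters = []

    def add(self, text):
        """Assign one post to a cluster and return the cluster index."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, summary in enumerate(self.clusters):
            sim = cosine(vec, summary)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Update the running vocabulary of the matched sub-event.
            self.clusters[best].update(vec)
            return best
        self.clusters.append(vec)
        return len(self.clusters) - 1
```

    Because each post is seen once and only per-cluster term counts are kept, memory grows with the number of clusters rather than the number of posts, which is the property that makes this style of clustering usable on a stream.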

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015
    As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain, social media data analysis, and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms. In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, the traditional synchronization method becomes expensive and severely impacts the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
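
    The idea of broadcasting incremental changes instead of whole centroids can be sketched as follows. This is an illustrative single-process model, not the dissertation's implementation: the function names `local_delta`, `apply_deltas`, and `centroid`, and the dict-based sparse vectors, are assumptions. Each worker summarizes only the dimensions touched by its own batch, and the shared state is updated by merging those small summaries.

```python
def local_delta(assignments):
    """Summarize one worker's batch.

    assignments: iterable of (cluster_id, sparse_vector) pairs, where a
    sparse vector is a dict mapping dimension -> value.
    Returns {cluster_id: (summed_sparse_vector, point_count)}.
    """
    deltas = {}
    for cid, vec in assignments:
        summed, count = deltas.get(cid, ({}, 0))
        for dim, value in vec.items():
            summed[dim] = summed.get(dim, 0.0) + value
        deltas[cid] = (summed, count + 1)
    return deltas


def apply_deltas(sums, counts, deltas):
    """Merge a broadcast delta into the shared per-cluster sums and counts."""
    for cid, (summed, count) in deltas.items():
        target = sums.setdefault(cid, {})
        for dim, value in summed.items():
            target[dim] = target.get(dim, 0.0) + value
        counts[cid] = counts.get(cid, 0) + count


def centroid(sums, counts, cid):
    """Recover the current centroid of a cluster from its sum and count."""
    n = counts[cid]
    return {dim: value / n for dim, value in sums[cid].items()}
```

    With sparse, high-dimensional text vectors, a delta contains only the few dimensions that appeared in the batch, whereas a full centroid accumulates every dimension the cluster has ever seen. That size difference is what makes the delta cheaper to send.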

    TweeProfiles4: a weighted multidimensional stream clustering algorithm

    The emergence of social media made it possible for users to easily share their thoughts on different topics, which constitutes a rich source of information for many fields. Microblogging platforms have experienced large and steady growth over the last few years. Twitter is the most popular microblogging site, making it an interesting source of data for pattern extraction. One of the main challenges of analyzing social media data is its continuous nature, which makes it hard to apply traditional data mining. Therefore, mining stream data has received a lot of attention recently. TweeProfiles is a data mining tool for analyzing and visualizing Twitter data over four dimensions: spatial (the location of the tweet), temporal (the timestamp of the tweet), content (the text of the tweet) and social (the relationship graph). This is an ongoing project which still has many aspects that can be improved. For instance, it was recently improved by replacing the original clustering algorithm, which could not handle the continuous flow of data, with a streaming method. The goal of this dissertation is to continue the development of TweeProfiles. First, the stream clustering process will be improved by proposing a new algorithm. This will be achieved by developing an incremental algorithm with support for multi-dimensional streaming data. Moreover, it should make it possible for the user to dynamically change the relative importance of each dimension in the clustering. Additionally, the empirical evaluation of the results will be improved: suitable measures to evaluate the extracted patterns will be identified and implemented. An empirical study will be done using data consisting of georeferenced tweets from SocialBus.
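
    One way to let a user change the relative importance of each dimension is a weighted combination of per-dimension distances. The sketch below is an assumption-laden illustration, not the TweeProfiles4 algorithm: the tweet fields `loc`, `ts`, and `text`, the normalization constants `max_km` and `max_seconds`, and the use of Jaccard distance for content are all hypothetical, and the social dimension is left out for brevity.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of the two texts' word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def combined_distance(t1, t2, weights, max_km=20000.0, max_seconds=86400.0):
    """Weighted average of per-dimension distances, each scaled to [0, 1].

    weights must contain the keys "spatial", "temporal", and "content",
    with a positive sum. Changing the weights changes which dimension
    dominates the clustering without touching the data.
    """
    parts = {
        "spatial": min(haversine_km(t1["loc"], t2["loc"]) / max_km, 1.0),
        "temporal": min(abs(t1["ts"] - t2["ts"]) / max_seconds, 1.0),
        "content": jaccard_distance(t1["text"], t2["text"]),
    }
    total = sum(weights[k] for k in parts)
    return sum(weights[k] * parts[k] for k in parts) / total
```

    Scaling every component to the same range before weighting keeps the weights interpretable: a weight of 2 on content means content differences count twice as much as an equally large normalized difference in space or time.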

    Making Sense of Social Events by Event monitoring, Visualization and Underlying Community Profiling

    With the prevalence of intelligent devices, social networks have been playing an increasingly important role in our daily life. Various social networks (e.g., Twitter, Facebook) provide convenient platforms for users to explore the world. In this thesis, we study the problem of multi-perspective analysis of social events detected from social networks. In particular, we aim to make sense of the social events from the following three perspectives: 1) what these social events are about; 2) how these events evolve over time; 3) who is involved in the discussions on these events. We mainly work on two categories of social data: user-generated content such as tweets and Facebook posts, and user interactions such as the follow and reply behaviours among users. On one hand, the posts reveal valuable information that describes the evolution of miscellaneous social events, which is crucial for people to understand the world. On the other hand, users' interactions demonstrate the relationships among users and thus provide opportunities for analysing the underlying communities behind the social events. However, it is not practical to manually detect social events, monitor event evolution or profile the underlying communities from the massive amount of social data generated every day. Hence, how to efficiently and effectively extract, manage and analyse the useful information from the social data for multi-perspective understanding of social events is of great importance. Social data is a dynamic source of information which enables people to stay informed of what is happening now and who the active and influential users discussing these social events are. For one thing, social data is generated by people worldwide at all times, which may allow events to be identified even before the mainstream media reports them. Moreover, the continuous stream of social data reflects the evolution of events and characterizes the events with changing opinions at different stages. This gives people an opportunity to respond to urgent events in a timely manner. For another, users are often not isolated in social networks. The interactions between users can be utilized to discover the communities that discuss each social event. Underlying community profiling provides answers to questions such as who is interested in these events, and which group of people are the most influential users in spreading certain event topics. These answers deepen our understanding of the social events by considering not only the events themselves but also the users behind these events. The first research task in this thesis is to monitor and index the evolving events from social textual content. The social data cover a wide variety of events which typically evolve over time. Although event detection has been actively studied, most existing approaches do not track the evolution of events, nor do they address the issue of efficient monitoring in the presence of a large number of events. In this task, we detect events based on the user-generated textual content and design four event operations to capture the dynamics of events. Moreover, we propose a novel event indexing structure, called Multi-layer Inverted List, to manage dynamic event databases for the acceleration of large-scale event search and update. The second research task is to explore multiple features for social event tracking and visualization. In addition to the textual content utilized in the first task, social data contains various features, such as images and timestamps. The benefits of incorporating different features into event detection are twofold. First, these features provide supplemental information that facilitates the event detection model. Second, different features describe the detected events from different aspects, which gives users a better understanding through more vivid visualizations. To improve the event detection performance, we propose a novel generative probabilistic model which jointly models five different features. Event evolution tracking is achieved by applying maximum-weighted bipartite graph matching to the events discovered in consecutive periods. Events are then visualized by representative images selected based on our three defined criteria. The third research task is to detect and profile the underlying social communities in social events. The social data not only contains user-generated content which describes the evolution of events, but also comprises various information on the users who discuss these events, such as user attributes and user behaviours. Comprehensively utilizing this user information can help to group similar users into communities, and enrich the social event analysis from the community perspective. Motivated by the rich semantics about user behaviours hidden in social data, we extend the definition of a community to a group of users who are not only densely connected, but also behave similarly. Moreover, in addition to detecting the communities, we further profile each of the detected communities for social event analysis. A novel community profiling model is designed to detect and characterize a community by both its content profile (what a community is about) and its diffusion profile (how it interacts with others).
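
    The event evolution tracking step mentioned in this abstract, maximum-weighted bipartite matching between events in consecutive periods, can be sketched with a brute-force search that is only practical for a handful of events. This is not the thesis's model: representing an event as a keyword set, weighting edges by Jaccard similarity, and the `min_sim` cut-off are assumptions made for illustration. For larger inputs a polynomial-time assignment solver (for example the Hungarian algorithm, as provided by `scipy.optimize.linear_sum_assignment`) would normally replace the exhaustive loop.

```python
from itertools import permutations


def similarity(a, b):
    """Jaccard similarity between two keyword sets (assumed edge weight)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_events(prev, curr, min_sim=0.1):
    """Link events across two periods by maximum total similarity.

    prev, curr: lists of keyword sets. Returns (prev_index, curr_index)
    pairs; pairs weaker than min_sim are dropped, so unmatched events can
    be read as ended (prev side) or newly emerged (curr side).
    """
    swap = len(prev) > len(curr)
    small, large = (curr, prev) if swap else (prev, curr)
    best_score, best_pairs = -1.0, []
    # Exhaustive search: every one-to-one assignment of the smaller side.
    for perm in permutations(range(len(large)), len(small)):
        pairs = list(enumerate(perm))
        score = sum(similarity(small[i], large[j]) for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    result = []
    for i, j in best_pairs:
        if similarity(small[i], large[j]) >= min_sim:
            result.append((j, i) if swap else (i, j))
    return result
```

    Matching on the total weight rather than greedily picking each event's nearest neighbour prevents two earlier events from both claiming the same later event.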