8 research outputs found

    Modeling and OLAPing social media : the case of Twitter

    In recent years, social networks have revolutionized the ways of interacting and exchanging information on the Internet. Millions of users interact frequently and share a variety of digital content with each other. They express their feelings and opinions on every topic of interest. These opinions carry important value for personal, academic, and commercial applications, but the volume and speed at which they are produced make it challenging for researchers and the underlying technologies to provide useful insights into such data. We attempt to extend the established online analytical processing (OLAP) technology to allow multidimensional analysis of social media data. In this paper, we pursue the goal of providing a generic multidimensional model dedicated to the OLAP of social media, and especially Twitter. The proposed model reflects some specifics such as recursive references between tweets, Empty dimension, and different types of hierarchies. It is implemented using the NetBeans IDE platform. We also present some experimental results. We expect our proposed approach to be applicable to analyzing the data of other social networks as well.
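To make the idea of multidimensional analysis concrete, here is a minimal sketch of an OLAP-style roll-up over tweet facts. It is not the model from the paper: the fact rows, the `retweets` measure, the `roll_up` helper, and the `"(none)"` placeholder for tweets without a hashtag are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical fact rows: one per tweet, each carrying its dimension values.
tweets = [
    {"day": date(2013, 1, 5), "hashtag": "olap", "retweets": 2},
    {"day": date(2013, 1, 20), "hashtag": "olap", "retweets": 1},
    {"day": date(2013, 2, 3), "hashtag": "bigdata", "retweets": 4},
    {"day": date(2013, 2, 9), "hashtag": None, "retweets": 0},
]


def roll_up(facts, level, measure):
    """Sum a measure over (time, hashtag) cells at the given time level."""
    cube = Counter()
    for row in facts:
        d = row["day"]
        # Time hierarchy: day -> month -> year.
        time_key = (d.year, d.month) if level == "month" else (d.year,)
        # Tweets without a hashtag fall into a placeholder member (assumption).
        cube[(time_key, row["hashtag"] or "(none)")] += row[measure]
    return cube


monthly = roll_up(tweets, "month", "retweets")
yearly = roll_up(tweets, "year", "retweets")
```

Rolling up from `monthly` to `yearly` merges the cells that share a year while keeping the hashtag dimension unchanged, which is the basic operation a multidimensional model has to support.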

    Literature Explorer: effective retrieval of scientific documents through nonparametric thematic topic detection

    © 2020 The Authors. Published by Springer. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://doi.org/10.1007/s00371-019-01721-7 Scientific researchers today face a rapidly growing volume of literature. While these publications offer rich and valuable information, the scale of the datasets makes it difficult for researchers to manage and search for desired information efficiently. Literature Explorer is a new interactive visual analytics suite that facilitates access to desired scientific literature through mining and interactive visualisation. We propose a novel topic mining method that is able to uncover “thematic topics” from a scientific corpus. These thematic topics have an explicit semantic association to the research themes that are commonly used by human researchers in scientific fields, and hence are human interpretable. They also contribute to effective document retrieval. The visual analytics suite consists of a set of visual components that are closely coupled with the underlying thematic topic detection to support interactive document retrieval. The visual components are integrated according to the design rationale and goals. Evaluation results are given both as objective measurements and in subjective terms through expert assessments. Comparisons are also made against the outcomes of traditional topic modelling methods. This research is supported by the European Commission with project Dr Inventor (No 611383), MyHealthAvatar (No 60929), and by the UK Engineering and Physical Sciences Research Council with project MyLifeHub (EP/L023830/1).

    Topic models for short text data

    Topic models are known to suffer from sparsity when applied to short text data. The problem is caused by the reduced number of observations available for reliable inference (i.e., the words in a document). A popular heuristic for overcoming this problem is to perform some form of document aggregation by context (e.g., author, hashtag) before training. We dedicated one part of this dissertation to explicitly modeling the implicit assumptions of the document aggregation heuristic and applying it to two well-known model architectures: a mixture and an admixture. Our findings indicate that an admixture model benefits more from aggregation than a mixture model, which rarely improved over its baseline (the standard mixture). We also find that the state of the art in short text data can be surpassed as long as every context is shared by a small number of documents. In the second part of the dissertation we develop a more general-purpose topic model which can also be used when contextual information is not available. The proposed model is formulated around the observation that in normal text data, a classic topic model like an admixture works well because patterns of word co-occurrences arise across the documents. However, such patterns are less likely to arise in a short text dataset. The model assumes every document is a bag of word co-occurrences, where each co-occurrence belongs to a latent topic. The documents are enhanced a priori with related co-occurrences from the other documents, so that the collection has a greater chance of exhibiting word patterns. The proposed model performs well, managing to surpass the state of the art and popular topic model baselines.
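The two preprocessing ideas in this abstract, aggregating short documents by context and treating a document as a bag of word co-occurrences, can be sketched as follows. This is only an illustration under assumptions, not the dissertation's method: the sample posts, the whitespace tokenizer, and the helper names `aggregate_by_context` and `co_occurrences` are hypothetical, and no topic model is trained here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical short documents, each tagged with a context (here: the author).
posts = [
    {"author": "ana", "text": "new topic model paper"},
    {"author": "ana", "text": "short text is sparse"},
    {"author": "bo", "text": "topic model for tweets"},
]


def aggregate_by_context(docs, key):
    """Pool the tokens of all documents that share the same context value."""
    pooled = defaultdict(list)
    for doc in docs:
        pooled[doc[key]].extend(doc["text"].split())
    return dict(pooled)


def co_occurrences(tokens):
    """Return the unordered pairs of distinct words in one document."""
    return list(combinations(sorted(set(tokens)), 2))


pooled = aggregate_by_context(posts, "author")
pairs = co_occurrences(posts[2]["text"].split())
```

Pooling gives each pseudo-document more observations per inference unit, while the pair representation exposes word patterns directly even when a single post contains only a handful of words.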

    Leveraging multi-dimensional, multi-source knowledge for user preference modeling and event summarization in social media

    An unprecedented development of various kinds of social media platforms, such as Twitter, Facebook and Foursquare, has been witnessed in recent years. The huge amount of user-generated data they produce is multi-dimensional in nature. Some dimensions are explicitly observed, such as user profiles, text of social media posts, time, and location information. Others can be implicit and need to be inferred, reflecting the inherent structures of social media data. Examples include popular topics discussed on Twitter or Facebook, or the geographical clusters based on user check-in activities from Foursquare. It is of great interest to both research communities and commercial organizations to understand such heterogeneous data and leverage available information from multiple dimensions to facilitate social media applications, such as user preference modeling and event summarization. This dissertation first presents a general discriminative learning approach for modeling multi-dimensional knowledge in a supervised setting. A learning protocol is established to model both explicit and implicit knowledge in a unified manner, which applies to general classification/prediction tasks. This approach accommodates heterogeneous data dimensions and significantly boosts the expressiveness of existing discriminative learning approaches. It stands out for its capability to model latent features, for which arbitrary generative assumptions are allowed. Besides being multi-dimensional, social media data are unstructured, fragmented and noisy. Mining them becomes even more challenging because many real applications come with no available annotation and must be handled in an unsupervised setting. This dissertation addresses this issue from a novel angle: external sources such as news media and knowledge bases are exploited to provide supervision. I describe a unified framework which links traditional news data to Twitter and enables effective knowledge discovery such as event detection and summarization.

    Dynamic multi-faceted topic discovery in Twitter

    Microblogging platforms, such as Twitter, already play an important role in cultural, social and political events around the world. Discovering high-level topics from social streams is therefore important for many downstream applications. However, traditional text mining methods that rely on the bag-of-words model are insufficient to uncover the rich semantics and temporal aspects of topics in Twitter. In particular, topics in Twitter are inherently dynamic and often focus on specific entities, such as people or organizations. In this paper, we therefore propose a method for mining multifaceted topics from Twitter streams. The proposed Multi-Faceted Topic Model (MfTM) jointly models latent semantics among terms and entities and captures the temporal characteristics of each topic. We develop an efficient online inference method for MfTM, which enables our model to be applied to large-scale and streaming data. Our experimental evaluation shows the effectiveness and efficiency of our model compared with state-of-the-art baselines. We further demonstrate the effectiveness of our framework in the context of tweet clustering.