5,437 research outputs found

    Ordering-sensitive and Semantic-aware Topic Modeling

    Full text link
    Topic modeling of textual corpora is an important and challenging problem. In most previous work, the "bag-of-words" assumption is usually made which ignores the ordering of words. This assumption simplifies the computation, but it unrealistically loses the ordering information and the semantic of words in the context. In this paper, we present a Gaussian Mixture Neural Topic Model (GMNTM) which incorporates both the ordering of words and the semantic meaning of sentences into topic modeling. Specifically, we represent each topic as a cluster of multi-dimensional vectors and embed the corpus into a collection of vectors generated by the Gaussian mixture model. Each word is affected not only by its topic, but also by the embedding vector of its surrounding words and the context. The Gaussian mixture components and the topic of documents, sentences and words can be learnt jointly. Extensive experiments show that our model can learn better topics and more accurate word distributions for each topic. Quantitatively, comparing to state-of-the-art topic modeling approaches, GMNTM obtains significantly better performance in terms of perplexity, retrieval accuracy and classification accuracy.Comment: To appear in proceedings of AAAI 201

    Structured sentiment analysis in social media

    Get PDF

    Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives

    Get PDF
    How did the popularity of the Greek Prime Minister evolve in 2015? How did the predominant sentiment about him vary during that period? Were there any controversial sub-periods? What other entities were related to him during these periods? To answer these questions, one needs to analyze archived documents and data about the query entities, such as old news articles or social media archives. In particular, user-generated content posted in social networks, like Twitter and Facebook, can be seen as a comprehensive documentation of our society, and thus meaningful analysis methods over such archived data are of immense value for sociologists, historians and other interested parties who want to study the history and evolution of entities and events. To this end, in this paper we propose an entity-centric approach to analyze social media archives and we define measures that allow studying how entities were reflected in social media in different time periods and under different aspects, like popularity, attitude, controversiality, and connectedness with other entities. A case study using a large Twitter archive of four years illustrates the insights that can be gained by such an entity-centric and multi-aspect analysis.Comment: This is a preprint of an article accepted for publication in the International Journal on Digital Libraries (2018

    A Social Media Tax Data Warehouse to Manage the Underground Economy

    Full text link
    © 2018 IEEE. Social media can provide a wealth of information valuable to tax administrators in managing the underground economy. This paper proposes a data warehouse design to integrate social media data into tax analytics processes. The warehouse is designed to support modern tax administration strategies that encourage self-regulation and voluntary compliance by shaping public opinion, improving services and developing inclusive tax policies. The warehouse also incorporates the use of social media analytics to support tax evasion detection and enforcement activities such as compliance risk assessment, audits, inspections and investigations

    Inferring Dynamic User Interests in Streams of Short Texts for User Clustering

    Get PDF
    User clustering has been studied from different angles. In order to identify shared interests, behavior-based methods consider similar browsing or search patterns of users, whereas content-based methods use information from the contents of the documents visited by the users. So far, content-based user clustering has mostly focused on static sets of relatively long documents. Given the dynamic nature of social media, there is a need to dynamically cluster users in the context of streams of short texts. User clustering in this setting is more challenging than in the case of long documents, as it is difficult to capture the users’ dynamic topic distributions in sparse data settings. To address this problem, we propose a dynamic user clustering topic model (UCT). UCT adaptively tracks changes of each user’s time-varying topic distributions based both on the short texts the user posts during a given time period and on previously estimated distributions. To infer changes, we propose a Gibbs sampling algorithm where a set of word pairs from each user is constructed for sampling. UCT can be used in two ways: (1) as a short-term dependency model that infers a user’s current topic distribution based on the user’s topic distributions during the previous time period only, and (2) as a long-term dependency model that infers a user’s current topic distributions based on the user’s topic distributions during multiple time periods in the past. The clustering results are explainable and human-understandable, in contrast to many other clustering algorithms. For evaluation purposes, we work with a dataset consisting of users and tweets from each user. Experimental results demonstrate the effectiveness of our proposed short-term and long-term dependency user clustering models compared to state-of-the-art baselines

    Deriving Verb Predicates By Clustering Verbs with Arguments

    Full text link
    Hand-built verb clusters such as the widely used Levin classes (Levin, 1993) have proved useful, but have limited coverage. Verb classes automatically induced from corpus data such as those from VerbKB (Wijaya, 2016), on the other hand, can give clusters with much larger coverage, and can be adapted to specific corpora such as Twitter. We present a method for clustering the outputs of VerbKB: verbs with their multiple argument types, e.g. "marry(person, person)", "feel(person, emotion)." We make use of a novel low-dimensional embedding of verbs and their arguments to produce high quality clusters in which the same verb can be in different clusters depending on its argument type. The resulting verb clusters do a better job than hand-built clusters of predicting sarcasm, sentiment, and locus of control in tweets

    High-Performance Modelling and Simulation for Big Data Applications

    Get PDF
    This open access book was prepared as a Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)“ project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. When their level of abstraction raises to have a better discernment of the domain at hand, their representation gets increasingly demanding for computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. It is then arguably required to have a seamless interaction of High Performance Computing with Modelling and Simulation in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for their members and distinguished guests to openly discuss novel perspectives and topics of interests for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications
    • …
    corecore