15,930 research outputs found

    Triple Non-negative Matrix Factorization Technique for Sentiment Analysis and Topic Modeling

    Get PDF
    Topic modeling refers to the process of algorithmically sorting documents into categories based on some common relationship between the documents. This common relationship between the documents is considered the “topic” of the documents. Sentiment analysis refers to the process of algorithmically sorting a document into a positive or negative category depending whether this document expresses a positive or negative opinion on its respective topic. In this paper, I consider the open problem of document classification into a topic category, as well as a sentiment category. This has a direct application to the retail industry where companies may want to scour the web in order to find documents (blogs, Amazon reviews, etc.) which both speak about their product, and give an opinion on their product (positive, negative or neutral). My solution to this problem uses a Non-negative Matrix Factorization (NMF) technique in order to determine the topic classifications of a document set, and further factors the matrix in order to discover the sentiment behind this category of product

    Topic Analysis of Tweets on the European Refugee Crisis Using Non-negative Matrix Factorization

    Get PDF
    The ongoing European Refugee Crisis has been one of the most popular trending topics on Twitter for the past 8 months. This paper applies topic modeling on bulks of tweets to discover the hidden patterns within these social media discussions. In particular, we perform topic analysis through solving Non-negative Matrix Factorization (NMF) as an Inexact Alternating Least Squares problem. We accelerate the computation using techniques including tweet sampling and augmented NMF, compare NMF results with different ranks and visualize the outputs through topic representation and frequency plots. We observe that supportive sentiments maintained a strong presence while negative sentiments such as safety concerns have emerged over time

    Memory-Efficient Topic Modeling

    Full text link
    As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages with a large amount of memory space, increasing linearly with the number of documents or the number of topics. Therefore, the high memory usage is often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm without storing previous messages for training LDA: tiny belief propagation (TBP). The basic idea of TBP relates the message passing algorithms with the non-negative matrix factorization (NMF) algorithms, which absorb the message updating into the message passing process, and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably well or even better than current state-of-the-art training algorithms for LDA but with a much less memory consumption. TBP can do topic modeling when massive corpora cannot fit in the computer memory, for example, extracting thematic topics from 7 GB PUBMED corpora on a common desktop computer with 2GB memory.Comment: 20 pages, 7 figure

    Statistical Traffic State Analysis in Large-scale Transportation Networks Using Locality-Preserving Non-negative Matrix Factorization

    Get PDF
    Statistical traffic data analysis is a hot topic in traffic management and control. In this field, current research progresses focus on analyzing traffic flows of individual links or local regions in a transportation network. Less attention are paid to the global view of traffic states over the entire network, which is important for modeling large-scale traffic scenes. Our aim is precisely to propose a new methodology for extracting spatio-temporal traffic patterns, ultimately for modeling large-scale traffic dynamics, and long-term traffic forecasting. We attack this issue by utilizing Locality-Preserving Non-negative Matrix Factorization (LPNMF) to derive low-dimensional representation of network-level traffic states. Clustering is performed on the compact LPNMF projections to unveil typical spatial patterns and temporal dynamics of network-level traffic states. We have tested the proposed method on simulated traffic data generated for a large-scale road network, and reported experimental results validate the ability of our approach for extracting meaningful large-scale space-time traffic patterns. Furthermore, the derived clustering results provide an intuitive understanding of spatial-temporal characteristics of traffic flows in the large-scale network, and a basis for potential long-term forecasting.Comment: IET Intelligent Transport Systems (2013

    A Comparison of Different Topic Modeling Methods through a Real Case Study of Italian Customer Care

    Get PDF
    The paper deals with the analysis of conversation transcriptions between customers and agents in a call center of a customer care service. The objective is to support the analysis of text transcription of human-to-human conversations, to obtain reports on customer problems and complaints, and on the way an agent has solved them. The aim is to provide customer care service with a high level of efficiency and user satisfaction. To this aim, topic modeling is considered since it facilitates insightful analysis from large documents and datasets, such as a summarization of the main topics and topic characteristics. This paper presents a performance comparison of four topic modeling algorithms: (i) Latent Dirichlet Allocation (LDA); (ii) Non-negative Matrix Factorization (NMF); (iii) Neural-ProdLDA (Neural LDA) and Contextualized Topic Models (CTM). The comparison study is based on a database containing real conversation transcriptions in Italian Natural Language. Experimental results and different topic evaluation metrics are analyzed in this paper to determine the most suitable model for the case study. The gained knowledge can be exploited by practitioners to identify the optimal strategy and to perform and evaluate topic modeling on Italian natural language transcriptions of human-to-human conversations. This work can be an asset for grounding applications of topic modeling and can be inspiring for similar case studies in the domain of customer care quality

    Topic supervised non-negative matrix factorization

    Get PDF
    Topic models have been extensively used to organize and interpret the contents of large, unstructured corpora of text documents. Although topic models often perform well on traditional training vs. test set evaluations, it is often the case that the results of a topic model do not align with human interpretation. This interpretability fallacy is largely due to the unsupervised nature of topic models, which prohibits any user guidance on the results of a model. In this paper, we introduce a semi-supervised method called topic supervised non-negative matrix factorization (TS-NMF) that enables the user to provide labeled example documents to promote the discovery of more meaningful semantic structure of a corpus. In this way, the results of TS-NMF better match the intuition and desired labeling of the user. The core of TS-NMF relies on solving a non-convex optimization problem for which we derive an iterative algorithm that is shown to be monotonic and convergent to a local optimum. We demonstrate the practical utility of TS-NMF on the Reuters and PubMed corpora, and find that TS-NMF is especially useful for conceptual or broad topics, where topic key terms are not well understood. Although identifying an optimal latent structure for the data is not a primary objective of the proposed approach, we find that TS-NMF achieves higher weighted Jaccard similarity scores than the contemporary methods, (unsupervised) NMF and latent Dirichlet allocation, at supervision rates as low as 10% to 20%

    How Many Topics? Stability Analysis for Topic Models

    Full text link
    Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.Comment: Improve readability of plots. Add minor clarification
    corecore