
    Latent Dirichlet Markov allocation for sentiment analysis

    In recent years, probabilistic topic models have gained tremendous attention in data mining and natural language processing research. In the field of information retrieval for text mining, a variety of probabilistic topic models have been used to analyse the content of documents. A topic model is a generative model for documents: it specifies a probabilistic procedure by which documents can be generated. All topic models share the idea that documents are mixtures of topics, where a topic is a probability distribution over words. In this paper we describe the Latent Dirichlet Markov Allocation (LDMA) model, a new generative probabilistic topic model based on Latent Dirichlet Allocation (LDA) and the Hidden Markov Model (HMM), which emphasizes extracting multi-word topics from text data. LDMA is a four-level hierarchical Bayesian model in which topics are associated with documents, words are associated with topics, and topics can be represented by single- or multi-word terms. To evaluate the performance of LDMA, we report results on aspect detection in sentiment analysis, compared with the basic LDA model.
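The shared generative idea — draw a topic from the document's mixture, then a word from that topic — can be sketched directly. The vocabulary and probabilities below are invented toy values (plain LDA-style generation, without LDMA's Markov extension for multi-word terms):

```python
import random

def generate_document(topic_word_probs, doc_topic_probs, length, rng):
    """Generate one document: for each word position, draw a topic from the
    document's topic mixture, then draw a word from that topic's distribution."""
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(doc_topic_probs)), weights=doc_topic_probs)[0]
        vocab, probs = zip(*topic_word_probs[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return words

# Two toy topics over a tiny vocabulary (illustrative values only).
topics = [
    {"price": 0.5, "cheap": 0.4, "screen": 0.1},   # a "cost" topic
    {"screen": 0.5, "bright": 0.4, "price": 0.1},  # a "display" topic
]
rng = random.Random(0)
doc = generate_document(topics, doc_topic_probs=[0.7, 0.3], length=8, rng=rng)
print(doc)
```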

    A Spectral Algorithm for Latent Dirichlet Allocation

    The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third-order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low-order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVD operations are carried out on k × k matrices, where k is the number of latent factors (e.g., the number of topics), rather than in the d-dimensional observed space (typically d ≫ k). Comment: Changed title to match conference version, which appears in Advances in Neural Information Processing Systems 25, 201
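The scale argument in the abstract — SVDs on k × k rather than d × d matrices — rests on a whitening step. The toy sketch below (with a made-up low-rank moment matrix, not the paper's actual moment estimators) shows how a rank-k second-order moment yields a d × k whitening matrix W such that W.T @ M @ W is the k × k identity, so all further decompositions stay k-dimensional:

```python
import numpy as np

# Toy second-order moment (word co-occurrence) matrix for a d-word vocabulary.
# In ECA, higher-order moments are whitened with such a matrix so that the
# remaining SVDs act on k x k matrices, not d x d. Values are illustrative.
d, k = 6, 2
gen = np.random.default_rng(0)
A = gen.random((d, k))                       # toy topic-word directions
pairs = A @ np.diag([0.6, 0.4]) @ A.T        # symmetric rank-k moment matrix

# Whitening: top-k eigendecomposition gives W with W.T @ pairs @ W = I_k.
vals, vecs = np.linalg.eigh(pairs)           # eigenvalues in ascending order
top = np.argsort(vals)[-k:]                  # indices of the k largest
W = vecs[:, top] / np.sqrt(vals[top])        # d x k whitening matrix

whitened = W.T @ pairs @ W                   # k x k: further work stays small
print(np.round(whitened, 6))
```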

    The Ensemble MESH-Term Query Expansion Models Using Multiple LDA Topic Models and ANN Classifiers in Health Information Retrieval

    Information retrieval in the health field faces several challenges. Health information terminology is difficult for consumers (laypeople) to understand, and formulating a query with professional terms is not easy for them because health-related terms are more familiar to health professionals. If health terms related to a query were automatically added, it would help consumers find relevant information. The proposed query expansion (QE) models show how to expand a query using MeSH (Medical Subject Headings) terms. Documents were represented by the MeSH terms included in the full-text articles (i.e., Bag-of-MeSH), and these MeSH terms were then used to train LDA (Latent Dirichlet Allocation) topic models. A query and the top k retrieved documents were used to find MeSH terms as topic words related to the query. LDA topic words were filtered by 1) threshold values of topic probability (TP) and word probability (WP), or 2) an ANN (Artificial Neural Network) classifier. Threshold values were effective, in an LDA model with a specific number of topics, at increasing IR performance in terms of infAP (inferred Average Precision) and infNDCG (inferred Normalized Discounted Cumulative Gain), which are common IR metrics for large data collections with incomplete judgments. The top k words were chosen by a word score based on (TP × WP) and the retrieved-document ranking in an LDA model with specific thresholds. The QE model with specific thresholds for TP and WP improved mean infAP and infNDCG scores in an LDA model compared with the baseline result. However, the threshold values optimized for a particular LDA model did not perform well in other LDA models with different numbers of topics. An ANN classifier was employed to overcome this dependence of the QE model on LDA thresholds by automatically categorizing MeSH terms (positive/negative/neutral) for QE. ANN classifiers were trained on word features related to the LDA model and the collection.
Two types of QE models using an LDA model and an ANN classifier were proposed: 1) Word Score Weighting (WSW), where the probability of being a positive/negative/neutral word was used to weight the original word score, and 2) Positive Word Selection (PWS), where positive words were identified by the ANN classifier. Forty WSW models showed better average mean infAP and infNDCG scores than the PWS models when the top 7 words were selected for QE. Both approaches based on a binary ANN classifier were statistically significantly effective in increasing infAP and infNDCG compared with the scores of the baseline run. A 3-class classifier performed worse than the binary classifier. The proposed ensemble QE models integrated multiple ANN classifiers with multiple LDA models, combining multiple WSW/PWS models with one or more classifiers. Multiple classifiers were more effective than a single classifier in selecting relevant words for QE. In the ensemble QE (WSW/PWS) models, the top k words added to the original queries were effective in increasing infAP and infNDCG scores. The ensemble QE model (WSW) using three classifiers showed statistically significant improvements in the mean infAP and infNDCG scores for 30 queries when the top 3 words were added, and the ensemble QE model (PWS) using four classifiers showed statistically significant improvements for 30 queries in the mean infAP and infNDCG scores.
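The TP × WP filtering-and-scoring step can be sketched as follows; the MeSH terms, probabilities, and threshold values are invented for illustration and do not come from the study:

```python
def score_expansion_terms(topic_probs, topic_words, tp_min, wp_min, top_k):
    """Return the top-k candidate expansion terms whose topic passes the TP
    threshold and whose within-topic probability passes the WP threshold,
    ranked by the word score TP * WP."""
    scored = {}
    for topic, tp in topic_probs.items():
        if tp < tp_min:
            continue                      # topic too weakly related to query
        for word, wp in topic_words[topic].items():
            if wp < wp_min:
                continue                  # word too weakly tied to the topic
            scored[word] = max(scored.get(word, 0.0), tp * wp)
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Made-up topic mixture for a query and made-up MeSH-term distributions.
topic_probs = {0: 0.6, 1: 0.3, 2: 0.1}
topic_words = {
    0: {"Hypertension": 0.30, "Blood Pressure": 0.20, "Diet": 0.05},
    1: {"Diabetes Mellitus": 0.25, "Insulin": 0.15},
    2: {"Exercise": 0.40},
}
expansion = score_expansion_terms(topic_probs, topic_words,
                                  tp_min=0.2, wp_min=0.1, top_k=3)
print(expansion)  # → ['Hypertension', 'Blood Pressure', 'Diabetes Mellitus']
```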

    Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

    BACKGROUND: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model that represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. RESULTS: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs facilitated the production of hypotheses about the function and role of clk-2. CONCLUSION: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
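The pairwise similarity on the topic simplex invites a small worked example. The sketch below uses the Jensen-Shannon divergence between posterior topic mixtures as one natural simplex-based measure; this is an illustrative assumption, not necessarily the exact measure formulated in the paper, and the mixtures are made-up values:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions:
    a bounded, symmetric dissimilarity on the topic simplex."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2

query_doc = [0.70, 0.20, 0.10]   # posterior topic mixture of a "query" item
candidates = {
    "item_a": [0.65, 0.25, 0.10],  # similar mixture -> document "homolog"
    "item_b": [0.05, 0.15, 0.80],  # dissimilar mixture
}
# Rank candidate items by closeness to the query on the topic simplex.
ranked = sorted(candidates, key=lambda d: js_divergence(query_doc, candidates[d]))
print(ranked)  # → ['item_a', 'item_b']
```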

    On some provably correct cases of variational inference for topic models

    Variational inference is a very efficient and popular heuristic used in various forms in the context of latent variable models. It is closely related to Expectation Maximization (EM), and is applied when exact EM is computationally infeasible. Despite being immensely popular, the current theoretical understanding of the effectiveness of variational-inference-based algorithms is very limited. In this work we provide the first analysis of instances where variational inference algorithms converge to the global optimum, in the setting of topic models. More specifically, we show that variational inference provably learns the optimal parameters of a topic model under natural assumptions on the topic-word matrix and the topic priors. The properties that the topic-word matrix must satisfy in our setting are related to the topic expansion assumption introduced in (Anandkumar et al., 2013), as well as the anchor words assumption in (Arora et al., 2012c). The assumptions on the topic priors are related to the well-known Dirichlet prior, introduced to the area of topic modeling by (Blei et al., 2003). It is well known that initialization plays a crucial role in how well variational algorithms perform in practice. The initializations that we use are fairly natural. One of them is similar to what is currently used in LDA-c, the most popular implementation of variational inference for topic models. The other one is an overlapping clustering algorithm, inspired by work of (Arora et al., 2014) on dictionary learning, which is very simple and efficient. While our primary goal is to provide insights into when variational inference might work in practice, the multiplicative, rather than additive, nature of the variational inference updates forces us to use fairly non-standard proof arguments, which we believe will be of general interest. Comment: 46 pages. Compared to previous version: clarified notation, a number of typos fixed throughout paper

    Software Design and Development for Detecting Counseling Text Topics Using Gaussian Latent Dirichlet Allocation Modeling (Case Study: Riliv)

    Riliv is an online counseling platform that connects users who have personal problems with psychologists online. As a provider of counseling services, Riliv strives to prioritize service quality and customer satisfaction. Riliv faces a problem in which psychologists spend a lot of time at the start of a counseling session understanding the topic of the user's problem, causing users to wait longer for an initial response from the psychologist. This research therefore performs topic-modeling analysis on counseling sessions completed on Riliv to discover which topics are raised most often by its users. The topic analysis was carried out with the Gaussian Latent Dirichlet Allocation (Gaussian LDA) method and evaluated with a perplexity test to measure how well the model generalizes to documents. Topic coherence was measured with Pointwise Mutual Information (PMI) to analyze the degree of semantic similarity between the words within a topic. Topic modeling was also performed with standard Latent Dirichlet Allocation (LDA) as a point of comparison for the Gaussian LDA results, and the two models were compared by their PMI coherence scores. The word probabilities in each topic produced by the best Gaussian LDA and LDA models were then used to detect the topic probabilities of new counseling sessions. Based on the experimental results, the data scenario with stemming under the Gaussian LDA modeling method with 25 topics produced the model with the highest PMI coherence value, 3.0286.
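The PMI-based coherence test described above can be sketched with document-level co-occurrence counts. The tiny corpus, the add-one smoothing choice, and the averaging over word pairs are illustrative assumptions, not the thesis's exact estimator:

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, docs):
    """Average pointwise mutual information over all word pairs in a topic,
    using document-level co-occurrence with add-one smoothing. Higher values
    mean the topic's words tend to appear in the same documents."""
    doc_sets = [set(d) for d in docs]
    n = len(docs)
    def p(*words):
        hits = sum(1 for s in doc_sets if all(w in s for w in words))
        return (hits + 1) / (n + 1)          # smoothed probability estimate
    pairs = list(combinations(topic_words, 2))
    return sum(math.log(p(a, b) / (p(a) * p(b))) for a, b in pairs) / len(pairs)

# Made-up tokenized counseling snippets (illustrative, not Riliv data).
docs = [
    ["stres", "kerja", "lelah"],
    ["stres", "kerja", "cemas"],
    ["keluarga", "cemas"],
]
print(round(pmi_coherence(["stres", "kerja"], docs), 4))
```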