
    Dynamic hyperparameter optimization for Bayesian topical trend analysis

    This paper presents a new method for Bayesian topical trend analysis. We regard the parameters of the topic Dirichlet priors in latent Dirichlet allocation as a function of document timestamps and optimize them with a gradient-based algorithm. Since our method assigns similar hyperparameters to documents with similar timestamps, topic assignment in collapsed Gibbs sampling is affected by timestamp similarity. We compute TFIDF-based document similarities from the result of collapsed Gibbs sampling and evaluate our proposal on the link detection task of Topic Detection and Tracking. Proceedings of the 18th ACM conference, Hong Kong, China, 2009.11.02-2009.11.06.
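
    The abstract does not spell out the exact weighting scheme, so the following is a minimal Python sketch of one plausible reading: per-document topic counts taken from a collapsed Gibbs sampling state are weighted TFIDF-style and compared by cosine similarity. The function name and the use of topic counts in place of raw terms are illustrative assumptions, not the paper's implementation.

        import numpy as np

        def topic_tfidf_similarity(doc_topic_counts):
            # doc_topic_counts: (D, K) topic assignment counts per document,
            # e.g. read off a collapsed Gibbs sampling state (assumed input).
            counts = np.asarray(doc_topic_counts, dtype=float)
            tf = counts / counts.sum(axis=1, keepdims=True)    # "term" frequency over topics
            df = (counts > 0).sum(axis=0)                      # documents using each topic
            idf = np.log(counts.shape[0] / np.maximum(df, 1))  # inverse document frequency
            w = tf * idf                                       # TFIDF-style weights
            w /= np.maximum(np.linalg.norm(w, axis=1, keepdims=True), 1e-12)
            return w @ w.T                                     # (D, D) cosine similarities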

    A network approach to topic models

    One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success, in particular that of the most widely used variant called Latent Dirichlet Allocation (LDA), and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods, using a stochastic block model (SBM) with non-parametric priors, we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and the documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields. Comment: 22 pages, 10 figures; code available at https://topsbm.github.io
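
    As a concrete illustration of the bipartite representation described above, the sketch below builds a document-word network with networkx. It only constructs the graph; the hierarchical-SBM inference itself is done by the code linked in the abstract (https://topsbm.github.io) and is not reproduced here.

        import networkx as nx
        from collections import Counter

        def corpus_to_bipartite(docs):
            # docs: list of token lists. Returns a bipartite graph with one
            # node set for documents and one for words, edges weighted by
            # within-document word counts.
            G = nx.Graph()
            for d, tokens in enumerate(docs):
                doc = ("doc", d)
                G.add_node(doc, bipartite=0)
                for word, count in Counter(tokens).items():
                    G.add_node(("word", word), bipartite=1)
                    G.add_edge(doc, ("word", word), weight=count)
            return G

        G = corpus_to_bipartite([["topic", "model", "topic"], ["network", "model"]])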

    Syntactic Topic Models

    The syntactic topic model (STM) is a Bayesian nonparametric model of language that discovers latent distributions of words (topics) that are both semantically and syntactically coherent. The STM models dependency parsed corpora where sentences are grouped into documents. It assumes that each word is drawn from a latent topic chosen by combining document-level features and the local syntactic context. Each document has a distribution over latent topics, as in topic models, which provides the semantic consistency. Each element in the dependency parse tree also has a distribution over the topics of its children, as in latent-state syntax models, which provides the syntactic consistency. These distributions are convolved so that the topic of each word is likely under both its document and syntactic context. We derive a fast posterior inference algorithm based on variational methods. We report qualitative and quantitative studies on both synthetic data and hand-parsed documents. We show that the STM is a more predictive model of language than current models based only on syntax or only on topics.
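
    One way to read the "convolved" distributions above is as a normalized product: the topic of word n must score well under both its document's topic proportions and its parse-tree parent's distribution over children's topics. The notation below is chosen here for illustration (θ_d for document proportions, π_{pa(n)} for the parent's distribution), not taken from the paper:

        p(z_n = k \mid \theta_d, \pi) \;\propto\; \theta_{d,k} \cdot \pi_{\mathrm{pa}(n),k},
        \qquad \sum_{k=1}^{K} p(z_n = k \mid \theta_d, \pi) = 1.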

    Quantifying Priorities in Business Cycle Reports: Analysis of Recurring Textual Patterns around Peaks and Troughs

    I propose a novel approach to uncover business cycle reports’ priorities and relate them to economic fluctuations. To this end, I leverage quantitative business-cycle forecasts published by leading German economic research institutes since 1970 to estimate the proportions of latent topics in the associated business cycle reports. I then employ a supervised approach to aggregate topics with similar themes, thus revealing the proportions of broader macroeconomic subjects. I obtain measures of forecasters’ subject priorities by extracting the subject proportions’ cyclic components. Correlating these priorities with key macroeconomic variables reveals consistent priority patterns around economic peaks and troughs. The forecasters prioritize inflation-related matters over recession-related considerations around peaks. This finding suggests that forecasters underestimate growth risks and overestimate inflation risks during contractionary monetary policy, which might explain their failure to predict recessions. Around troughs, forecasters prioritize investment matters, potentially suggesting a better understanding of macroeconomic developments during those periods than around peaks.
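
    The abstract does not name the detrending method used to extract cyclic components, so the sketch below uses the Hodrick-Prescott filter from statsmodels as one standard choice: topic proportions are aggregated into a subject series, the HP filter separates its cycle from its trend, and the cycle can then be correlated with a macro variable. All names and data here are illustrative stand-ins.

        import numpy as np
        from statsmodels.tsa.filters.hp_filter import hpfilter

        def subject_priority(topic_props, topic_ids, lamb=1600):
            # topic_props: (T, K) estimated topic proportions per report date;
            # topic_ids: indices of the topics aggregated into one subject.
            subject = topic_props[:, topic_ids].sum(axis=1)  # subject proportion series
            cycle, trend = hpfilter(subject, lamb=lamb)      # HP decomposition
            return cycle                                     # cyclic "priority" component

        rng = np.random.default_rng(0)
        props = rng.dirichlet(np.ones(20), size=200)  # stand-in for estimated proportions
        priority = subject_priority(props, [0, 3, 7])
        # corr = np.corrcoef(priority, gdp_growth)[0, 1]  # compare with a macro series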

    Topic Uncovering and Image Annotation via Scalable Probit Normal Correlated Topic Models

    Uncovering latent topics has been an active research area for more than a decade and continues to receive contributions from many disciplines, including computer science, information science, and statistics. Since the introduction of Latent Dirichlet Allocation in 2003, many intriguing extension models have been proposed. One such extension is the logistic normal correlated topic model, which not only uncovers the hidden topics of a document but also extracts meaningful topical relationships among a large number of topics. In this model, the logistic normal distribution is adapted, via a transformation of multivariate Gaussian variables, to model the topical distribution of documents in the presence of correlations among topics. In this thesis, we propose a probit normal alternative for modelling correlated topical structures. Our use of the probit model in the context of topic discovery is novel, as authors have so far concentrated solely on the logistic model, partly due to the formidable inefficiency of the multinomial probit model even for very small topical spaces. We circumvent this inefficiency by adapting the Diagonal Orthant Multinomial Probit (DO-Probit) to the topic models context, enabling our topic modelling scheme to handle corpora with a large number of latent topics. In addition, we extend our model to the context of image annotation by developing an efficient collapsed Gibbs sampling scheme. Furthermore, we employ various high-performance computing techniques, such as memory-aware MapReduce, a SparseLDA implementation, vectorization, and block sampling, as well as numerical efficiency strategies, to allow fast and efficient sampling.
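
    For reference, the logistic-normal construction that the thesis starts from can be sketched in a few lines: topic correlations enter through the covariance of a Gaussian draw, which is then mapped to the simplex by a softmax. The probit-normal/DO-Probit alternative proposed in the thesis replaces this softmax mapping with probit-style latent variables and is not reproduced here; the covariance values below are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        K = 5                                            # number of topics (illustrative)
        mu = np.zeros(K)
        Sigma = 0.5 * np.eye(K) + 0.5 * np.ones((K, K))  # correlated topics
        eta = rng.multivariate_normal(mu, Sigma)         # Gaussian topic scores
        theta = np.exp(eta - eta.max())                  # numerically stable softmax
        theta /= theta.sum()                             # logistic-normal topic proportions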

    Modeling topical trends over continuous time with priors

    In this paper, we propose a new method for topical trend analysis. We model topical trends by per-topic Beta distributions, as in Topics over Time (TOT), proposed as an extension of latent Dirichlet allocation (LDA). However, TOT is likely to overfit the timestamp data when extracting latent topics. Therefore, we apply prior distributions to the Beta distributions in TOT. Since the Beta distribution has no conjugate prior, we devise a trick: based on a Bernoulli trial, we set one of the two parameters of each per-topic Beta distribution to one and apply a Gamma distribution as a conjugate prior to the other. Consequently, we can marginalize out the parameters of the Beta distributions and thus treat timestamp data in a Bayesian fashion. In the evaluation experiment, we compare our method with LDA and TOT on the link detection task of the TDT4 dataset. We use word predictive probabilities as term weights and estimate document similarities with those weights in a TFIDF-like scheme. The results show that our method achieves a moderate fit to the timestamp data. Advances in Neural Networks - ISNN 2010: 7th International Symposium on Neural Networks, ISNN 2010, Shanghai, China, June 6-9, 2010, Proceedings, Part II. The original publication is available at www.springerlink.com
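
    The stated trick can be made explicit. Fixing one Beta shape parameter to one collapses the timestamp density to a single-parameter form for which the Gamma prior is conjugate; the notation below is chosen here for illustration, with timestamps t ∈ (0, 1) as in TOT:

        p(t \mid \lambda) = \mathrm{Beta}(t;\, \lambda, 1) = \lambda\, t^{\lambda - 1},
        \qquad \lambda \sim \mathrm{Gamma}(a, b).

        p(\lambda \mid t_{1:n})
        \;\propto\; \lambda^{a-1} e^{-b\lambda} \prod_{i=1}^{n} \lambda\, t_i^{\lambda - 1}
        \;\propto\; \lambda^{a+n-1}
        \exp\!\Big(-\lambda \Big(b - \sum_{i=1}^{n} \log t_i\Big)\Big),

    so λ | t ~ Gamma(a + n, b − Σ_i log t_i), a proper Gamma posterior since log t_i ≤ 0 keeps the rate positive. The symmetric choice Beta(t; 1, λ) = λ(1 − t)^{λ−1} works the same way with log(1 − t_i), which is presumably what the Bernoulli trial selects between.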