904 research outputs found

    Dirichlet belief networks for topic structure learning

    Recently, considerable research effort has been devoted to developing deep architectures for topic models to learn topic structures. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics of the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on text corpora demonstrate the advantages of the proposed model. Comment: accepted in NIPS 201
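The multi-layer generative process described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the Dirichlet prior on the mixing weights, and the hyperparameters `eta`, `gamma`, and `concentration` are all assumptions chosen for illustration. It shows only the stated idea that each topic is drawn from a mixture of the topics in the layer above.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    # The tiny floor guards against numerical underflow to exactly zero.
    draws = [max(rng.gammavariate(a, 1.0), 1e-300) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_topic_hierarchy(vocab_size, layer_sizes, eta=0.5, gamma=1.0,
                           concentration=50.0, seed=0):
    """Sample word distributions of topics, layer by layer, from the top down.

    layer_sizes[0] is the number of topics in the top layer. All
    hyperparameters are hypothetical and not taken from the paper.
    """
    rng = random.Random(seed)
    # Top layer: topics drawn from a symmetric Dirichlet over the vocabulary.
    layers = [[sample_dirichlet([eta] * vocab_size, rng)
               for _ in range(layer_sizes[0])]]
    for size in layer_sizes[1:]:
        parents = layers[-1]
        children = []
        for _ in range(size):
            # Assumed prior: mixing weights over the parent topics.
            weights = sample_dirichlet([gamma] * len(parents), rng)
            # Mixture of the parent topics' word distributions.
            mixture = [sum(w * p[v] for w, p in zip(weights, parents))
                       for v in range(vocab_size)]
            # The child topic is drawn from a Dirichlet centred on that mixture.
            alpha = [concentration * m + 1e-6 for m in mixture]
            children.append(sample_dirichlet(alpha, rng))
        layers.append(children)
    return layers


layers = sample_topic_hierarchy(vocab_size=6, layer_sizes=[2, 3, 4])
```

Because every topic in every layer is a probability vector over the same vocabulary, each one can be inspected directly through its highest-probability words, which is the interpretability property the abstract emphasizes.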

    A Spectral Algorithm for Latent Dirichlet Allocation

    The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on $k \times k$ matrices, where $k$ is the number of latent factors (e.g. the number of topics), rather than in the $d$-dimensional observed space (typically $d \gg k$). Comment: Changed title to match conference version, which appears in Advances in Neural Information Processing Systems 25, 201

    The Discrete Infinite Logistic Normal Distribution

    We present the discrete infinite logistic normal distribution (DILN), a Bayesian nonparametric prior for mixed membership models. DILN is a generalization of the hierarchical Dirichlet process (HDP) that models correlation structure between the weights of the atoms at the group level. We derive a representation of DILN as a normalized collection of gamma-distributed random variables, and study its statistical properties. We consider applications to topic modeling and derive a variational inference algorithm for approximate posterior inference. We study the empirical performance of the DILN topic model on four corpora, comparing performance with the HDP and the correlated topic model (CTM). To deal with large-scale data sets, we also develop an online inference algorithm for DILN and compare with online HDP and online LDA on the Nature magazine corpus, which contains approximately 350,000 articles. Comment: This paper will appear in Bayesian Analysis. A shorter version of this paper appeared at AISTATS 2011, Fort Lauderdale, FL, US
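The idea of group-level weights as a normalized collection of gamma-distributed variables can be sketched with a finite truncation. This is a rough illustration under stated assumptions, not the paper's construction: the truncation to four atoms, the function name, the one-factor Gaussian used to correlate the log scales, and all numeric values are hypothetical.

```python
import math
import random


def sample_group_weights(base_weights, beta, log_scales, rng):
    """Normalize independent gamma draws into one group's atom weights.

    Atom i gets a Gamma draw with shape beta * base_weights[i] and scale
    exp(log_scales[i]). With all log scales equal to zero this reduces to
    a Dirichlet(beta * base_weights) draw.
    """
    gammas = [rng.gammavariate(beta * p, math.exp(w))
              for p, w in zip(base_weights, log_scales)]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
base_weights = [0.4, 0.3, 0.2, 0.1]   # hypothetical top-level atom weights
loadings = [1.0, 1.0, -1.0, -1.0]     # hypothetical one-factor correlation structure

# Correlated Gaussian log scales: a shared factor plus independent noise.
shared = rng.gauss(0.0, 1.0)
log_scales = [a * shared + 0.1 * rng.gauss(0.0, 1.0) for a in loadings]

weights = sample_group_weights(base_weights, beta=5.0,
                               log_scales=log_scales, rng=rng)
```

Atoms with the same sign of loading are scaled up or down together within a group, so their weights tend to rise and fall jointly across groups. A plain Dirichlet draw cannot express that kind of positive correlation between atom weights.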

    A CASE STUDY ON THE EFFECT OF NEWS ON CRUDE OIL PRICE

    Crude oil price volatility has an impact on the global economy and oil-dependent industries and is influenced by supply and demand, geopolitical tensions, and the global economy. Every day, a massive amount of textual information flows in the form of news articles, which humans use to forecast future trends. News articles can have a significant impact on the price of crude oil because they contain information about recent events, trends, and advancements in the industry. The purpose of this work is to investigate how news articles may affect crude oil prices, using the concept of topic modeling and its potential for handling data. The data for the study were collected by web scraping and form a large dataset of news articles about the crude oil industry. These news articles were published between January 1 and December 31, 2022, and come from four different sources. Data from the source Exchange Rates UK were compiled to show how the price of crude oil fluctuated during this period. After the cleaning process was completed, the dataset contained a total of 1532 news articles. The Latent Dirichlet Allocation (LDA) technique is suggested for extracting relevant keywords from news articles and then using the findings as input features to forecast the crude oil price. The forecasting methods employed in the study were the Ridge model, the Random Forest and XGBoost techniques, and the time series method ARIMAX. The outcomes of the experiment indicate that the association between the meaning of the news articles and the crude oil price is not sufficiently strong. It is additionally concluded that the XGBoost algorithm reveals superior predictive performance in the training set. 
As a result, XGBoost models for each month of 2022 were developed to investigate the impact of features and determine the most important ones for the problem.