Dirichlet belief networks for topic structure learning
Recently, considerable research effort has been devoted to developing deep
architectures for topic models to learn topic structures. Although several deep
models have been proposed to learn better topic proportions of documents, how
to leverage the benefits of deep structures for learning word distributions of
topics has not yet been rigorously studied. Here we propose a new multi-layer
generative process on word distributions of topics, where each layer consists
of a set of topics and each topic is drawn from a mixture of the topics of the
layer above. As the topics in all layers can be directly interpreted by words,
the proposed model is able to discover interpretable topic hierarchies. As a
self-contained module, our model can be flexibly adapted to different kinds of
topic models to improve their modelling accuracy and interpretability.
Extensive experiments on text corpora demonstrate the advantages of the
proposed model.
Comment: accepted in NIPS 201
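The layered generative process described in this abstract can be illustrated with a minimal sketch. It assumes a fixed concentration parameter, normalized Dirichlet mixing weights, and a symmetric prior on the top layer. These choices and the function names (`sample_dirichlet`, `sample_topic_hierarchy`) are illustrative assumptions, not the paper's exact construction.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_topic_hierarchy(vocab_size, topics_per_layer, concentration=50.0, seed=0):
    """Sample word distributions of topics layer by layer.

    Each topic below the top layer is drawn from a Dirichlet whose parameter
    is a mixture of the topics in the layer above (assumed simplification).
    """
    rng = random.Random(seed)
    # Top layer: symmetric Dirichlet prior over the vocabulary (assumption).
    top = [sample_dirichlet([1.0] * vocab_size, rng)
           for _ in range(topics_per_layer[0])]
    layers = [top]
    for n_topics in topics_per_layer[1:]:
        parents = layers[-1]
        layer = []
        for _ in range(n_topics):
            # Mixing weights over the parent topics (assumed Dirichlet(1)).
            weights = sample_dirichlet([1.0] * len(parents), rng)
            mixture = [sum(w * p[v] for w, p in zip(weights, parents))
                       for v in range(vocab_size)]
            # Floor the parameter so gammavariate always gets a positive shape.
            alpha = [max(concentration * m, 1e-6) for m in mixture]
            layer.append(sample_dirichlet(alpha, rng))
        layers.append(layer)
    return layers


if __name__ == "__main__":
    hierarchy = sample_topic_hierarchy(vocab_size=6, topics_per_layer=[2, 3, 4])
    for depth, layer in enumerate(hierarchy):
        print(depth, [[round(p, 2) for p in topic] for topic in layer])
```

Because every topic in every layer is a distribution over the same vocabulary, each one can be read off directly as a ranked word list, which is the interpretability property the abstract points to.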
A Spectral Algorithm for Latent Dirichlet Allocation
The problem of topic modeling can be seen as a generalization of the
clustering problem, in that it posits that observations are generated due to
multiple latent factors (e.g., the words in each document are generated as a
mixture of several active topics, as opposed to just one). This increased
representational power comes at the cost of a more challenging unsupervised
learning problem of estimating the topic probability vectors (the distributions
over words for each topic), when only the words are observed and the
corresponding topics are hidden.
We provide a simple and efficient learning procedure that is guaranteed to
recover the parameters for a wide class of mixture models, including the
popular latent Dirichlet allocation (LDA) model. For LDA, the procedure
correctly recovers both the topic probability vectors and the prior over the
topics, using only trigram statistics (i.e., third order moments, which may be
estimated with documents containing just three words). The method, termed
Excess Correlation Analysis (ECA), is based on a spectral decomposition of low
order moments (third and fourth order) via two singular value decompositions
(SVDs). Moreover, the algorithm is scalable since the SVD operations are
carried out on $k \times k$ matrices, where $k$ is the number of latent factors
(e.g. the number of topics), rather than in the $d$-dimensional observed space
(typically $d \gg k$).
Comment: Changed title to match conference version, which appears in Advances
in Neural Information Processing Systems 25, 201
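To make the phrase "trigram statistics" concrete, the following minimal sketch estimates the empirical pair and triple co-occurrence moments from the first three words of each document. It covers only the moment-estimation step, not the spectral decomposition itself. The function name `empirical_moments` and the sparse dictionary representation are assumptions made for illustration.

```python
from collections import Counter


def empirical_moments(documents):
    """Estimate pair and triple word moments from the first three words.

    Returns two sparse dictionaries: one mapping (w1, w2) to its relative
    frequency and one mapping (w1, w2, w3) to its relative frequency.
    Documents with fewer than three words are skipped.
    """
    pairs, triples = Counter(), Counter()
    used = 0
    for doc in documents:
        if len(doc) < 3:
            continue
        w1, w2, w3 = doc[:3]
        pairs[(w1, w2)] += 1
        triples[(w1, w2, w3)] += 1
        used += 1
    if used == 0:
        raise ValueError("need at least one document with three or more words")
    pair_moment = {key: count / used for key, count in pairs.items()}
    triple_moment = {key: count / used for key, count in triples.items()}
    return pair_moment, triple_moment


if __name__ == "__main__":
    docs = [["oil", "price", "rise"], ["oil", "price", "fall"], ["short"]]
    print(empirical_moments(docs))
```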
The Discrete Infinite Logistic Normal Distribution
We present the discrete infinite logistic normal distribution (DILN), a
Bayesian nonparametric prior for mixed membership models. DILN is a
generalization of the hierarchical Dirichlet process (HDP) that models
correlation structure between the weights of the atoms at the group level. We
derive a representation of DILN as a normalized collection of gamma-distributed
random variables, and study its statistical properties. We consider
applications to topic modeling and derive a variational inference algorithm for
approximate posterior inference. We study the empirical performance of the DILN
topic model on four corpora, comparing performance with the HDP and the
correlated topic model (CTM). To deal with large-scale data sets, we also
develop an online inference algorithm for DILN and compare with online HDP and
online LDA on a corpus from Nature magazine, which contains approximately 350,000
articles.
Comment: This paper will appear in Bayesian Analysis. A shorter version of
this paper appeared at AISTATS 2011, Fort Lauderdale, FL, US
A Case Study on the Effect of News on Crude Oil Price
Crude oil price volatility has an impact on the global economy and oil-dependent industries
and is influenced by supply and demand, geopolitical tensions, and the global
economy. Every day, a massive amount of textual information flows in the form of news
articles, which humans use to forecast future trends. News articles can have a significant
impact on the price of crude oil because they contain information about recent events,
trends, and advancements in the industry.
The purpose of this work is to investigate how news articles may affect crude oil prices,
using topic modeling to handle the textual data. The news data were collected by web
scraping and form a large dataset of articles about the crude oil industry, published
between January 1 and December 31, 2022, by four different sources. Crude oil price data
for the same period were compiled from Exchange Rates UK. After cleaning, the dataset
contained a total of 1532 news articles.
The Latent Dirichlet Allocation (LDA) technique is suggested for extracting relevant
keywords from news articles and then using the findings as input features to forecast the
crude oil price.
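One way topic-model output could be turned into forecasting features is sketched below. It assumes each article already has a vector of topic proportions (for example from a fitted LDA model) and that features are averaged per publication date and paired with that day's price. The aggregation scheme and the function names (`daily_topic_features`, `build_training_rows`) are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def daily_topic_features(articles, n_topics):
    """Average per-article topic proportions by publication date.

    `articles` is an iterable of (date, proportions) pairs, where
    `proportions` is a list of length `n_topics`.
    """
    sums = defaultdict(lambda: [0.0] * n_topics)
    counts = defaultdict(int)
    for date, proportions in articles:
        if len(proportions) != n_topics:
            raise ValueError("each article needs one proportion per topic")
        for k, p in enumerate(proportions):
            sums[date][k] += p
        counts[date] += 1
    return {date: [s / counts[date] for s in sums[date]] for date in sums}


def build_training_rows(features, prices):
    """Pair each day's topic features with that day's price.

    Days missing from either input are skipped (assumed handling).
    """
    return [(features[d], prices[d]) for d in sorted(features) if d in prices]


if __name__ == "__main__":
    # Toy values for illustration only; not data from the study.
    articles = [
        ("2022-01-03", [0.25, 0.75]),
        ("2022-01-03", [0.75, 0.25]),
        ("2022-01-04", [1.0, 0.0]),
    ]
    prices = {"2022-01-03": 10.0, "2022-01-04": 11.0}
    print(build_training_rows(daily_topic_features(articles, 2), prices))
```

The resulting rows could then be passed to any regressor, such as the Ridge, Random Forest, or XGBoost models the abstract mentions.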
The forecasting methods employed in the study were the Ridge model, the Random
Forest and XGBoost techniques, and the time series method ARIMAX. The outcomes of
the experiment indicate that the association between the meaning of the news articles and
the crude oil price is not sufficiently strong.
It is additionally concluded that the XGBoost algorithm shows superior predictive
performance on the training set. As a result, XGBoost models for each month of 2022
were developed to investigate the impact of features and determine the most important
ones for the problem.