Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into
probabilistic topic models. The main idea is based on the assumption that the
frequencies of semantically related words and phrases that co-occur in the same
texts should be enhanced, increasing their contribution to the topics found in
those texts. We have conducted experiments with several thesauri and found that
domain-specific knowledge is most useful for improving topic models. If a
general thesaurus, such as WordNet, is used, the thesaurus-based improvement of
topic models can be achieved by excluding hyponymy relations from the combined
topic models.
Comment: Accepted to the AIST-2017 conference (http://aistconf.ru/). The final
publication will be available at link.springer.co
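The core idea of boosting the counts of thesaurus-related terms that co-occur in a document, before fitting a topic model, can be sketched in a few lines. This is an illustrative reconstruction under assumed details (the toy thesaurus, documents, and the `boost` size are all hypothetical), not the authors' exact procedure:

```python
import numpy as np

# Hypothetical thesaurus: pairs of semantically related terms.
related = {("car", "vehicle"), ("road", "highway")}

docs = [
    ["car", "vehicle", "road", "trip"],
    ["highway", "road", "car", "speed"],
]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# Plain document-term count matrix.
counts = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc:
        counts[d, idx[w]] += 1

# Enhance the frequencies of related terms that co-occur in the same
# document, so they contribute more weight to the topics discovered
# when a topic model is later fit on `counts`.
boost = 1.0  # illustrative boost size, not from the paper
for d, doc in enumerate(docs):
    present = set(doc)
    for a, b in related:
        if a in present and b in present:
            counts[d, idx[a]] += boost
            counts[d, idx[b]] += boost
```

Any standard topic model (e.g. LDA on the boosted count matrix) can then be trained as usual; only the input frequencies change.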
Latent Dirichlet Allocation Uncovers Spectral Characteristics of Drought Stressed Plants
Understanding how plants adapt to drought stress is essential for improving
management practices and breeding strategies, as well as for engineering viable
crops for sustainable agriculture in the coming decades. Hyper-spectral imaging
provides a particularly promising approach to gaining such understanding, since
it allows the non-destructive discovery of spectral characteristics of plants
governed primarily by the scattering and absorption characteristics of the leaf
internal structure and biochemical constituents. Several drought stress indices
have been derived using hyper-spectral imaging. However, they are typically
based on only a few hyper-spectral images, rely on the interpretations of
experts, and consider only a few wavelengths. In this study,
we present the first data-driven approach to discovering spectral drought
stress indices, treating it as an unsupervised labeling problem at massive
scale. To make use of short range dependencies of spectral wavelengths, we
develop an online variational Bayes algorithm for latent Dirichlet allocation
with convolved Dirichlet regularizer. This approach scales to massive datasets
and, hence, provides a more objective complement to plant physiological
practices. The spectral topics found conform to plant physiological knowledge
and can be computed in a fraction of the time compared to existing LDA
approaches.
Comment: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty
in Artificial Intelligence (UAI2012)
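The "convolved Dirichlet regularizer" couples the topic-word prior across neighboring wavelength bins so that short-range spectral dependencies are respected. A minimal sketch of that idea, assuming a simple smoothing kernel (the specific `eta` values and kernel weights are illustrative, not from the paper):

```python
import numpy as np

def convolved_prior(eta, kernel):
    """Smooth a Dirichlet parameter vector across neighboring
    wavelength bins with a convolution kernel (illustrative only)."""
    return np.convolve(eta, kernel, mode="same")

# Base prior concentrated on a single wavelength bin.
eta = np.array([0.01, 0.01, 5.0, 0.01, 0.01])

# Short-range dependency: each bin shares prior mass with neighbors.
kernel = np.array([0.25, 0.5, 0.25])

smoothed = convolved_prior(eta, kernel)
# Prior mass on the peak bin spills onto the adjacent bins, encouraging
# topics whose weight varies smoothly across nearby wavelengths.
```

In the paper this regularization is embedded inside an online variational Bayes update for LDA; the sketch above shows only the smoothing step itself.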
Evidence Gap Maps as Critical Information Communication Devices for Evidence-based Public Policy
The public policy cycle increasingly requires the use of evidence by policy
makers. Evidence Gap Maps (EGMs) are a relatively new methodology that helps
identify, process, and visualize the vast amounts of studies representing a
rich source of evidence for better policy making. This document performs a
methodological review of EGMs and presents the development of a working
integrated system that automates several critical steps of EGM creation by
means of applied computational and statistical methods. Above all, the proposed
system encompasses all major steps of EGM creation in one place, namely
inclusion criteria determination, processing of information, analysis, and
user-friendly communication of synthesized relevant evidence. This tool
represents a critical milestone in the efforts of implementing cutting-edge
computational methods in usable systems. The contribution of the document is
two-fold. First, it presents the critical importance of EGMs in the public
policy cycle; second, it justifies and explains the development of a usable
tool that encompasses the methodological phases of creation of EGMs, while
automating the most time-consuming stages of the process. The overarching goal
is better and faster communication of information to relevant actors such as
policy makers, thus promoting well-being through better and more efficient
interventions grounded in evidence-driven policy making.
Comment: Working paper
Multidimensional Membership Mixture Models
We present the multidimensional membership mixture (M3) models where every
dimension of the membership represents an independent mixture model and each
data point is generated from the selected mixture components jointly. This is
helpful when the data has a certain shared structure. For example, three unique
means and three unique variances can effectively form a Gaussian mixture model
with nine components, while requiring only six parameters to fully describe it.
In this paper, we present three instantiations of M3 models (together with the
learning and inference algorithms): infinite, finite, and hybrid, depending on
whether the number of mixtures is fixed or not. They are built upon Dirichlet
process mixture models, latent Dirichlet allocation, and a combination
respectively. We then consider two applications: topic modeling and learning 3D
object arrangements. Our experiments show that our M3 models achieve better
performance using fewer topics than many classic topic models. We also observe
that topics from the different dimensions of M3 models are meaningful and
orthogonal to each other.
Comment: 9 pages, 7 figures
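The abstract's counting example (three unique means and three unique variances crossing to form a nine-component Gaussian mixture from six parameters) can be made concrete. A minimal sketch with hypothetical parameter values:

```python
import itertools
import numpy as np

# Three shared means and three shared variances: 6 parameters in total.
means = np.array([-5.0, 0.0, 5.0])
variances = np.array([0.5, 1.0, 2.0])

# Each mixture component independently selects one mean and one
# variance, so crossing the two dimensions yields 3 x 3 = 9 components.
components = list(itertools.product(means, variances))

n_params = means.size + variances.size   # 6 parameters
n_components = len(components)           # 9 components
```

A conventional nine-component Gaussian mixture would need a separate mean and variance per component (18 parameters); the shared structure is what the multidimensional membership exploits.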
TOPIC MODELLING METHODOLOGY: ITS USE IN INFORMATION SYSTEMS AND OTHER MANAGERIAL DISCIPLINES
Over the last decade, quantitative text mining approaches to content analysis have gained increasing traction within information systems research and related fields, such as business administration. Recently, topic models, which are supposed to provide their user with an overview of themes being discussed in documents, have gained popularity. However, while convenient tools for the creation of this model class exist, the evaluation of topic models poses significant challenges to their users. In this research, we investigate how questions of model validity and trustworthiness of presented analyses are addressed across disciplines. We accomplish this by providing a structured review of methodological approaches across the Financial Times 50 journal ranking. We identify 59 methodological research papers, 24 implementations of topic models, as well as 33 research papers using topic models in Information Systems (IS) research, and 29 papers using such models in other managerial disciplines. Results indicate a need for model implementations usable by a wider audience, a need for more implementations of model validation techniques, and a need for a discussion about the theoretical foundations of topic-modelling-based research.
Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability
Understanding the shopping motivations behind market baskets has high
commercial value in the grocery retail industry. Analyzing shopping
transactions demands techniques that can cope with the volume and
dimensionality of grocery transactional data while keeping interpretable
outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to
process grocery transactions and to discover a broad representation of
customers' shopping motivations. However, summarizing the posterior
distribution of an LDA model is challenging: individual LDA draws may not be
coherent and cannot capture topic uncertainty. Moreover, the evaluation of
LDA models is dominated by model-fit measures which may not adequately capture
the qualitative aspects such as interpretability and stability of topics.
In this paper, we introduce a clustering methodology that post-processes
posterior LDA draws to summarize the entire posterior distribution and identify
semantic modes represented as recurrent topics. Our approach is an alternative
to standard label-switching techniques and provides a single posterior summary
set of topics, as well as associated measures of uncertainty. Furthermore, we
establish a more holistic definition for model evaluation, which assesses topic
models based not only on their likelihood but also on their coherence,
distinctiveness and stability. By means of a survey, we set thresholds for the
interpretation of topic coherence and topic similarity in the domain of grocery
retail data. We demonstrate that the selection of recurrent topics through our
clustering methodology improves not only model likelihood but also the
qualitative aspects of LDA such as interpretability and stability. We
illustrate our methods on an example from a large UK supermarket chain.
Comment: 20 pages, 9 figures
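The idea of identifying recurrent topics across posterior LDA draws can be sketched with a simple greedy clustering over topic-word vectors. This is an illustrative sketch under assumed details (cosine similarity, the `threshold` value, and the toy draws are all hypothetical), not the paper's exact methodology:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic-word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recurrent_topics(draws, threshold=0.9):
    """Greedily cluster topic vectors pooled from multiple posterior
    draws; return (centroid, recurrence count) per cluster. Topics
    recurring across many draws form the posterior summary."""
    clusters = []  # list of (running_sum, count)
    for draw in draws:
        for topic in draw:
            for i, (s, n) in enumerate(clusters):
                if cosine(s / n, topic) >= threshold:
                    clusters[i] = (s + topic, n + 1)
                    break
            else:
                clusters.append((topic.copy(), 1))
    return [(s / n, n) for s, n in clusters]

# Two toy "posterior draws", each with two topics over a 3-word vocabulary.
draw1 = [np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.8, 0.1])]
draw2 = [np.array([0.75, 0.15, 0.1]), np.array([0.1, 0.1, 0.8])]

summary = recurrent_topics([draw1, draw2])
# The first topic of each draw is similar, so it merges into one
# recurrent topic; the remaining topics stay as singleton clusters.
```

The recurrence counts give a natural measure of topic stability: a topic appearing in most draws is a robust semantic mode, while a singleton is likely noise from an individual draw.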