Bayesian Mixture Models for Frequent Itemset Discovery
In binary-transaction data-mining, traditional frequent itemset mining often
produces results which are not straightforward to interpret. To overcome this
problem, probability models are often used to produce more compact and
conclusive results, albeit with some loss of accuracy. Bayesian statistics have
been widely used in the development of probability models in machine learning
in recent years, and these methods have many advantages, including their
ability to avoid overfitting. In this paper, we develop two Bayesian mixture
models with the Dirichlet distribution prior and the Dirichlet process (DP)
prior to improve the previous non-Bayesian mixture model developed for
transaction dataset mining. We implement the inference of both mixture models
using two methods: a collapsed Gibbs sampling scheme and a variational
approximation algorithm. Experiments in several benchmark problems have shown
that both mixture models achieve better performance than a non-Bayesian mixture
model. The variational algorithm is the faster of the two approaches, while the
Gibbs sampling method achieves more accurate results. The Dirichlet process
mixture model can automatically grow to an appropriate complexity for a better
approximation. Once the model is built, it is very fast to query and run
analysis on (typically 10 times faster than Eclat, as we show in the
experiment section). However, these approaches also show that mixture models
underestimate the probabilities of frequent itemsets. Consequently, these
models have a higher sensitivity but a lower specificity.
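As a concrete illustration, the sketch below implements collapsed Gibbs sampling for a finite Bernoulli mixture over binary transactions, with a symmetric Dirichlet(alpha) prior on the mixing weights and Beta(a, b) priors on the per-component item probabilities. It is a minimal stand-in for the paper's Dirichlet-prior model, not its exact implementation; all names and hyperparameters are illustrative.

```python
import numpy as np

def collapsed_gibbs_bernoulli_mixture(X, K=10, alpha=1.0, a=1.0, b=1.0,
                                      n_iters=200, seed=0):
    """Collapsed Gibbs sampler for a finite Bernoulli mixture with a
    symmetric Dirichlet(alpha) prior on the weights and Beta(a, b)
    priors on per-component item probabilities. X is (N, D) binary."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    z = rng.integers(K, size=N)                  # component assignments
    Nk = np.bincount(z, minlength=K).astype(float)
    Sk = np.zeros((K, D))                        # per-component 1-counts
    for n in range(N):
        Sk[z[n]] += X[n]
    for _ in range(n_iters):
        for n in range(N):
            k_old = z[n]
            Nk[k_old] -= 1
            Sk[k_old] -= X[n]
            # collapsed predictive probability of item d = 1 per component
            p1 = (Sk + a) / (Nk[:, None] + a + b)
            loglik = X[n] @ np.log(p1).T + (1 - X[n]) @ np.log(1 - p1).T
            logp = np.log(Nk + alpha / K) + loglik
            p = np.exp(logp - logp.max())
            k_new = rng.choice(K, p=p / p.sum())
            z[n] = k_new
            Nk[k_new] += 1
            Sk[k_new] += X[n]
    return z, (Sk + a) / (Nk[:, None] + a + b)   # assignments, item probs
```

Once fitted, the probability that a transaction contains itemset I can be scored as sum_k (N_k / N) * prod_{d in I} mu[k, d], which is why querying the mixture is much cheaper than re-mining the raw data.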
Combining Random Walks and Nonparametric Bayesian Topic Model for Community Detection
Community detection has been an active research area for decades. Among
probabilistic models, the Stochastic Block Model has been the most popular.
This paper introduces a novel probabilistic model, RW-HDP, based on random
walks and the Hierarchical Dirichlet Process, for community extraction. In
RW-HDP, random walks conducted in a social network are treated as documents
and nodes are treated as words. By using the Hierarchical Dirichlet Process, a
nonparametric Bayesian model, we are able not only to cluster nodes into
different communities but also to determine the number of communities
automatically. We use Stochastic Variational Inference for model inference,
which makes our method time-efficient and easily extensible to an online
learning algorithm.
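The walks-as-documents idea is easy to prototype. Below is a minimal sketch under assumptions: networkx's built-in karate club graph stands in for the social network, and gensim's HdpModel (an online variational HDP) stands in for the paper's stochastic variational inference; walk counts and lengths are illustrative.

```python
import random
import networkx as nx
from gensim.corpora import Dictionary
from gensim.models import HdpModel

def random_walk_docs(G, walks_per_node=10, walk_len=40, seed=0):
    """Treat each random walk as a 'document' whose 'words' are node ids."""
    rng = random.Random(seed)
    docs = []
    for start in G.nodes():
        for _ in range(walks_per_node):
            walk, node = [str(start)], start
            for _ in range(walk_len - 1):
                nbrs = list(G.neighbors(node))
                if not nbrs:
                    break
                node = rng.choice(nbrs)
                walk.append(str(node))
            docs.append(walk)
    return docs

G = nx.karate_club_graph()
docs = random_walk_docs(G)
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
hdp = HdpModel(corpus, id2word=dictionary)   # nonparametric: K inferred
print(hdp.show_topics(num_topics=5, num_words=8))
```

A node's community can then be read off as the topic under which that node has the highest weight.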
Semi-Automatic Terminology Ontology Learning Based on Topic Modeling
Ontologies provide features such as a common vocabulary, reusability, and
machine-readable content, and they also allow for semantic search, facilitate
agent interaction, and support the ordering and structuring of knowledge for
Semantic Web (Web 3.0) applications. However, a challenge in ontology
engineering is automatic learning: there is still no fully automatic approach
for forming an ontology from a text corpus or a dataset of various topics
using machine learning techniques. In this paper, two topic modeling
algorithms are explored, namely LSI & SVD and Mr.LDA, for learning a topic
ontology. The objective is to determine the statistical relationship between
documents and terms to build a topic ontology and ontology graph with minimum
human intervention. Experimental analysis on building a topic ontology, and on
semantically retrieving the corresponding topic ontology for a user's query,
demonstrates the effectiveness of the proposed approach.
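For the LSI & SVD side, here is a small sketch of how a truncated SVD of the document-term matrix yields the document-term relationships from which a topic ontology can be assembled. The corpus and component count are toy assumptions, and scikit-learn is used rather than the authors' tooling.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ontology engineering for the semantic web",
        "topic models learn latent structure from text",
        "semantic search over a machine-readable vocabulary"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                 # document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)             # documents in topic space
term_topics = svd.components_                 # topics as weights over terms
terms = tfidf.get_feature_names_out()
for k, topic in enumerate(term_topics):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```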
Infinite Shift-invariant Grouped Multi-task Learning for Gaussian Processes
Multi-task learning leverages shared information among data sets to improve
the learning performance of individual tasks. The paper applies this framework
to data where each task is a phase-shifted periodic time series. In
particular, we develop a novel Bayesian nonparametric model capturing a mixture
of Gaussian processes where each task is a sum of a group-specific function and
a component capturing individual variation, in addition to each task being
phase shifted. We develop an efficient EM algorithm to learn the parameters of
the model. As a special case, we obtain the Gaussian mixture model and EM
algorithm for phase-shifted periodic time series. Furthermore, we extend the
proposed model with a Dirichlet Process prior, leading to an infinite mixture
model that is capable of automatic model selection. A variational Bayesian
approach is developed for inference in this
model. Experiments in regression, classification and class discovery
demonstrate the performance of the proposed models using both synthetic data
and real-world time series data from astrophysics. Our methods are
particularly useful when the time series are sparsely and non-synchronously
sampled.
Comment: This is an extended version of our ECML 2010 paper entitled
"Shift-invariant Grouped Multi-task Learning for Gaussian Processes"; ECML
PKDD'10 Proceedings of the 2010 European Conference on Machine Learning and
Knowledge Discovery in Databases: Part II.
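To make the phase-shift idea concrete, here is a toy single-task illustration (an assumed setup, not the paper's grouped DP mixture): one periodic series, sampled sparsely and non-uniformly with an unknown phase shift, fitted with a periodic-kernel GP from scikit-learn.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

# Hypothetical illustration: one periodic task observed with an unknown
# phase shift tau. A periodic GP fitted to the shifted samples recovers
# the shared waveform; the grouped model in the paper additionally shares
# a group-specific function across tasks and mixes groups with a DP prior.
rng = np.random.default_rng(0)
tau = 0.7                                      # task-specific phase shift
t = np.sort(rng.uniform(0, 4, 30))             # sparse, non-uniform samples
y = np.sin(2 * np.pi * (t + tau)) + 0.1 * rng.standard_normal(30)
kernel = ExpSineSquared(periodicity=1.0) + WhiteKernel(0.01)
gp = GaussianProcessRegressor(kernel=kernel).fit(t[:, None], y)
print(gp.kernel_)                              # learned period and noise
```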
Mining Associated Text and Images with Dual-Wing Harmoniums
We propose a multi-wing harmonium model for mining multimedia data that
extends and improves on earlier models based on two-layer random fields, which
capture bidirectional dependencies between hidden topic aspects and observed
inputs. This model can be viewed as an undirected counterpart of the two-layer
directed models such as LDA for similar tasks, but bears significant difference
in inference/learning cost tradeoffs, latent topic representations, and topic
mixing mechanisms. In particular, our model facilitates efficient inference
and robust topic mixing, and potentially provides high flexibility in modeling
the latent topic spaces. A contrastive divergence algorithm and a variational
algorithm are derived for learning. We specialize our model to a dual-wing
harmonium for
captioned images, incorporating a multivariate Poisson for word-counts and a
multivariate Gaussian for color histogram. We present empirical results on the
applications of this model to classification, retrieval and image annotation on
news video collections, and we report an extensive comparison with various
extant models.
Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty
in Artificial Intelligence (UAI 2005).
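The contrastive-divergence learning step can be sketched for the simplest binary harmonium (an RBM). This is an assumed simplification (the paper's dual-wing model uses multivariate Poisson and Gaussian visible units for its two wings), but the positive-phase/negative-phase update has the same shape.

```python
import numpy as np

def cd1_step(V, W, b, c, lr=0.05, rng=None):
    """One contrastive-divergence (CD-1) update for a binary two-layer
    random field (RBM). V: (N, D) visible batch; W: (D, H) weights;
    b: visible bias (D,); c: hidden bias (H,)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
    ph = sigm(V @ W + c)                       # P(h=1 | v), positive phase
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigm(h @ W.T + b)                     # reconstruction P(v=1 | h)
    v1 = (rng.random(pv.shape) < pv).astype(float)
    ph1 = sigm(v1 @ W + c)                     # negative phase
    W += lr * (V.T @ ph - v1.T @ ph1) / len(V)
    b += lr * (V - v1).mean(axis=0)
    c += lr * (ph - ph1).mean(axis=0)
    return W, b, c
```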
Scalable Probabilistic Entity-Topic Modeling
We present an LDA approach to entity disambiguation. Each topic is associated
with a Wikipedia article and topics generate either content words or entity
mentions. Training such models is challenging because of the topic and
vocabulary size, both in the millions. We tackle these problems using a novel
distributed inference and representation framework based on a parallel Gibbs
sampler guided by the Wikipedia link graph, and pipelines of MapReduce allowing
fast and memory-frugal processing of large datasets. We report
state-of-the-art performance on a public dataset.
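As a sketch of the sampling core, the collapsed Gibbs loop below runs over one document shard. In an approximate distributed run (in the spirit of AD-LDA, an assumed stand-in for the paper's sampler), each worker executes this loop against a stale copy of the topic-word counts and the count deltas are merged after every sweep; the Wikipedia-link-graph guidance used in the paper is omitted here.

```python
import numpy as np

def lda_gibbs(docs, V, K=20, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs for LDA on a single shard. docs: list of
    word-id lists; V: vocabulary size. Priors are folded into the
    count initializations below."""
    rng = np.random.default_rng(seed)
    nkw = np.zeros((K, V)) + beta               # topic-word counts
    nk = np.zeros(K) + V * beta                 # topic totals
    ndk = [np.zeros(K) + alpha for _ in docs]   # doc-topic counts
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; nkw[k, w] += 1; nk[k] += 1; ndk[d][k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                nkw[k, w] -= 1; nk[k] -= 1; ndk[d][k] -= 1
                p = ndk[d] * nkw[:, w] / nk     # collapsed conditional
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                nkw[k, w] += 1; nk[k] += 1; ndk[d][k] += 1
    return nkw, z
```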
Multilingual Topic Models
Scientific publications have evolved several features for mitigating
vocabulary mismatch when indexing, retrieving, and computing similarity between
articles. These mitigation strategies range from simply focusing on high-value
article sections, such as titles and abstracts, to assigning keywords, often
from controlled vocabularies, either manually or through automatic annotation.
Various document representation schemes possess different cost-benefit
tradeoffs. In this paper, we propose to model different representations of the
same article as translations of each other, all generated from a common latent
representation in a multilingual topic model. We start with a methodological
overview on latent variable models for parallel document representations that
could be used across many information science tasks. We then show how solving
the inference problem of mapping diverse representations into a shared topic
space allows us to evaluate representations based on how topically similar they
are to the original article. In addition, our proposed approach provides means
to discover where different concept vocabularies require improvement.
Comment: 18 pages, 9 figures.
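As a rough proxy for this evaluation, and under the assumption that a single ordinary LDA model stands in for the multilingual topic model (which instead ties representations together through shared per-article topic proportions), one can infer topic distributions for a full text and its abstract-only representation and measure their topical similarity.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.matutils import cossim

# Toy corpora: full articles and their abstract-only representations.
full_texts = [["topic", "model", "inference", "latent", "variable"],
              ["community", "detection", "network", "graph", "cluster"]]
abstracts = [["topic", "model", "latent"],
             ["community", "network", "graph"]]
dictionary = Dictionary(full_texts + abstracts)
corpus = [dictionary.doc2bow(t) for t in full_texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
for full, abstract in zip(full_texts, abstracts):
    theta_full = lda[dictionary.doc2bow(full)]       # topic distribution
    theta_abs = lda[dictionary.doc2bow(abstract)]
    print(cossim(theta_full, theta_abs))             # topical similarity
```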
Component models for large networks
Being among the easiest ways to find meaningful structure from discrete data,
Latent Dirichlet Allocation (LDA) and related component models have been
applied widely. They are simple, computationally fast and scalable,
interpretable, and admit nonparametric priors. In the currently popular field
of network modeling, relatively little work has taken uncertainty of data
seriously in the Bayesian sense, and component models have been introduced to
the field only recently, by treating each node as a bag of outgoing links. We
introduce an alternative, the interaction component model for communities
(ICMc), where the whole network is a bag of links stemming from different
components. The node-based models find both disassortative and assortative
structure, while ICMc assumes assortativity and finds community-like
structure, as do the earlier methods motivated by physics. With Dirichlet
Process priors and an efficient implementation, the models are highly
scalable, as demonstrated with a social network from the Last.fm web site with
670,000 nodes and 1.89 million links.
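A minimal collapsed Gibbs sketch of the bag-of-links idea, with an assumed fixed number of components K in place of the paper's Dirichlet Process prior: each link picks a component, and both endpoints are drawn from that component's node distribution.

```python
import numpy as np

def icm_links_gibbs(edges, N, K=10, alpha=1.0, gamma=0.1, iters=200, seed=0):
    """Collapsed Gibbs sketch for an interaction-component model: the
    network is a bag of links, each link drawn by picking a component k
    and then drawing both endpoints from k's node distribution (with a
    symmetric Dirichlet(gamma) prior). edges: list of (i, j) pairs."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(edges))
    Mk = np.bincount(z, minlength=K).astype(float)   # links per component
    nkv = np.zeros((K, N))                           # node counts per comp.
    for e, (i, j) in enumerate(edges):
        nkv[z[e], i] += 1
        nkv[z[e], j] += 1
    for _ in range(iters):
        for e, (i, j) in enumerate(edges):
            k = z[e]
            Mk[k] -= 1; nkv[k, i] -= 1; nkv[k, j] -= 1
            denom = 2 * Mk + N * gamma
            p = (Mk + alpha) * (nkv[:, i] + gamma) * (nkv[:, j] + gamma) \
                / (denom * (denom + 1))
            k = rng.choice(K, p=p / p.sum())
            z[e] = k
            Mk[k] += 1; nkv[k, i] += 1; nkv[k, j] += 1
    # a node's community: the component where it appears in most links
    return nkv.argmax(axis=0), z
```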
Spatial Semantic Scan: Jointly Detecting Subtle Events and their Spatial Footprint
Many methods have been proposed for detecting emerging events in text streams
using topic modeling. However, these methods have shortcomings that make them
unsuitable for rapid detection of locally emerging events on massive text
streams. We describe Spatially Compact Semantic Scan (SCSS) that has been
developed specifically to overcome the shortcomings of current methods in
detecting new spatially compact events in text streams. SCSS employs
alternating optimization between using semantic scan to estimate contrastive
foreground topics in documents, and discovering spatial neighborhoods with high
occurrence of documents containing the foreground topics. We evaluate our
method on an Emergency Department chief complaints dataset (ED dataset) to
verify its effectiveness in detecting real-world disease outbreaks from
free-text ED chief complaint data.
Comment: 26 pages.
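The spatial half of the alternating optimization can be sketched with a Kulldorff-style Bernoulli scan statistic (an assumed simplification of the SCSS objective): given the documents flagged as containing the estimated foreground topic, score circular neighborhoods by how much their foreground rate exceeds the global rate.

```python
import numpy as np

def bernoulli_llr(c, b, G, n):
    """Log-likelihood ratio for c foreground docs among b docs inside a
    region, versus G foreground docs among n docs overall."""
    def term(k, m):
        return k * np.log(k / m) if k > 0 else 0.0
    inside = term(c, b) + term(b - c, b)
    outside = term(G - c, n - b) + term((n - b) - (G - c), n - b)
    null = term(G, n) + term(n - G, n)
    return inside + outside - null

def scan_neighborhoods(locs, fg, radii=(0.5, 1.0, 2.0)):
    """Score circular neighborhoods centered on each document; return
    the highest-scoring (score, center, radius). locs: (n, 2) array of
    document locations; fg: 0/1 array flagging foreground-topic docs."""
    n, G = len(fg), int(fg.sum())
    best = (0.0, None, None)
    for center in range(n):
        d = np.linalg.norm(locs - locs[center], axis=1)
        for r in radii:
            mask = d <= r
            b, c = int(mask.sum()), int(fg[mask].sum())
            if 0 < b < n and c / b > G / n:
                s = bernoulli_llr(c, b, G, n)
                if s > best[0]:
                    best = (s, center, r)
    return best
```

The full method alternates this scan with re-estimating the contrastive foreground topics on the selected neighborhood.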
Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey
Topic modeling is one of the most powerful techniques in text mining for
latent data discovery and for finding relationships among data and text
documents. Researchers have published many articles on topic modeling and have
applied it in various fields such as software engineering, political science,
medicine, and linguistics. There are various methods for topic modeling, of
which Latent Dirichlet Allocation (LDA) is one of the most popular. Researchers
have proposed various models based on LDA for topic modeling. Building on
previous work, this paper can serve as a useful introduction to LDA approaches
in topic modeling. We investigated scholarly articles published between 2003
and 2016 that are highly related to topic modeling based on LDA, to discover
the research development, current trends, and intellectual structure of topic
modeling. We also summarize challenges and introduce well-known tools and
datasets for topic modeling based on LDA.
Comment: arXiv admin note: text overlap with arXiv:1505.07302 by other authors.
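As one example of the tools such a survey covers, the snippet below fits a basic LDA model with gensim (an assumed choice; MALLET and scikit-learn are common alternatives), using a toy corpus.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus of tokenized documents.
texts = [["human", "machine", "interface", "system"],
         ["graph", "trees", "network", "paths"],
         ["user", "interface", "system", "response"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=50)
for k in range(2):
    print(lda.print_topic(k, topn=4))   # top words per topic
```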