1,811 research outputs found

    Bayesian Mixture Models for Frequent Itemset Discovery

    Full text link
    In binary-transaction data-mining, traditional frequent itemset mining often produces results which are not straightforward to interpret. To overcome this problem, probability models are often used to produce more compact and conclusive results, albeit with some loss of accuracy. Bayesian statistics have been widely used in the development of probability models in machine learning in recent years and these methods have many advantages, including their abilities to avoid overfitting. In this paper, we develop two Bayesian mixture models with the Dirichlet distribution prior and the Dirichlet process (DP) prior to improve the previous non-Bayesian mixture model developed for transaction dataset mining. We implement the inference of both mixture models using two methods: a collapsed Gibbs sampling scheme and a variational approximation algorithm. Experiments in several benchmark problems have shown that both mixture models achieve better performance than a non-Bayesian mixture model. The variational algorithm is the faster of the two approaches while the Gibbs sampling method achieves a more accurate results. The Dirichlet process mixture model can automatically grow to a proper complexity for a better approximation. Once the model is built, it can be very fast to query and run analysis on (typically 10 times faster than Eclat, as we will show in the experiment section). However, these approaches also show that mixture models underestimate the probabilities of frequent itemsets. Consequently, these models have a higher sensitivity but a lower specificity

    Combining Random Walks and Nonparametric Bayesian Topic Model for Community Detection

    Full text link
    Community detection has been an active research area for decades. Among all probabilistic models, Stochastic Block Model has been the most popular one. This paper introduces a novel probabilistic model: RW-HDP, based on random walks and Hierarchical Dirichlet Process, for community extraction. In RW-HDP, random walks conducted in a social network are treated as documents; nodes are treated as words. By using Hierarchical Dirichlet Process, a nonparametric Bayesian model, we are not only able to cluster nodes into different communities, but also determine the number of communities automatically. We use Stochastic Variational Inference for our model inference, which makes our method time efficient and can be easily extended to an online learning algorithm

    Semi-Automatic Terminology Ontology Learning Based on Topic Modeling

    Full text link
    Ontologies provide features like a common vocabulary, reusability, machine-readable content, and also allows for semantic search, facilitate agent interaction and ordering & structuring of knowledge for the Semantic Web (Web 3.0) application. However, the challenge in ontology engineering is automatic learning, i.e., the there is still a lack of fully automatic approach from a text corpus or dataset of various topics to form ontology using machine learning techniques. In this paper, two topic modeling algorithms are explored, namely LSI & SVD and Mr.LDA for learning topic ontology. The objective is to determine the statistical relationship between document and terms to build a topic ontology and ontology graph with minimum human intervention. Experimental analysis on building a topic ontology and semantic retrieving corresponding topic ontology for the user's query demonstrating the effectiveness of the proposed approach

    Infinite Shift-invariant Grouped Multi-task Learning for Gaussian Processes

    Full text link
    Multi-task learning leverages shared information among data sets to improve the learning performance of individual tasks. The paper applies this framework for data where each task is a phase-shifted periodic time series. In particular, we develop a novel Bayesian nonparametric model capturing a mixture of Gaussian processes where each task is a sum of a group-specific function and a component capturing individual variation, in addition to each task being phase shifted. We develop an efficient \textsc{em} algorithm to learn the parameters of the model. As a special case we obtain the Gaussian mixture model and \textsc{em} algorithm for phased-shifted periodic time series. Furthermore, we extend the proposed model by using a Dirichlet Process prior and thereby leading to an infinite mixture model that is capable of doing automatic model selection. A Variational Bayesian approach is developed for inference in this model. Experiments in regression, classification and class discovery demonstrate the performance of the proposed models using both synthetic data and real-world time series data from astrophysics. Our methods are particularly useful when the time series are sparsely and non-synchronously sampled.Comment: This is an extended version of our ECML 2010 paper entitled "Shift-invariant Grouped Multi-task Learning for Gaussian Processes"; ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part II

    Mining Associated Text and Images with Dual-Wing Harmoniums

    Full text link
    We propose a multi-wing harmonium model for mining multimedia data that extends and improves on earlier models based on two-layer random fields, which capture bidirectional dependencies between hidden topic aspects and observed inputs. This model can be viewed as an undirected counterpart of the two-layer directed models such as LDA for similar tasks, but bears significant difference in inference/learning cost tradeoffs, latent topic representations, and topic mixing mechanisms. In particular, our model facilitates efficient inference and robust topic mixing, and potentially provides high flexibilities in modeling the latent topic spaces. A contrastive divergence and a variational algorithm are derived for learning. We specialized our model to a dual-wing harmonium for captioned images, incorporating a multivariate Poisson for word-counts and a multivariate Gaussian for color histogram. We present empirical results on the applications of this model to classification, retrieval and image annotation on news video collections, and we report an extensive comparison with various extant models.Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005

    Scalable Probabilistic Entity-Topic Modeling

    Full text link
    We present an LDA approach to entity disambiguation. Each topic is associated with a Wikipedia article and topics generate either content words or entity mentions. Training such models is challenging because of the topic and vocabulary size, both in the millions. We tackle these problems using a novel distributed inference and representation framework based on a parallel Gibbs sampler guided by the Wikipedia link graph, and pipelines of MapReduce allowing fast and memory-frugal processing of large datasets. We report state-of-the-art performance on a public dataset

    Multilingual Topic Models

    Full text link
    Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview on latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides means to discover where different concept vocabularies require improvement.Comment: 18 pages, 9 figure

    Component models for large networks

    Full text link
    Being among the easiest ways to find meaningful structure from discrete data, Latent Dirichlet Allocation (LDA) and related component models have been applied widely. They are simple, computationally fast and scalable, interpretable, and admit nonparametric priors. In the currently popular field of network modeling, relatively little work has taken uncertainty of data seriously in the Bayesian sense, and component models have been introduced to the field only recently, by treating each node as a bag of out-going links. We introduce an alternative, interaction component model for communities (ICMc), where the whole network is a bag of links, stemming from different components. The former finds both disassortative and assortative structure, while the alternative assumes assortativity and finds community-like structures like the earlier methods motivated by physics. With Dirichlet Process priors and an efficient implementation the models are highly scalable, as demonstrated with a social network from the Last.fm web site, with 670,000 nodes and 1.89 million links

    Spatial Semantic Scan: Jointly Detecting Subtle Events and their Spatial Footprint

    Full text link
    Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS) that has been developed specifically to overcome the shortcomings of current methods in detecting new spatially compact events in text streams. SCSS employs alternating optimization between using semantic scan to estimate contrastive foreground topics in documents, and discovering spatial neighborhoods with high occurrence of documents containing the foreground topics. We evaluate our method on Emergency Department chief complaints dataset (ED dataset) to verify the effectiveness of our method in detecting real-world disease outbreaks from free-text ED chief complaint data.Comment: 26 page

    Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey

    Full text link
    Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data, text documents. Researchers have published many articles in the field of topic modeling and applied in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling, which Latent Dirichlet allocation (LDA) is one of the most popular methods in this field. Researchers have proposed various models based on the LDA in topic modeling. According to previous work, this paper can be very useful and valuable for introducing LDA approaches in topic modeling. In this paper, we investigated scholarly articles highly (between 2003 to 2016) related to Topic Modeling based on LDA to discover the research development, current trends and intellectual structure of topic modeling. Also, we summarize challenges and introduce famous tools and datasets in topic modeling based on LDA.Comment: arXiv admin note: text overlap with arXiv:1505.07302 by other author
    • …
    corecore