7 research outputs found

    Efficient Correlated Topic Modeling with Topic Embedding

    Correlated topic modeling has been limited to small model and problem sizes due to its high computational cost and poor scaling. In this paper, we propose a new model which learns compact topic embeddings and captures topic correlations through the closeness between the topic vectors. Our method enables efficient inference in the low-dimensional embedding space, reducing the previous cubic or quadratic time complexity to linear w.r.t. the topic size. We further speed up variational inference with a fast sampler that exploits the sparsity of topic occurrence. Extensive experiments show that our approach is capable of handling model and data scales which are several orders of magnitude larger than existing correlation results, without sacrificing modeling quality, providing competitive or superior performance in document classification and retrieval. Comment: KDD 2017 oral. The first two authors contributed equally.

    Multidimensional Membership Mixture Models

    We present the multidimensional membership mixture (M3) models, where every dimension of the membership represents an independent mixture model and each data point is generated from the selected mixture components jointly. This is helpful when the data has a certain shared structure. For example, three unique means and three unique variances can effectively form a Gaussian mixture model with nine components, while requiring only six parameters to fully describe it. In this paper, we present three instantiations of M3 models (together with the learning and inference algorithms): infinite, finite, and hybrid, depending on whether the number of mixtures is fixed or not. They are built upon Dirichlet process mixture models, latent Dirichlet allocation, and a combination of the two, respectively. We then consider two applications: topic modeling and learning 3D object arrangements. Our experiments show that our M3 models achieve better performance using fewer topics than many classic topic models. We also observe that topics from the different dimensions of M3 models are meaningful and orthogonal to each other. Comment: 9 pages, 7 figures.
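    The counting argument in the abstract above (three means and three variances giving nine Gaussian components from six parameters) can be illustrated with a minimal sketch. The numeric values and the helper name `m3_components` are hypothetical, chosen only for illustration. Mixing weights are left out, following the abstract's own parameter count. This is not the authors' learning or inference algorithm.

```python
from itertools import product

# Hypothetical values, chosen only for illustration (not from the paper).
means = [-5.0, 0.0, 5.0]      # dimension 1 of the membership: which mean
variances = [0.5, 1.0, 2.0]   # dimension 2 of the membership: which variance


def m3_components(means, variances):
    """Pair every mean with every variance to form the joint components."""
    return list(product(means, variances))


components = m3_components(means, variances)

# Shared parameters: one per mean plus one per variance.
num_shared_parameters = len(means) + len(variances)

# A plain mixture with the same components stores a mean and a variance each.
num_unshared_parameters = 2 * len(components)

print(len(components), num_shared_parameters, num_unshared_parameters)
```

    Under these assumptions, the nine (mean, variance) pairs reuse six shared values, whereas an unstructured nine-component mixture would store eighteen.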

    Six papers on computational methods for the analysis of structured and unstructured data in the economic domain

    This work investigates the application of computational methods for structured and unstructured data. The domains of application are two closely connected fields with the common goal of promoting the stability of the financial system: systemic risk and bank supervision. The work explores different families of models and applies them to different tasks: graphical Gaussian network models to address bank interconnectivity, topic models to monitor bank news, and deep learning for text classification. New applications and variants of these models are investigated, with particular attention to the combined use of textual and structured data. The penultimate chapter introduces a sentiment polarity classification tool for Italian, based on deep learning, to simplify future research relying on sentiment analysis. The different models have proven useful for leveraging numerical (structured) and textual (unstructured) data. Graphical Gaussian models and topic models have been adopted for inspection and descriptive tasks, while deep learning has been applied mainly to predictive (classification) problems. Overall, the integration of textual (unstructured) and numerical (structured) information has proven useful for analysis related to systemic risk and bank supervision. The integration of textual data with numerical data has led either to higher predictive performance or to an enhanced capability of explaining phenomena and correlating them with other events.

    Variational-Based Latent Generalized Dirichlet Allocation Model in the Collapsed Space and Applications

    In the topic modeling framework, the performance of many Dirichlet-based models has been hindered by the limitations of the conjugate prior. This led to models with more flexible priors, such as the generalized Dirichlet distribution, that tend to capture semantic relationships between topics (topic correlation). However, these extensions also suffer from incomplete generative processes that complicate traditional inference schemes such as VB (Variational Bayes) and CGS (Collapsed Gibbs Sampling). As a result, the new approach, CVB-LGDA (Collapsed Variational Bayesian inference for the Latent Generalized Dirichlet Allocation), presents a scheme that integrates a complete generative process with a robust inference technique for topic correlation and codebook analysis. Its performance in image classification, facial expression recognition, 3D object categorization, and action recognition in videos shows its merits.

    A family of statistical topic models for text and multimedia documents

    In this thesis, we investigate several extensions of the basic Latent Dirichlet Allocation model for text and multimedia documents containing images and texts, video and texts, or audio-video and texts. For exploratory analysis of large-scale text document collections, we present Independent Factor Topic Models (IFTM), which capture topic correlations using linear latent variable models to directly uncover the hidden sources of correlations. Such a framework offers great flexibility in exploring different forms of source prior, and in this work we investigate two source distributions: Gaussian and Laplacian. When the sparse source prior is used, we can visualize and interpret the sources of correlations and construct a simple topic graph which can be used to navigate large-scale archives. In extending IFTM to learn correlations between latent topics of different data modalities in multimedia documents, we present a topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA), which uses a linear regression module to learn the precise relationships between latent variables in different modalities. We employ tr-mmLDA in an image and video annotation task, where the goal is to learn the statistical association between images and their corresponding captions, so that the caption data can be accurately inferred in the test set. When dealing with annotation data that acts more like class labels, the assumption in tr-mmLDA that caption words in the same document can be generated from multiple hidden topics might be overly complex. For such annotation data, we propose a novel statistical topic model called sLDA-bin, which extends the supervised Latent Dirichlet Allocation (sLDA) [BM07] model to handle a multivariate binary response variable of the annotation data. We show superior image annotation and retrieval results when comparing sLDA-bin with correspondence LDA [BJ03] on standard image datasets.
We also extend the association model from the image-text and video-text cases to perform automatic annotation of multimedia documents containing audio and video. We find that, unlike cLDA, tr-mmLDA and sLDA-bin can be straightforwardly extended to include influence from additional data modalities in predicting annotation by incorporating the latent topics from the additional modality as another set of covariates into the linear and logistic regression modules, respectively.