A family of statistical topic models for text and multimedia documents

Abstract

In this thesis, we investigate several extensions of the basic Latent Dirichlet Allocation (LDA) model for text and multimedia documents containing images and text, video and text, or audio-video and text. For exploratory analysis of large-scale text document collections, we present Independent Factor Topic Models (IFTM), which capture topic correlations using linear latent variable models to directly uncover the hidden sources of correlation. Such a framework offers great flexibility in exploring different forms of source prior, and in this work we investigate two source distributions: Gaussian and Laplacian. When the sparse Laplacian source prior is used, we can visualize and interpret the sources of correlation and construct a simple topic graph that can be used to navigate large-scale archives. In extending IFTM to learn correlations between latent topics of different data modalities in multimedia documents, we present topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA), which uses a linear regression module to learn the precise relationships between latent variables in different modalities. We employ tr-mmLDA in an image and video annotation task, where the goal is to learn the statistical association between images and their corresponding captions so that the captions can be accurately inferred for documents in the test set. When dealing with annotation data that act more like class labels, the assumption in tr-mmLDA that caption words in the same document are generated from multiple hidden topics might be overly complex. For such annotation data, we propose a novel statistical topic model called sLDA-bin, which extends the supervised Latent Dirichlet Allocation (sLDA) model [BM07] to handle a multivariate binary response variable for the annotation data. We show superior image annotation and retrieval results for sLDA-bin compared with correspondence LDA (cLDA) [BJ03] on standard image datasets.
We also extend the association models from the image-text and video-text cases to perform automatic annotation of multimedia documents containing audio and video. We find that, unlike cLDA, tr-mmLDA and sLDA-bin can be straightforwardly extended to include the influence of additional data modalities in predicting annotations, by incorporating the latent topics from the additional modality as another set of covariates in the linear and logistic regression modules, respectively.
