A Deep and Autoregressive Approach for Topic Modeling of Multimodal Data
Topic modeling based on latent Dirichlet allocation (LDA) has been a
framework of choice to deal with multimodal data, such as in image annotation
tasks. Another popular approach to model the multimodal data is through deep
neural networks, such as the deep Boltzmann machine (DBM). Recently, a new type
of topic model called the Document Neural Autoregressive Distribution Estimator
(DocNADE) was proposed and demonstrated state-of-the-art performance for text
document modeling. In this work, we show how to successfully apply and extend
this model to multimodal data, such as simultaneous image classification and
annotation. First, we propose SupDocNADE, a supervised extension of DocNADE,
that increases the discriminative power of the learned hidden topic features
and show how to employ it to learn a joint representation from image visual
words, annotation words and class label information. We test our model on the
LabelMe and UIUC-Sports data sets and show that it compares favorably to other
topic models. Second, we propose a deep extension of our model and provide an
efficient way of training the deep model. Experimental results show that our
deep model outperforms its shallow version and reaches state-of-the-art
performance on the Multimedia Information Retrieval (MIR) Flickr data set.
Comment: 24 pages, 10 figures. A version has been accepted by TPAMI on Aug 4th, 2015. Adds a footnote about how to train the model in practice in Section 5.1. arXiv admin note: substantial text overlap with arXiv:1305.530
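The autoregressive factorization at the heart of DocNADE, p(v) = Π_i p(v_i | v_<i), with a hidden state accumulated word by word, can be sketched in a few lines of NumPy. The vocabulary size, hidden size, weights, and document below are illustrative stand-ins, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 50, 8                              # vocabulary size, hidden units (illustrative)
W = rng.normal(scale=0.1, size=(H, V))    # word-to-hidden weights
U = rng.normal(scale=0.1, size=(V, H))    # hidden-to-word weights
c = np.zeros(H)                           # hidden bias
b = np.zeros(V)                           # output bias

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def docnade_log_likelihood(doc):
    """log p(doc) = sum_i log p(v_i | v_<i); the hidden state for word i
    depends only on the preceding words, accumulated additively."""
    acc = np.zeros(H)                     # running sum of W[:, v_k] for k < i
    ll = 0.0
    for v in doc:
        h = 1.0 / (1.0 + np.exp(-(c + acc)))   # sigmoid hidden layer
        p = softmax(b + U @ h)                  # conditional over the vocabulary
        ll += np.log(p[v])
        acc += W[:, v]
    return ll

print(docnade_log_likelihood([3, 17, 17, 42]))
```

Because each conditional is a proper distribution, the log-likelihood is exact (no variational bound), which is one reason DocNADE is attractive relative to LDA- or DBM-style models.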
Learning Object Categories From Internet Image Searches
In this paper, we describe a simple approach to learning models of visual object categories from images gathered from Internet image search engines. The images for a given keyword are typically highly variable, with a large fraction being unrelated to the query term, and thus pose a challenging environment from which to learn. By training our models directly from Internet images, we remove the need to laboriously compile training data sets, required by most other recognition approaches; this opens up the possibility of learning object category models “on-the-fly.” We describe two simple approaches, derived from the probabilistic latent semantic analysis (pLSA) technique for text document analysis, that can be used to automatically learn object models from these data. We show two applications of the learned model: first, to rerank the images returned by the search engine, thus improving the quality of the search results; and second, to recognize objects in other image data sets.
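The pLSA decomposition underlying the approaches above, p(w|d) = Σ_z p(w|z) p(z|d), is fit with EM. A minimal sketch on a toy word-count matrix (all sizes and data are illustrative, and this omits the folding-in step the paper would need for new images):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Wn, Z = 6, 20, 3                              # documents, vocabulary, latent topics (toy sizes)
N = rng.integers(1, 5, size=(D, Wn)).astype(float)   # word-count matrix n(d, w)

p_wz = rng.random((Z, Wn)); p_wz /= p_wz.sum(1, keepdims=True)   # p(w|z)
p_zd = rng.random((D, Z));  p_zd /= p_zd.sum(1, keepdims=True)   # p(z|d)

for _ in range(50):
    # E-step: responsibilities p(z|d,w) ∝ p(w|z) p(z|d)
    post = p_zd[:, :, None] * p_wz[None, :, :]        # shape (D, Z, W)
    post /= post.sum(1, keepdims=True) + 1e-12
    # M-step: re-estimate both factors from expected counts n(d,w) p(z|d,w)
    nz = N[:, None, :] * post                          # expected counts (D, Z, W)
    p_wz = nz.sum(0); p_wz /= p_wz.sum(1, keepdims=True) + 1e-12
    p_zd = nz.sum(2); p_zd /= p_zd.sum(1, keepdims=True) + 1e-12
```

For reranking, one would score each returned image by how much of its mass p(z|d) falls on the topic(s) associated with the target category.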
AXES at TRECVid 2011
The AXES project participated in the interactive known-item search task (KIS) and the interactive instance search task (INS) for TRECVid 2011. We used the same system architecture and a nearly identical user interface for both the KIS and INS tasks. Both systems made use of text search on ASR, visual concept detectors, and visual similarity search. The user experiments were carried out with media professionals and media students at the Netherlands Institute for Sound and Vision, with media professionals performing the KIS task and media students participating in the INS task. This paper describes the results and findings of our experiments.
Scene classification using spatial pyramid matching and hierarchical Dirichlet processes
The goal of scene classification is to automatically assign a scene image to a semantic category (e.g., building or river) based on analyzing the visual contents of this image. This is a challenging problem due to the scene images' variability, ambiguity, and the wide range of illumination or scale conditions that may apply. At the same time, it is a fundamental problem in computer vision and can be used to guide other processes such as image browsing, content-based image retrieval and object recognition by providing contextual information. This thesis implemented two scene classification systems: one based on Spatial Pyramid Matching (SPM) and the other applying Hierarchical Dirichlet Processes (HDP). Both approaches build on the popular bag-of-words representation, a histogram of quantized visual features. SPM represents an image as a spatial pyramid produced by computing histograms of local features at multiple levels with different resolutions. Spatial Pyramid Matching is then used to estimate the overall perceptual similarity between images, which can serve as a support vector machine (SVM) kernel. In the second approach, HDP is used to model the bag-of-words representations of images; each image is described as a mixture of latent themes and each theme as a mixture of words. The number of themes is automatically inferred from data, and the themes are shared by images not only within one scene category but across all categories. Both systems are tested on three popular datasets from the field and their performances are compared. In addition, the two approaches are combined, resulting in performance improvement over either separate system.
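The spatial pyramid representation described above can be sketched directly: histograms of quantized features are computed over progressively finer grids, weighted, concatenated, and compared with a histogram-intersection kernel. The level weighting below follows the usual pyramid-match scheme (coarser levels down-weighted); the point coordinates, visual-word ids, and vocabulary size are made up for illustration:

```python
import numpy as np

def spatial_pyramid_hist(points, labels, K, levels=2):
    """Concatenated, weighted visual-word histograms over a pyramid of grids.
    points: (N, 2) coordinates in [0, 1)^2; labels: (N,) visual-word ids in [0, K)."""
    feats = []
    for l in range(levels + 1):
        g = 2 ** l                                        # g x g grid at level l
        w = 1.0 / 2 ** levels if l == 0 else 1.0 / 2 ** (levels - l + 1)
        cells = np.floor(points * g).astype(int).clip(0, g - 1)
        for cx in range(g):
            for cy in range(g):
                mask = (cells[:, 0] == cx) & (cells[:, 1] == cy)
                h = np.bincount(labels[mask], minlength=K).astype(float)
                feats.append(w * h)
    return np.concatenate(feats)

def intersection_kernel(a, b):
    """Histogram-intersection kernel, usable as a precomputed SVM kernel."""
    return np.minimum(a, b).sum()

rng = np.random.default_rng(1)
h1 = spatial_pyramid_hist(rng.random((40, 2)), rng.integers(0, 10, 40), K=10)
h2 = spatial_pyramid_hist(rng.random((40, 2)), rng.integers(0, 10, 40), K=10)
```

With `levels=2` the feature has 1 + 4 + 16 = 21 cells of K bins each; in practice the kernel matrix over such features is passed to an SVM with `kernel="precomputed"`.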
Stacking-based Deep Neural Network: Deep Analytic Network on Convolutional Spectral Histogram Features
A stacking-based deep neural network (S-DNN) resembles a deep neural network (DNN) in its very deep, feedforward architecture. A typical S-DNN stacks a variable number of individually learnable modules in series to assemble a DNN-like alternative for the targeted object recognition tasks. This work devises an S-DNN instantiation, dubbed the deep analytic network (DAN), on top of spectral histogram (SH) features. The DAN learning principle relies on ridge regression together with some key DNN constituents, specifically the rectified linear unit, fine-tuning, and normalization. DAN is evaluated on three repositories of varying domains: FERET (faces), MNIST (handwritten digits), and CIFAR10 (natural objects). The empirical results show that DAN improves on the SH baseline performance once the network is sufficiently deep.
Comment: 5 pages
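The ridge-regression learning principle behind DAN admits a loose sketch: each stacked layer solves a closed-form ridge regression against the one-hot targets and passes its output through a ReLU, so no backpropagation is needed. This is a toy illustration on random data standing in for SH features, not the paper's actual architecture (which also involves fine-tuning and normalization):

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_weights(X, Y, lam=0.1):
    """Closed-form ridge regression: W = (X^T X + lam I)^{-1} X^T Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def forward(X, W):
    return np.maximum(X @ W, 0.0)         # ReLU, one of the DNN constituents mentioned

X = rng.normal(size=(100, 16))            # toy input features (stand-in for SH features)
y = rng.integers(0, 3, size=100)
Y = np.eye(3)[y]                          # one-hot class targets

feats = X
for _ in range(3):                        # stack a few analytically learned layers
    W = ridge_weights(feats, Y)
    feats = forward(feats, W)
pred = feats.argmax(1)
```

The appeal of this scheme is that every layer has a closed-form solution, so depth is added by repeated solves rather than gradient descent.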