9,997 research outputs found
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model
for generating novel image captions. It directly models the probability
distribution of generating a word given previous words and an image. Image
captions are generated by sampling from this distribution. The model consists
of two sub-networks: a deep recurrent neural network for sentences and a deep
convolutional network for images. These two sub-networks interact with each
other in a multimodal layer to form the whole m-RNN model. The effectiveness of
our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K,
Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In
addition, we apply the m-RNN model to retrieval tasks for retrieving images or
sentences, and achieves significant performance improvement over the
state-of-the-art methods which directly optimize the ranking objective function
for retrieval. The project page of this work is:
www.stat.ucla.edu/~junhua.mao/m-RNN.html .Comment: Add a simple strategy to boost the performance of image captioning
task significantly. More details are shown in Section 8 of the paper. The
code and related data are available at https://github.com/mjhucla/mRNN-CR ;.
arXiv admin note: substantial text overlap with arXiv:1410.109
Recurrent Multimodal Interaction for Referring Image Segmentation
In this paper we are interested in the problem of image segmentation given
natural language descriptions, i.e. referring expressions. Existing works
tackle this problem by first modeling images and sentences independently and
then segment images by combining these two types of representations. We argue
that learning word-to-image interaction is more native in the sense of jointly
modeling two modalities for the image segmentation task, and we propose
convolutional multimodal LSTM to encode the sequential interactions between
individual words, visual information, and spatial information. We show that our
proposed model outperforms the baseline model on benchmark datasets. In
addition, we analyze the intermediate output of the proposed multimodal LSTM
approach and empirically explain how this approach enforces a more effective
word-to-image interaction.Comment: To appear in ICCV 2017. See http://www.cs.jhu.edu/~cxliu/ for code
and supplementary materia
Multimodal Convolutional Neural Networks for Matching Image and Sentence
In this paper, we propose multimodal convolutional neural networks (m-CNNs)
for matching image and sentence. Our m-CNN provides an end-to-end framework
with convolutional architectures to exploit image representation, word
composition, and the matching relations between the two modalities. More
specifically, it consists of one image CNN encoding the image content, and one
matching CNN learning the joint representation of image and sentence. The
matching CNN composes words to different semantic fragments and learns the
inter-modal relations between image and the composed fragments at different
levels, thus fully exploit the matching relations between image and sentence.
Experimental results on benchmark databases of bidirectional image and sentence
retrieval demonstrate that the proposed m-CNNs can effectively capture the
information necessary for image and sentence matching. Specifically, our
proposed m-CNNs for bidirectional image and sentence retrieval on Flickr30K and
Microsoft COCO databases achieve the state-of-the-art performances.Comment: Accepted by ICCV 201
Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation
In the traditional object recognition pipeline, descriptors are densely
sampled over an image, pooled into a high dimensional non-linear representation
and then passed to a classifier. In recent years, Fisher Vectors have proven
empirically to be the leading representation for a large variety of
applications. The Fisher Vector is typically taken as the gradients of the
log-likelihood of descriptors, with respect to the parameters of a Gaussian
Mixture Model (GMM). Motivated by the assumption that different distributions
should be applied for different datasets, we present two other Mixture Models
and derive their Expectation-Maximization and Fisher Vector expressions. The
first is a Laplacian Mixture Model (LMM), which is based on the Laplacian
distribution. The second Mixture Model presented is a Hybrid Gaussian-Laplacian
Mixture Model (HGLMM) which is based on a weighted geometric mean of the
Gaussian and Laplacian distribution. An interesting property of the
Expectation-Maximization algorithm for the latter is that in the maximization
step, each dimension in each component is chosen to be either a Gaussian or a
Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we
achieve state-of-the-art results for both the image annotation and the image
search by a sentence tasks.Comment: new version includes text synthesis by an RNN and experiments with
the COCO benchmar
- …