Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
Video captioning is widely regarded as a fundamental but challenging task in both computer vision and artificial intelligence. The prevalent approach is to map an input video to a variable-length output sentence in a sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless, the training of RNNs still suffers to some degree from the vanishing/exploding gradient problem, making optimization difficult. Moreover, the inherently recurrent dependency in RNNs prevents parallelization within a sequence during training and therefore limits computational efficiency. In this paper, we present a novel design, Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed TDConvED), that fully employs convolutions in both the encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states over a fixed number of inputs and stack several blocks to capture long-term relationships. The encoder structure is further equipped with temporal deformable convolution to enable free-form deformation of temporal sampling. Our model also capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are conducted on both the MSVD and MSR-VTT video captioning datasets, and superior results are reported compared to conventional RNN-based encoder-decoder techniques. More remarkably, TDConvED increases CIDEr-D performance from 58.8% to 67.2% on MSVD.
Comment: AAAI 2019
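To make the temporal deformable convolution concrete, below is a minimal PyTorch sketch: each output step predicts a learnable offset for every kernel tap, samples the input sequence at the resulting fractional time positions by linear interpolation, and projects the gathered features. The class and layer names (TemporalDeformableConv, offset_conv, proj) and all shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TemporalDeformableConv(nn.Module):
    """Sketch of a 1D temporal deformable convolution block."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Predicts one temporal offset per kernel tap at every time step.
        self.offset_conv = nn.Conv1d(dim, kernel_size, kernel_size,
                                     padding=kernel_size // 2)
        # Projects the kernel_size * dim gathered features back to dim.
        self.proj = nn.Linear(kernel_size * dim, dim)

    def forward(self, x):                                   # x: (b, T, dim)
        b, t, d = x.shape
        offsets = self.offset_conv(x.transpose(1, 2))       # (b, k, T)
        # Sampling positions: t + (tap - k//2) + learned offset.
        base = torch.arange(t, device=x.device).view(1, 1, t).float()
        taps = torch.arange(self.kernel_size, device=x.device)
        taps = (taps - self.kernel_size // 2).view(1, -1, 1).float()
        pos = (base + taps + offsets).clamp(0, t - 1)       # (b, k, T)
        # Linear interpolation between the two nearest frames,
        # which keeps the offsets differentiable.
        lo, hi = pos.floor().long(), pos.ceil().long()
        w = (pos - lo.float()).unsqueeze(-1)                # (b, k, T, 1)
        gather = lambda idx: torch.gather(
            x.unsqueeze(1).expand(b, self.kernel_size, t, d), 2,
            idx.unsqueeze(-1).expand(b, self.kernel_size, t, d))
        sampled = (1 - w) * gather(lo) + w * gather(hi)     # (b, k, T, d)
        out = self.proj(sampled.permute(0, 2, 1, 3).reshape(b, t, -1))
        return out                                          # (b, T, dim)
```

Usage: `TemporalDeformableConv(512)(torch.randn(2, 26, 512))` maps a batch of 26-frame feature sequences to same-length outputs; stacking several such blocks widens the effective temporal receptive field, as the abstract describes.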
From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
Video captioning is in essence a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most decoders propagate deterministic hidden states, and such complex uncertainty cannot be modeled efficiently by deterministic models. We propose a generative approach, referred to as multi-modal stochastic RNNs (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. Therefore, MS-RNN can improve the performance of video captioning and generate multiple sentences to describe a video under different random factors. Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging MSVD and MSR-VTT datasets show that our proposed MS-RNN approach outperforms state-of-the-art video captioning methods on both benchmarks.
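The stochastic recurrence can be illustrated with a minimal PyTorch sketch of an LSTM cell augmented with a per-step latent variable sampled via the reparameterization trick. The names and sizes (StochasticLSTMCell, to_mu, to_logvar) are illustrative assumptions, not the paper's exact S-LSTM, which additionally runs backward over the sequence.

```python
import torch
import torch.nn as nn

class StochasticLSTMCell(nn.Module):
    """Sketch of an LSTM cell with a latent variable z at each step."""

    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        # Posterior parameters for z, inferred from input and hidden state.
        self.to_mu = nn.Linear(input_dim + hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(input_dim + hidden_dim, latent_dim)
        self.cell = nn.LSTMCell(input_dim + latent_dim, hidden_dim)

    def forward(self, x_t, state):
        h, c = state
        feat = torch.cat([x_t, h], dim=-1)
        mu, logvar = self.to_mu(feat), self.to_logvar(feat)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        h, c = self.cell(torch.cat([x_t, z], dim=-1), (h, c))
        # KL(q(z) || N(0, I)), added to the variational training objective.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return (h, c), kl
```

Re-sampling z at generation time is what lets such a model emit multiple distinct captions for the same video, while the accumulated KL terms combine with the caption log-likelihood to form a variational lower bound.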
MAT: A Multimodal Attentive Translator for Image Captioning
In this work we formulate the problem of image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Different from most existing work, where the whole image is represented by a convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects that serves as the source sequence of the RNN model. In this way, the sequential representation of an image can be naturally translated into a sequence of words, the target sequence of the RNN model. To represent the image sequentially, we extract object features from the image and arrange them in order using convolutional neural networks. To further leverage the visual information from the encoded objects, a sequential attention layer is introduced to selectively attend to the objects that are relevant to generating the corresponding words in the sentence. Extensive experiments are conducted to validate the proposed approach on the popular MS COCO benchmark dataset, and the proposed model surpasses the state-of-the-art methods in all metrics following the dataset splits of previous work. The proposed approach is also evaluated by the evaluation server of the MS COCO captioning challenge and achieves very
competitive results, e.g., a CIDEr of 1.029 (c5) and 1.064 (c40).
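A sequential attention layer of this kind can be sketched as standard additive attention over the detected object features, recomputed at every decoding step so each generated word attends to its most relevant objects. This is a minimal sketch with assumed names and dimensions (ObjectAttention, w_obj, w_hid), not the authors' exact layer.

```python
import torch
import torch.nn as nn

class ObjectAttention(nn.Module):
    """Sketch of additive attention over detected object features."""

    def __init__(self, obj_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_obj = nn.Linear(obj_dim, attn_dim)
        self.w_hid = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, objects, h_t):
        # objects: (batch, num_objects, obj_dim); h_t: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.w_obj(objects) + self.w_hid(h_t).unsqueeze(1)))  # (b, n, 1)
        alpha = torch.softmax(e, dim=1)           # weights over objects
        context = (alpha * objects).sum(dim=1)    # (batch, obj_dim)
        return context, alpha.squeeze(-1)
```

At each decoding step the context vector would be fed, together with the previous word embedding, into the decoder RNN, so that the word being generated is conditioned on the objects it most plausibly describes.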