Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
It is widely believed that video captioning is a fundamental but challenging
task in both computer vision and artificial intelligence fields. The prevalent
approach is to map an input video to a variable-length output sentence in a
sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless,
the training of RNNs still suffers to some degree from the vanishing/exploding
gradient problem, making the optimization difficult. Moreover, the inherently
recurrent dependency in RNNs prevents parallelization within a sequence during
training and therefore limits computational efficiency. In this paper, we present a
novel design --- Temporal Deformable Convolutional Encoder-Decoder Networks
(dubbed as TDConvED) that fully employ convolutions in both encoder and decoder
networks for video captioning. Technically, we exploit convolutional block
structures that compute intermediate states of a fixed number of inputs and
stack several blocks to capture long-term relationships. The structure in
encoder is further equipped with temporal deformable convolution to enable
free-form deformation of temporal sampling. Our model also capitalizes on
temporal attention mechanism for sentence generation. Extensive experiments are
conducted on both MSVD and MSR-VTT video captioning datasets, and superior
results are reported when compared to conventional RNN-based encoder-decoder
techniques. More remarkably, TDConvED increases CIDEr-D performance from 58.8%
to 67.2% on MSVD.
Comment: AAAI 201
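The idea of deformable temporal sampling mentioned in the abstract can be illustrated with a small sketch. This is not the TDConvED implementation: it assumes one scalar feature per frame, offsets that are supplied by hand (in a trained model they would be predicted from the input), linear interpolation between neighbouring frames, and zero padding outside the sequence. The names `sample_linear` and `temporal_deformable_conv` are hypothetical.

```python
import math


def sample_linear(seq, pos):
    """Linearly interpolate seq at a fractional position; zero outside the sequence."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return seq[i] if 0 <= i < len(seq) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def temporal_deformable_conv(seq, weights, offsets):
    """1-D convolution over time whose sampling positions are shifted by offsets.

    seq:     list of floats, one (assumed scalar) feature per frame
    weights: kernel of odd length k
    offsets: offsets[t][j] is the fractional shift of kernel tap j at time t
             (assumed given here; a learned model would predict them)
    """
    k = len(weights)
    half = k // 2
    out = []
    for t in range(len(seq)):
        total = 0.0
        for j in range(k):
            # Regular grid position plus the per-tap deformation.
            pos = t + (j - half) + offsets[t][j]
            total += weights[j] * sample_linear(seq, pos)
        out.append(total)
    return out


frames = [0.0, 1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 1.0, 1.0]

# Zero offsets reduce to an ordinary temporal convolution.
plain = temporal_deformable_conv(frames, kernel, [[0.0] * 3 for _ in frames])

# A constant half-frame shift moves every sampling point between two frames.
shifted = temporal_deformable_conv(frames, kernel, [[0.5] * 3 for _ in frames])
```

With all offsets at zero the function behaves like a fixed-grid convolution; non-zero offsets let each kernel tap read from a position between frames, which is the sense in which the sampling is "free-form" in this sketch.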
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural
language explicitly grounded in entities that object detectors find in the
image. Our approach reconciles classical slot filling approaches (that are
generally better grounded in images) with modern neural captioning approaches
(that are generally more natural sounding and accurate). Our approach first
generates a sentence `template' with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified in
the regions by object detectors. The entire architecture (sentence template
generation and slot filling with object detectors) is end-to-end
differentiable. We verify the effectiveness of our proposed model on different
image captioning tasks. On standard image captioning and novel object
captioning, our model reaches state-of-the-art on both COCO and Flickr30k
datasets. We also demonstrate that our model has unique advantages when the
train and test distributions of scene compositions -- and hence language priors
of associated captions -- are different. Code has been made available at:
https://github.com/jiasenlu/NeuralBabyTalk
Comment: 12 pages, 7 figures, CVPR 201
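The two-stage idea described above (generate a template with region-tied slots, then fill the slots with detector outputs) can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the code from the linked repository: the template and the detections are hard-coded rather than produced by neural networks, and the slot representation and the function name `fill_template` are hypothetical.

```python
def fill_template(template, detections):
    """Fill region-tied slots in a sentence template with detected labels.

    template:   list of tokens; a slot is a ("slot", region_id) tuple
                (hypothetical representation chosen for this sketch)
    detections: dict mapping region_id to the label an object detector
                assigned to that image region (assumed given)
    """
    words = []
    for token in template:
        if isinstance(token, tuple) and token[0] == "slot":
            # The slot is grounded in a specific region; use its label.
            words.append(detections[token[1]])
        else:
            words.append(token)
    return " ".join(words)


# Hand-written stand-ins for the template generator and the detector.
template = ["a", ("slot", 0), "sitting", "on", "a", ("slot", 1)]
detections = {0: "puppy", 1: "couch"}

caption = fill_template(template, detections)
```

Because the words in the slots come from the detector rather than from the language model, swapping the detections changes the caption without touching the template, which is one way to read the abstract's point about differing train and test scene compositions.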
Image Captioning with Unseen Objects
Image caption generation is a long standing and challenging problem at the
intersection of computer vision and natural language processing. A number of
recently proposed approaches utilize a fully supervised object recognition
model within the captioning approach. Such models, however, tend to generate
sentences which only consist of objects predicted by the recognition models,
excluding instances of the classes without labelled training examples. In this
paper, we propose a new challenging scenario that targets the image captioning
problem in a fully zero-shot learning setting, where the goal is to be able to
generate captions of test images containing objects that are not seen during
training. The proposed approach jointly uses a novel zero-shot object detection
model and a template-based sentence generator. Our experiments show promising
results on the COCO dataset.
Comment: To appear in British Machine Vision Conference (BMVC) 201