136,836 research outputs found
From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
Video captioning in essential is a complex natural process, which is affected
by various uncertainties stemming from video content, subjective judgment, etc.
In this paper we build on the recent progress in using encoder-decoder
framework for video captioning and address what we find to be a critical
deficiency of the existing methods, that most of the decoders propagate
deterministic hidden states. Such complex uncertainty cannot be modeled
efficiently by the deterministic models. In this paper, we propose a generative
approach, referred to as multi-modal stochastic RNNs networks (MS-RNN), which
models the uncertainty observed in the data using latent stochastic variables.
Therefore, MS-RNN can improve the performance of video captioning, and generate
multiple sentences to describe a video considering different random factors.
Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with
both visual and textual features to capture a high-level representation. Then,
a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty
propagation by introducing latent variables. Experimental results on the
challenging datasets MSVD and MSR-VTT show that our proposed MS-RNN approach
outperforms the state-of-the-art video captioning benchmarks
What value do explicit high level concepts have in vision to language problems?
Much of the recent progress in Vision-to-Language (V2L) problems has been
achieved through a combination of Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs). This approach does not explicitly represent
high-level semantic concepts, but rather seeks to progress directly from image
features to text. We propose here a method of incorporating high-level concepts
into the very successful CNN-RNN approach, and show that it achieves a
significant improvement on the state-of-the-art performance in both image
captioning and visual question answering. We also show that the same mechanism
can be used to introduce external semantic information and that doing so
further improves performance. In doing so we provide an analysis of the value
of high level semantic information in V2L problems.Comment: Accepted to IEEE Conf. Computer Vision and Pattern Recognition 2016.
Fixed titl
- …