Excitation Backprop for RNNs
Deep models are state-of-the-art for many vision tasks including video action
recognition and video captioning. Models are trained to caption or classify
activity in videos, but little is known about the evidence used to make such
decisions. Grounding decisions made by deep networks has been studied in
spatial visual content, giving more insight into model predictions for images.
However, such studies are relatively lacking for models of spatiotemporal
visual content - videos. In this work, we devise a formulation that
simultaneously grounds evidence in space and time, in a single pass, using
top-down saliency. We visualize the spatiotemporal cues that contribute to a
deep model's classification/captioning output using the model's internal
representation. Based on these spatiotemporal cues, we are able to localize
segments within a video that correspond with a specific action, or phrase from
a caption, without explicitly optimizing/training for these tasks.
Comment: CVPR 2018 Camera Ready Version
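The top-down saliency referenced above builds on excitation backprop, which redistributes an output unit's winning probability layer by layer using only positive (excitatory) weights. Below is a minimal sketch of one such step through a fully connected layer, in PyTorch; the function and tensor names are illustrative assumptions, and the paper's full method chains this rule through the whole CNN-RNN in a single top-down pass.

```python
# Minimal sketch of one excitation-backprop step through a fully
# connected layer, following the marginal-winning-probability rule:
#   P(a_i) = sum_j [ a_i * w_ji^+ / (sum_k a_k * w_jk^+) ] * P(a_j)
# Names are illustrative; the paper applies this recursively through
# the network, including the recurrent (temporal) connections.
import torch

def excitation_backprop_step(activations: torch.Tensor,  # (in_dim,) non-negative inputs
                             weight: torch.Tensor,       # (out_dim, in_dim) layer weights
                             prior: torch.Tensor,        # (out_dim,) top-down probabilities
                             eps: float = 1e-8) -> torch.Tensor:
    """Redistribute top-down relevance `prior` to the input units."""
    w_pos = weight.clamp(min=0.0)                  # keep excitatory weights only
    contrib = activations.unsqueeze(0) * w_pos     # (out_dim, in_dim) evidence
    norm = contrib.sum(dim=1, keepdim=True) + eps  # per-output normalizer
    return (contrib / norm * prior.unsqueeze(1)).sum(dim=0)  # (in_dim,)
```

Applied recursively down to the input frames, the resulting per-location probabilities form the spatiotemporal saliency maps used for localization.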
From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
Video captioning is in essence a complex natural process, affected by various
uncertainties stemming from video content, subjective judgment, etc. In this
paper we build on recent progress in encoder-decoder frameworks for video
captioning and address what we find to be a critical deficiency of existing
methods: most decoders propagate deterministic hidden states, which cannot
model such complex uncertainty efficiently. We propose a generative approach,
referred to as the multi-modal stochastic RNN network (MS-RNN), which
models the uncertainty observed in the data using latent stochastic variables.
Therefore, MS-RNN can improve the performance of video captioning, and generate
multiple sentences to describe a video considering different random factors.
Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with
both visual and textual features to capture a high-level representation. Then,
a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty
propagation by introducing latent variables. Experimental results on the
challenging datasets MSVD and MSR-VTT show that our proposed MS-RNN approach
outperforms state-of-the-art video captioning methods.
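To make the latent-variable mechanism concrete, the following is a hedged PyTorch sketch of one stochastic decoder step: a Gaussian latent variable is sampled from the hidden state via the reparameterization trick and injected into word prediction, so repeated decoding yields different captions. The module and its names are assumptions for illustration, not the paper's exact M-LSTM/S-LSTM formulation.

```python
# Sketch of a stochastic LSTM decoder step with a latent Gaussian
# variable z_t (reparameterization trick). Training would add a KL
# term on (mu, logvar); names here are illustrative.
import torch
import torch.nn as nn

class StochasticLSTMStep(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + latent_dim, hidden_dim)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # posterior mean
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # posterior log-variance
        self.out = nn.Linear(hidden_dim + latent_dim, vocab_size)

    def forward(self, x, h, c):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sample z_t
        h, c = self.cell(torch.cat([x, z], dim=-1), (h, c))
        logits = self.out(torch.cat([h, z], dim=-1))              # word scores
        return logits, h, c, (mu, logvar)
```

Because z is resampled at every step, decoding the same video several times naturally produces the multiple candidate sentences described above.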
Attend and Interact: Higher-Order Object Interactions for Video Understanding
Human actions often involve complex interactions across several inter-related
objects in the scene. However, existing approaches to fine-grained video
understanding or visual relationship detection often rely on single object
representation or pairwise object relationships. Furthermore, learning
interactions across multiple objects in hundreds of frames for video is
computationally infeasible and performance may suffer since a large
combinatorial space has to be modeled. In this paper, we propose to efficiently
learn higher-order interactions between arbitrary subgroups of objects for
fine-grained video understanding. We demonstrate that modeling object
interactions significantly improves accuracy for both action recognition and
video captioning, while cutting computation by more than a factor of three
compared with traditional pairwise relationship modeling. The proposed
method is validated on two
large-scale datasets: Kinetics and ActivityNet Captions. Our SINet and
SINet-Caption achieve state-of-the-art performance on both datasets even
though the videos are sampled at a maximum of 1 FPS. To the best of our
knowledge, this is the first work to model object interactions on open-domain
large-scale video datasets, and we additionally model higher-order object
interactions, which improve performance at low computational cost.
Comment: CVPR 2018
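The efficiency claim rests on attending over a small number of soft object subgroups instead of enumerating all object pairs. The PyTorch module below is an illustrative sketch under assumed names, not the authors' exact SINet architecture: K attention heads pool N object features into K group embeddings, and one interaction vector is computed from those, so cost grows with K rather than N^2.

```python
# Sketch of grouped object-interaction modeling: K soft attention
# groups over N objects replace O(N^2) pairwise terms. Illustrative
# only; SINet's recurrent higher-order unit differs in detail.
import torch
import torch.nn as nn

class GroupedObjectInteraction(nn.Module):
    def __init__(self, obj_dim: int, num_groups: int = 4):
        super().__init__()
        self.score = nn.Linear(obj_dim, num_groups)          # per-object group logits
        self.mix = nn.Linear(obj_dim * num_groups, obj_dim)  # interaction over groups

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (batch, num_objects, obj_dim) detected-object features
        attn = torch.softmax(self.score(objects), dim=1)      # (B, N, K) soft grouping
        groups = torch.einsum('bnk,bnd->bkd', attn, objects)  # (B, K, obj_dim)
        return torch.relu(self.mix(groups.flatten(start_dim=1)))  # (B, obj_dim)
```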