Spatio-Temporal Ranked-Attention Networks for Video Captioning
Generating video descriptions automatically is a challenging task that
involves a complex interplay between spatio-temporal visual features and
language models. Given that videos consist of spatial (frame-level) features
and their temporal evolutions, an effective captioning model should be able to
attend to these different cues selectively. To this end, we propose a
Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned
on the language state, hierarchically combines spatial and temporal attention
to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which
first attends to regions that have temporal evolution, then temporally pools
the features from these regions; and (ii) a temporo-spatial (TS) sub-model,
which first decides a single frame to attend to, then applies spatial attention
within that frame. We propose a novel LSTM-based temporal ranking function,
which we call ranked attention, for the ST model to capture action dynamics.
Our entire framework is trained end-to-end. We provide experiments on two
benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy
between the ST and TS modules, outperforming recent state-of-the-art methods.
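
As a rough illustration of the two attention orders, the following PyTorch sketch conditions simple dot-product spatial and temporal attention on a language state h; the shapes, scoring functions, and names are illustrative assumptions rather than the authors' implementation, and the ranked-attention LSTM is not modeled here.

import torch
import torch.nn.functional as F

def spatio_temporal(feats, h):
    # feats: (T, R, D) region features per frame; h: (D,) current language state
    spatial = F.softmax(feats @ h, dim=1)                    # (T, R): attend to regions in each frame
    region_ctx = (spatial.unsqueeze(-1) * feats).sum(dim=1)  # (T, D)
    temporal = F.softmax(region_ctx @ h, dim=0)              # (T,): then pool over time
    return (temporal.unsqueeze(-1) * region_ctx).sum(dim=0)  # (D,)

def temporo_spatial(feats, h):
    # first softly select a single frame, then attend spatially within it
    frame_w = F.softmax(feats.mean(dim=1) @ h, dim=0)        # (T,)
    frame = (frame_w.view(-1, 1, 1) * feats).sum(dim=0)      # (R, D)
    spatial = F.softmax(frame @ h, dim=0)                    # (R,)
    return (spatial.unsqueeze(-1) * frame).sum(dim=0)        # (D,)

T, R, D = 8, 16, 512
feats, h = torch.randn(T, R, D), torch.randn(D)
context = torch.cat([spatio_temporal(feats, h), temporo_spatial(feats, h)])  # passed to the decoder
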
Meaning guided video captioning
Current video captioning approaches often miss objects present in the video being described and fail to generate captions that are semantically similar to the ground-truth sentences. In this paper, we propose a new approach to video captioning that describes objects found by an object detector and generates captions whose meaning is close to that of the correct captions. Our model
relies on S2VT, a sequence-to-sequence model for video captioning. Given a
sequence of video frames, the encoding RNN takes a frame as well as detected
objects in the frame to incorporate information about the objects in the scene. The outputs of the subsequent decoding RNN are then fed into an attention layer and then into a decoder that generates the captions. The generated caption is compared with the ground truth via metric learning so that the vector representations of generated captions are semantically similar to those of the ground truth.
Experimental results on the MSVD dataset demonstrate that the proposed approach performs much better than the model without the meaning-guided framework, showing the effectiveness of the proposed model. Code is publicly available at https://github.com/captanlevi/Meaning-guided-video-captioning-.
Comment: The 5th Asian Conference on Pattern Recognition (ACPR 2019).
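
The meaning-guided objective can be pictured as a standard word-level cross-entropy loss plus a metric-learning term on caption embeddings; the sketch below is a hypothetical PyTorch rendering (the names gen_emb, gt_emb, neg_emb, margin, and alpha are placeholders, and the paper's exact loss may differ).

import torch
import torch.nn.functional as F

def meaning_guided_loss(word_logits, target_ids, gen_emb, gt_emb, neg_emb,
                        margin=0.2, alpha=1.0):
    # standard cross-entropy over the decoder's word predictions
    ce = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                         target_ids.reshape(-1))
    # metric-learning term: the generated caption's embedding should be closer
    # to its ground-truth embedding than to a caption from another video
    pos = 1.0 - F.cosine_similarity(gen_emb, gt_emb, dim=-1)
    neg = 1.0 - F.cosine_similarity(gen_emb, neg_emb, dim=-1)
    metric = torch.clamp(pos - neg + margin, min=0.0).mean()
    return ce + alpha * metric
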
SBAT: Video Captioning with Sparse Boundary-Aware Transformer
In this paper, we focus on the problem of applying the transformer structure
to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and video features contain much redundancy across different time steps. Based on these concerns, we propose a
novel method called sparse boundary-aware transformer (SBAT) to reduce the
redundancy in the video representation. SBAT employs a boundary-aware pooling operation on the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss caused by the sparse operation. Based on
SBAT, we further propose an aligned cross-modal encoding scheme to boost the
multimodal interaction. Experimental results on two benchmark datasets show
that SBAT outperforms the state-of-the-art methods under most of the metrics.
Comment: Appearing at IJCAI 2020.
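
A hedged PyTorch sketch of sparse, boundary-aware selection over attention scores: boundaries are scored by how much adjacent frame features change, and only the top-k keys per query are kept. The exact pooling rule in SBAT may differ, and all names here are illustrative.

import torch

def boundary_aware_sparse_attention(q, k, v, feats, topk=4):
    # q, k, v: (T, D) projected features; feats: (T, D) raw frame features
    scores = q @ k.t() / k.size(-1) ** 0.5              # (T, T) attention logits
    # boundary score: magnitude of content change between adjacent time steps
    diff = (feats[1:] - feats[:-1]).norm(dim=-1)
    boundary = torch.cat([diff.new_zeros(1), diff])     # (T,)
    biased = scores + boundary.unsqueeze(0)             # favour boundary frames
    idx = biased.topk(topk, dim=-1).indices             # keep only top-k keys per query
    mask = torch.full_like(scores, float('-inf')).scatter(-1, idx, 0.0)
    attn = torch.softmax(scores + mask, dim=-1)         # sparse attention weights
    return attn @ v

T, D = 12, 64
q, k, v, feats = (torch.randn(T, D) for _ in range(4))
out = boundary_aware_sparse_attention(q, k, v, feats)   # (T, D)
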
Deep hierarchical pooling design for cross-granularity action recognition
In this paper, we introduce a novel hierarchical aggregation design that
captures different levels of temporal granularity in action recognition. Our
design principle is coarse-to-fine and achieved using a tree-structured
network; as we traverse this network top-down, the pooling operations become less invariant but temporally more resolute and better localized. The combination of operations in this network that best fits a given ground truth is learned by solving a constrained minimization problem, whose solution corresponds to the distribution of weights that captures the contribution of each level (and thereby each temporal granularity) to the global hierarchical pooling process. Besides being principled and well grounded, the
proposed hierarchical pooling is also video-length agnostic and resilient to
misalignments in actions. Extensive experiments conducted on the challenging
UCF-101 database corroborate these statements.
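
A minimal sketch of coarse-to-fine temporal pooling with level weights constrained to a simplex via softmax; this is an illustrative PyTorch approximation, not the paper's tree-structured network or constrained solver.

import torch
import torch.nn as nn

class HierarchicalPooling(nn.Module):
    def __init__(self, levels=3):
        super().__init__()
        self.levels = levels
        self.level_logits = nn.Parameter(torch.zeros(levels))   # learned level weights

    def forward(self, feats):                      # feats: (T, D)
        pooled = []
        for l in range(self.levels):               # level l splits the video into 2**l chunks
            chunks = feats.chunk(2 ** l, dim=0)
            pooled.append(torch.cat([c.mean(dim=0) for c in chunks]))
        w = torch.softmax(self.level_logits, dim=0)              # weights on a simplex
        # coarse levels are more invariant; finer levels are better localized
        return torch.cat([w[l] * pooled[l] for l in range(self.levels)])

x = torch.randn(8, 512)                            # 8 time steps of 512-d features
video_descriptor = HierarchicalPooling(levels=3)(x)
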
Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge
Audio-Visual Scene-Aware Dialog (AVSD) is an extension of Video Question Answering (QA) in which the dialogue agent is required to generate natural language responses to address user queries and carry on conversations. This is a challenging task as it involves video features of multiple modalities, including text, visual, and audio features. The agent also needs to learn
semantic dependencies among user utterances and system responses to make
coherent conversations with humans. In this work, we describe our submission to
the AVSD track of the 8th Dialogue System Technology Challenge. We adopt
dot-product attention to combine text and non-text features of input video. We
further enhance the generation capability of the dialogue agent by adopting
pointer networks to point to tokens from multiple source sequences in each
generation step. Our systems achieve high performance in automatic metrics and
obtain 5th and 6th place in human evaluation among all submissions.
Comment: Accepted at the DSTC Workshop at AAAI 2020.
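
The pointer mechanism can be sketched as mixing the decoder's vocabulary distribution with a copy distribution accumulated from attention over source tokens; the PyTorch snippet below is illustrative (the names and the single-source simplification are assumptions, not the submission's code).

import torch
import torch.nn.functional as F

def pointer_mix(vocab_logits, attn_weights, src_token_ids, p_gen):
    # vocab_logits: (V,)  attn_weights: (S,)  src_token_ids: (S,)  p_gen: scalar in (0, 1)
    vocab_dist = F.softmax(vocab_logits, dim=-1)
    # accumulate attention mass onto the vocabulary ids of the source tokens
    copy_dist = torch.zeros_like(vocab_dist).scatter_add(0, src_token_ids, attn_weights)
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist       # final word distribution

vocab_logits = torch.randn(10)
attn = torch.softmax(torch.randn(4), dim=0)
src_ids = torch.tensor([2, 5, 5, 7])
dist = pointer_mix(vocab_logits, attn, src_ids, p_gen=0.7)      # sums to 1 over the vocabulary
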
Vision and Language: from Visual Perception to Content Creation
Vision and language are two fundamental capabilities of human intelligence.
Humans routinely perform tasks through the interactions between vision and
language, supporting the uniquely human capacity to talk about what they see or to hallucinate a picture from a natural-language description. The question of how language interacts with vision motivates researchers to expand the horizons of the computer vision field. In particular, "vision to language" is
probably one of the most popular topics in the past five years, with significant growth in both the volume of publications and the range of applications,
e.g., captioning, visual question answering, visual dialog, language
navigation, etc. Such tasks boost visual perception with more comprehensive
understanding and diverse linguistic representations. Going beyond the progress made in "vision to language," language can also contribute to visual understanding and offer new possibilities for visual content creation, i.e., "language to vision." This process acts as a prism through which visual content is created, conditioned on the language inputs. This paper reviews the
recent advances along these two dimensions: "vision to language" and "language
to vision." More concretely, the former mainly focuses on the development of
image/video captioning, as well as typical encoder-decoder structures and
benchmarks, while the latter summarizes the technologies of visual content
creation. Real-world deployments and services of vision and language are elaborated as well.
Empirical Autopsy of Deep Video Captioning Frameworks
Contemporary deep-learning-based video captioning follows the encoder-decoder framework. In the encoder, visual features are extracted with 2D/3D Convolutional
Neural Networks (CNNs) and a transformed version of those features is passed to
the decoder. The decoder uses word embeddings and a language model to map
visual features to natural language captions. Due to its composite nature, the
encoder-decoder pipeline offers multiple choices for each of its components, e.g., the choice of CNN model, feature transformation, word embedding, and language model. Component selection can have drastic effects on the overall video captioning performance. However, the current literature is devoid of any systematic investigation in this regard. This article
fills this gap by providing the first thorough empirical analysis of the role
that each major component plays in a contemporary video captioning pipeline. We
perform extensive experiments by varying the constituent components of the
video captioning framework, and quantify the performance gains that are
possible by mere component selection. We use the popular MSVD dataset as the
test-bed, and demonstrate that substantial performance gains are possible by
careful selection of the constituent components without major changes to the
pipeline itself. These results are expected to provide guiding principles for
future research in the fast-growing direction of video captioning.
Comment: 9 pages, 5 figures.
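
The kind of component sweep such a study performs can be pictured as a simple grid over interchangeable choices; the component names below are common examples chosen for illustration, not necessarily those evaluated in the paper.

from itertools import product

cnn_features    = ["InceptionV3", "ResNet-152", "C3D"]       # visual feature extractors
transforms      = ["mean-pool", "temporal-encoding"]         # feature transformations
word_embeddings = ["random", "GloVe", "word2vec"]
language_models = ["LSTM", "GRU"]

for combo in product(cnn_features, transforms, word_embeddings, language_models):
    # in the actual study, each configuration would be trained and scored on MSVD
    print("configuration:", combo)
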
Relational Reasoning using Prior Knowledge for Visual Captioning
Exploiting relationships among objects has achieved remarkable progress in
interpreting images or videos by natural language. Most existing methods resort
to first detecting objects and their relationships, and then generating textual
descriptions, which heavily depends on pre-trained detectors and leads to performance drops under heavy occlusion, tiny objects, and long-tail distributions in object detection. In addition, the separate procedure of detecting
and captioning results in semantic inconsistency between the pre-defined
object/relation categories and the target lexical words. We exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors and to achieve semantic coherency within an image or video during captioning. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints between objects that are not explicit in the image or video, serving as useful guidance for building a semantic graph for sentence generation. In particular, we present a joint reasoning method that incorporates 1) commonsense reasoning for embedding image or video regions into a semantic space to build a semantic graph, and 2) relational reasoning for encoding the semantic graph to generate sentences. Extensive
experiments on the MS-COCO image captioning benchmark and the MSVD video
captioning benchmark validate the superiority of our method in leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.
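
A toy PyTorch sketch of the two stages, embedding region features into a semantic space and then propagating information over a commonsense adjacency matrix before decoding; the module names, shapes, and single graph layer are illustrative assumptions.

import torch
import torch.nn as nn

class SemanticGraphReasoning(nn.Module):
    def __init__(self, feat_dim, sem_dim):
        super().__init__()
        self.embed = nn.Linear(feat_dim, sem_dim)    # commonsense-guided semantic embedding
        self.relate = nn.Linear(sem_dim, sem_dim)    # relational reasoning step

    def forward(self, regions, adjacency):
        # regions: (N, feat_dim) image/video region features
        # adjacency: (N, N) semantic correlations, e.g. derived from a knowledge graph
        nodes = torch.relu(self.embed(regions))
        adjacency = adjacency / adjacency.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.relate(adjacency @ nodes))   # graph-encoded features for the decoder

regions = torch.randn(5, 2048)                       # 5 region features
adj = torch.rand(5, 5)                               # placeholder commonsense correlations
graph_feats = SemanticGraphReasoning(2048, 512)(regions, adj)
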
Diverse Video Captioning Through Latent Variable Expansion with Conditional GAN
Automatically describing video content with a text description is a challenging but important task that has been attracting a lot of attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring their diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, for a given video, the intermediate latent variables of the conventional encoder-decoder process are used as input to a conditional generative adversarial network (CGAN) in order to generate diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator, which produces descriptions conditioned on the latent variables, and as our discriminator, which assesses the quality of the generated sentences. Simultaneously, a novel DCE metric is
designed to assess the diverse captions. We evaluate our method on the
benchmark datasets, where it demonstrates its ability to generate diverse
descriptions and achieves superior results over other state-of-the-art methods.
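
The adversarial part can be sketched as a standard conditional-GAN update on the decoder's latent variables; the toy generator and discriminator below are plain linear stacks for brevity (the paper uses CNNs), and all sizes and names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

# toy stand-ins for the generator and discriminator (hypothetical sizes)
gen = nn.Sequential(nn.Linear(512 + 512, 512), nn.ReLU(), nn.Linear(512, 300))
disc = nn.Sequential(nn.Linear(512 + 300, 256), nn.ReLU(), nn.Linear(256, 1))

def cgan_step(video_latent, real_caption_emb):
    noise = torch.randn_like(video_latent)                    # varying the noise yields diverse captions
    fake = gen(torch.cat([video_latent, noise], dim=-1))
    d_real = disc(torch.cat([video_latent, real_caption_emb], dim=-1))
    d_fake = disc(torch.cat([video_latent, fake.detach()], dim=-1))
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    g_out = disc(torch.cat([video_latent, fake], dim=-1))
    g_loss = F.binary_cross_entropy_with_logits(g_out, torch.ones_like(g_out))
    return d_loss, g_loss

d_loss, g_loss = cgan_step(torch.randn(4, 512), torch.randn(4, 300))
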
Video Captioning Using Weak Annotation
Video captioning has shown impressive progress in recent years. One key
reason for the performance improvements made by existing methods lies in massive paired video-sentence data, but collecting such strong annotation, i.e., high-quality sentences, is time-consuming and laborious. In fact, there now exists a huge number of videos with weak annotation that only contains semantic concepts such as actions and objects. In this paper, we
investigate using weak annotation instead of strong annotation to train a video
captioning model. To this end, we propose a progressive visual reasoning method
that progressively generates fine sentences from weak annotations by inferring
more semantic concepts and their dependency relationships for video captioning.
To model concept relationships, we use dependency trees that are spanned by
exploiting external knowledge from large sentence corpora. Through traversing
the dependency trees, the sentences are generated to train the captioning
model. Accordingly, we develop an iterative refinement algorithm that refines
sentences via spanning dependency trees and fine-tunes the captioning model
using the refined sentences in an alternating training manner. Experimental results demonstrate that our method using weak annotation is very competitive with the state-of-the-art methods using strong annotation.
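
As a toy, self-contained illustration of generating a pseudo sentence by traversing a dependency tree over weak concepts (the actual trees are spanned with external corpus knowledge and refined iteratively, which is not modeled here):

# a tiny hand-built dependency tree rooted at the action concept
tree = {"word": "slicing", "children": [
    {"word": "man", "children": []},                               # subject concept
    {"word": "onion", "children": [{"word": "an", "children": []}]},  # object concept
]}

def traverse(node):
    # emit the first child (e.g. the subject), then the node, then remaining children
    words = []
    kids = node["children"]
    if kids:
        words += traverse(kids[0])
    words.append(node["word"])
    for child in kids[1:]:
        words += traverse(child)
    return words

print(" ".join(traverse(tree)))   # "man slicing an onion" -> pseudo training sentence
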