Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding
Most existing video-language pre-training methods focus on instance-level
alignment between video clips and captions via global contrastive learning but
neglect rich fine-grained local information, which is important for
downstream tasks requiring temporal localization and semantic reasoning. In
this work, we propose a simple yet effective video-language pre-training
framework, namely G-ViLM, to learn discriminative spatiotemporal features. Two
novel designs involving spatiotemporal grounding and temporal grouping promote
learning local region-noun alignment and temporal-aware features
simultaneously. Specifically, spatiotemporal grounding aggregates semantically
similar video tokens and aligns them with noun phrases extracted from the
caption to promote local region-noun correspondences. Moreover, temporal
grouping leverages cut-and-paste to manually create temporal scene changes and
then learns distinguishable features from different scenes. Comprehensive
evaluations demonstrate that G-ViLM performs favorably against existing
approaches on four representative downstream tasks, covering text-video
retrieval, video question answering, video action recognition and temporal
action localization. G-ViLM performs competitively on all evaluated tasks and
in particular achieves R@10 of 65.1 on zero-shot MSR-VTT retrieval, over 9%
higher than the state-of-the-art method.
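As a rough illustration of the cut-and-paste idea described above, the sketch below splices two clips at a chosen frame to manufacture an artificial temporal scene change. The function name, tensor shapes, and per-frame label scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cut_and_paste(clip_a, clip_b, cut_point):
    """Splice the tail of clip_b onto the head of clip_a, creating an
    artificial scene change at frame index `cut_point`.

    clip_a, clip_b: frame arrays of shape (T, H, W, C).
    Returns the spliced clip and a per-frame scene label (0 or 1) that a
    temporal-grouping objective could use as supervision.
    """
    assert clip_a.shape == clip_b.shape
    num_frames = clip_a.shape[0]
    spliced = np.concatenate([clip_a[:cut_point], clip_b[cut_point:]], axis=0)
    labels = np.array([0] * cut_point + [1] * (num_frames - cut_point))
    return spliced, labels
```

A model trained on such spliced clips can then be asked to group frames by scene, encouraging temporally distinguishable features.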
Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering
Video question answering (VideoQA) is challenging given its multimodal
combination of visual understanding and natural language processing. Most
existing approaches ignore visual appearance-motion information at
different temporal scales, and it remains unclear how to combine the multilevel
processing capacity of a deep learning model with such multiscale information.
Targeting these issues, this paper proposes a novel Multilevel Hierarchical
Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules,
namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning
(PVR). With multiscale sampling, RMI iterates the interaction between
appearance-motion information at each scale and the question embeddings to
build multilevel question-guided visual representations. Then, with a
shared transformer encoder, PVR infers visual cues at each level in
parallel to answer different question types that may rely on
visual information at the relevant levels. Through extensive experiments on three
VideoQA datasets, we demonstrate improved performance over previous
state-of-the-art methods and justify the effectiveness of each component of our method.
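The multiscale sampling that feeds RMI could be sketched as follows; the strides, clip length, and function name are illustrative assumptions rather than the paper's actual sampling scheme.

```python
def multiscale_sample(num_frames, scales=(1, 2, 4), clip_len=8):
    """Return frame-index lists at several temporal scales.

    For each stride s in `scales`, sample `clip_len` indices spaced s
    frames apart (clipped to the video length), yielding coarse-to-fine
    temporal views of the same video.
    """
    samples = {}
    for s in scales:
        samples[s] = [min(i * s, num_frames - 1) for i in range(clip_len)]
    return samples
```

Each scale's frame set would then be encoded and interacted with the question embeddings at the corresponding level of the hierarchy.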
Video Content Understanding Using Text
The rise of the social media and video streaming industries has provided a plethora of videos and their corresponding descriptive information in the form of concepts (words) and textual video captions. Given this vast amount of available video and textual data, there has never been a better time to study Computer Vision and Machine Learning problems related to videos and text. In this dissertation, we tackle multiple problems associated with the joint understanding of videos and text. We first address the task of multi-concept video retrieval, where the input is a set of words as concepts and the output is a ranked list of full-length videos. This approach handles multi-concept input and the prolonged length of videos by incorporating multiple latent variables that tie together the information within each shot (a short clip of a full video) and across shots. Secondly, we address the problem of video question answering, in which the task is to answer a question, posed in the form of Fill-In-the-Blank (FIB), given a video. Answering such a question amounts to retrieving a word from a dictionary (all possible words suitable as an answer) based on the input question and video. Following the FIB problem, we introduce a new problem, called Visual Text Correction (VTC): detecting and replacing an inaccurate word in the textual description of a video. We propose a deep network that simultaneously detects an inaccuracy in a sentence, using 1D-CNNs/LSTMs to encode short- and long-term dependencies, and fixes it by replacing the inaccurate word(s). Finally, in the last part of the dissertation, we tackle the problem of video generation from user-provided natural language sentences. Our proposed method constructs two distributions from the input text, corresponding to the latent representations of the first and last frames, and generates high-fidelity videos by interpolating between the latent representations and decoding them with a sequence of CNN-based up-pooling blocks.
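The latent-interpolation step of the video generation approach can be sketched minimally as below, assuming a simple linear interpolation between the first- and last-frame latent codes; in the actual method each interpolated code would then be decoded into a frame by the CNN-based up-pooling blocks.

```python
import numpy as np

def interpolate_latents(z_first, z_last, num_frames):
    """Linearly interpolate between first- and last-frame latent codes.

    z_first, z_last: latent vectors of shape (D,). Returns an array of
    shape (num_frames, D) whose rows move evenly from z_first to z_last;
    each row corresponds to the latent code of one generated frame.
    """
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]  # (num_frames, 1)
    return (1.0 - alphas) * z_first + alphas * z_last
```

Sampling `z_first` and `z_last` from the two text-conditioned distributions and decoding each interpolated row would yield a temporally smooth video between the two endpoint frames.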