Tripping through time: Efficient Localization of Activities in Videos
Localizing moments in untrimmed videos via language queries is a new and
interesting task that requires the ability to accurately ground language into
video. Previous works have approached this task by processing the entire video,
often more than once, to localize relevant activities. In real-world
applications of this task, such as video surveillance, efficiency is a key
system requirement. In this paper, we present TripNet, an end-to-end system
that uses a gated attention architecture to model fine-grained textual and
visual representations in order to align text and video content. Furthermore,
TripNet uses reinforcement learning to efficiently localize relevant activity
clips in long videos by learning how to intelligently skip around the video. It
extracts visual features for only a few frames to perform activity classification.
In our evaluation over Charades-STA, ActivityNet Captions and the TACoS
dataset, we find that TripNet achieves high accuracy and saves processing time
by looking at only 32-41% of the entire video.
Comment: Presented at BMVC, 202
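To make the skip-around behaviour concrete, here is a minimal sketch of an observe-and-skip loop of the kind the abstract describes. The action set, window length, and stub policy are illustrative assumptions, not the authors' implementation:

```python
import random

# Observe a short window, decide to skip or stop; only visited windows are featurized.
ACTIONS = [-30, -10, 10, 30, "STOP"]  # skip backward/forward N frames, or stop

def skip_policy(window_features, query):
    # Stub for the learned policy: a real agent would score actions from
    # gated-attention features of the current window fused with the query.
    return random.choice(ACTIONS)

def extract_features(video, start, length=16):
    # Stub feature extractor over a short window of frames.
    return video[start:start + length]

def localize(video, query, max_steps=20):
    pos, frames_seen = len(video) // 2, 0
    for _ in range(max_steps):
        window = extract_features(video, pos)
        frames_seen += len(window)
        action = skip_policy(window, query)
        if action == "STOP":
            break
        pos = min(max(pos + action, 0), len(video) - 16)
    # Return the stopping position and the fraction of frames ever featurized.
    return pos, frames_seen / max(len(video), 1)

video = list(range(3000))  # stand-in for a 3000-frame video
print(localize(video, "person opens the door"))
```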
Span-based Localizing Network for Natural Language Video Localization
Given an untrimmed video and a text query, natural language video
localization (NLVL) is to locate a matching span from the video that
semantically corresponds to the query. Existing solutions formulate NLVL either
as a ranking task, applying a multimodal matching architecture, or as a
regression task that directly regresses the target video span. In this work, we
address the NLVL task with a span-based QA approach by treating the input video as
a text passage. We propose a video span localizing network (VSLNet), on top of
the standard span-based QA framework, to address NLVL. The proposed VSLNet
tackles the differences between NLVL and span-based QA through a simple yet
effective query-guided highlighting (QGH) strategy. The QGH guides VSLNet to
search for matching video span within a highlighted region. Through extensive
experiments on three benchmark datasets, we show that the proposed VSLNet
outperforms the state-of-the-art methods, and that adopting a span-based QA
framework is a promising direction for solving NLVL.
Comment: To appear at ACL 202
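The span-based QA framing with query-guided highlighting can be pictured as start/end prediction over clip features restricted by a learned foreground gate. The sketch below is only an illustration; the layer sizes, concatenation fusion, and unconstrained argmax are simplifications, not the published VSLNet:

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.highlight = nn.Linear(2 * dim, 1)   # query-guided highlighting gate
        self.start_head = nn.Linear(2 * dim, 1)
        self.end_head = nn.Linear(2 * dim, 1)

    def forward(self, video_feats, query_vec):
        # video_feats: (T, dim) clip features; query_vec: (dim,) sentence feature
        T = video_feats.size(0)
        fused = torch.cat([video_feats, query_vec.expand(T, -1)], dim=-1)
        h = torch.sigmoid(self.highlight(fused))   # per-clip foreground score
        gated = fused * h                           # search within the highlighted region
        start_logits = self.start_head(gated).squeeze(-1)
        end_logits = self.end_head(gated).squeeze(-1)
        # A real model trains the heads with cross-entropy and constrains end >= start.
        return start_logits.argmax().item(), end_logits.argmax().item()

model = SpanPredictor()
video = torch.randn(128, 256)   # 128 clips
query = torch.randn(256)
print(model(video, query))      # predicted (start, end) clip indices
```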
Weakly Supervised Video Moment Retrieval From Text Queries
A few recent methods have been proposed for text-to-video moment
retrieval using natural language queries, but they require full supervision during
training. However, acquiring a large number of training videos with temporal
boundary annotations for each text description is extremely time-consuming and
often not scalable. In order to cope with this issue, in this work, we
introduce the problem of learning from weak labels for the task of text to
video moment retrieval. The weak nature of the supervision is because, during
training, we only have access to the video-text pairs rather than the temporal
extent of the video to which different text descriptions relate. We propose a
joint visual-semantic embedding based framework that learns the notion of
relevant segments from video using only video-level sentence descriptions.
Specifically, our main idea is to utilize latent alignment between video frames
and sentence descriptions using Text-Guided Attention (TGA). TGA is then used
during the test phase to retrieve relevant moments. Experiments on two
benchmark datasets demonstrate that our method achieves comparable performance
to state-of-the-art fully supervised approaches.
Comment: Revised Table 1 on Page 6. A small bug related to rounding resulted
in a slightly improved score in the previous version. Our conclusion remains
the same after the update.
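As an illustration of how video-level supervision can still yield frame-level localization, the sketch below uses a text-guided attention layer whose weights are trained only through a video-sentence matching loss and are read out at test time. The dimensions, triplet margin, and pooling scheme are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

class TGA(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, frame_feats, text_vec):
        # frame_feats: (T, dim); text_vec: (dim,)
        T = frame_feats.size(0)
        pair = torch.cat([frame_feats, text_vec.expand(T, -1)], dim=-1)
        attn = torch.softmax(self.score(pair).squeeze(-1), dim=0)   # (T,) attention
        video_vec = (attn.unsqueeze(-1) * frame_feats).sum(dim=0)   # attended pooling
        return video_vec, attn

model = TGA()
frames, text = torch.randn(64, 256), torch.randn(256)
video_vec, attn = model(frames, text)

# Training uses only video-level pairs: pull the matched sentence close,
# push a mismatched sentence away (standard triplet loss).
neg_text = torch.randn(256)
loss = torch.relu(0.2 - torch.cosine_similarity(video_vec, text, dim=0)
                  + torch.cosine_similarity(video_vec, neg_text, dim=0))
print(loss.item())

# At test time, the highest-attention frames indicate the relevant moment.
print(attn.topk(5).indices)
```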
Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions
This paper presents a new task, the grounding of spatio-temporal identifying
descriptions in videos. Previous work suggests potential bias in existing
datasets and emphasizes the need for a new data creation schema to better model
linguistic structure. We introduce a new data collection scheme based on
grammatical constraints for surface realization to enable us to investigate the
problem of grounding spatio-temporal identifying descriptions in videos. We
then propose a two-stream modular attention network that learns and grounds
spatio-temporal identifying descriptions based on appearance and motion. We
show that motion modules help to ground motion-related words and also help the
appearance modules learn, because the modular design resolves task
interference between modules. Finally, we point to a future challenge: building a
robust system in which ground-truth visual annotations are replaced by an automatic
video object detector and temporal event localization.
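One way to picture the appearance/motion decomposition is a two-stream scorer in which a text-dependent gate decides how much each stream contributes to grounding a candidate region. The following sketch is a hypothetical illustration; the bilinear scorers, scalar gate, and feature sizes are all assumptions:

```python
import torch
import torch.nn as nn

class TwoStreamGrounder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.app_score = nn.Bilinear(dim, dim, 1)   # appearance stream
        self.mot_score = nn.Bilinear(dim, dim, 1)   # motion stream
        self.gate = nn.Linear(dim, 1)               # how "motion-like" the description is

    def forward(self, app_feats, mot_feats, text_vec):
        # app_feats, mot_feats: (N, dim) per-candidate features; text_vec: (dim,)
        N = app_feats.size(0)
        t = text_vec.expand(N, -1)
        a = self.app_score(app_feats, t).squeeze(-1)
        m = self.mot_score(mot_feats, t).squeeze(-1)
        g = torch.sigmoid(self.gate(text_vec))      # gate in (0, 1)
        # Blend the two streams and return the best-matching candidate index.
        return ((1 - g) * a + g * m).argmax().item()

model = TwoStreamGrounder()
print(model(torch.randn(10, 128), torch.randn(10, 128), torch.randn(128)))
```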
TVQA+: Spatio-Temporal Grounding for Video Question Answering
We present the task of Spatio-Temporal Video Question Answering, which
requires intelligent systems to simultaneously retrieve relevant moments and
detect referenced visual concepts (people and objects) to answer natural
language questions about videos. We first augment the TVQA dataset with 310.8K
bounding boxes, linking depicted objects to visual concepts in questions and
answers. We name this augmented version TVQA+. We then propose
Spatio-Temporal Answerer with Grounded Evidence (STAGE), a unified framework
that grounds evidence in both spatial and temporal domains to answer questions
about videos. Comprehensive experiments and analyses demonstrate the
effectiveness of our framework and how the rich annotations in our TVQA+
dataset can contribute to the question answering task. Moreover, by performing
this joint task, our model is able to produce insightful and interpretable
spatio-temporal attention visualizations. Dataset and code are publicly
available at: http://tvqa.cs.unc.edu, https://github.com/jayleicn/TVQAplus
Comment: ACL 2020 camera-ready (15 pages
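A rough way to picture joint spatio-temporal grounding for QA is two nested attention steps, over boxes within each frame and over frames within the clip, whose attended feature feeds an answer classifier. The sketch below is an illustrative simplification, not the STAGE architecture; all shapes and heads are assumptions:

```python
import torch
import torch.nn as nn

class SpatioTemporalQA(nn.Module):
    def __init__(self, dim=128, num_answers=5):
        super().__init__()
        self.box_attn = nn.Linear(2 * dim, 1)     # spatial grounding over boxes
        self.frame_attn = nn.Linear(2 * dim, 1)   # temporal grounding over frames
        self.answer = nn.Linear(dim, num_answers)

    def forward(self, box_feats, question_vec):
        # box_feats: (T, B, dim) features of B candidate boxes in T frames
        T, B, dim = box_feats.shape
        q = question_vec.expand(T, B, dim)
        pair = torch.cat([box_feats, q], dim=-1)
        box_w = torch.softmax(self.box_attn(pair).squeeze(-1), dim=1)       # (T, B)
        frame_feats = (box_w.unsqueeze(-1) * box_feats).sum(dim=1)          # (T, dim)
        fpair = torch.cat([frame_feats, question_vec.expand(T, dim)], dim=-1)
        frame_w = torch.softmax(self.frame_attn(fpair).squeeze(-1), dim=0)  # (T,)
        pooled = (frame_w.unsqueeze(-1) * frame_feats).sum(dim=0)
        # The attention maps double as spatio-temporal evidence for the answer.
        return self.answer(pooled).argmax().item(), frame_w, box_w

model = SpatioTemporalQA()
print(model(torch.randn(30, 8, 128), torch.randn(128))[0])
```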
ExCL: Extractive Clip Localization Using Natural Language Descriptions
The task of retrieving clips within videos based on a given natural language
query requires cross-modal reasoning over multiple frames. Prior approaches
such as sliding window classifiers are inefficient, while text-clip similarity
driven ranking-based approaches such as segment proposal networks are far more
complicated. In order to select the most relevant video clip corresponding to
the given text description, we propose a novel extractive approach that
predicts the start and end frames by leveraging cross-modal interactions
between the text and video - this removes the need to retrieve and re-rank
multiple proposal segments. Using recurrent networks we encode the two
modalities into a joint representation which is then used in different variants
of start-end frame predictor networks. Through extensive experimentation and
ablative analysis, we demonstrate that our simple and elegant approach
significantly outperforms the state of the art on two datasets and has comparable
performance on a third.
Comment: Accepted at NAACL 2019, Short Paper
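The extractive idea of predicting start and end frames directly from a fused text-video representation can be sketched as follows; the recurrent encoders, concatenation fusion, and head sizes are assumptions, not the paper's exact predictor variants:

```python
import torch
import torch.nn as nn

class ExtractiveLocalizer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.video_rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.text_rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.start_head = nn.Linear(4 * dim, 1)
        self.end_head = nn.Linear(4 * dim, 1)

    def forward(self, video, text):
        # video: (1, T, dim) frame features; text: (1, L, dim) word features
        v, _ = self.video_rnn(video)                  # (1, T, 2*dim)
        _, h = self.text_rnn(text)                    # h: (2, 1, dim)
        q = torch.cat([h[0], h[1]], dim=-1)           # (1, 2*dim) query summary
        fused = torch.cat([v, q.unsqueeze(1).expand_as(v)], dim=-1)   # (1, T, 4*dim)
        start = self.start_head(fused).squeeze(-1).argmax(dim=-1)
        end = self.end_head(fused).squeeze(-1).argmax(dim=-1)
        # No proposal generation or re-ranking: the span is read off directly.
        return start.item(), end.item()

model = ExtractiveLocalizer()
print(model(torch.randn(1, 200, 128), torch.randn(1, 12, 128)))
```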
Modularized Textual Grounding for Counterfactual Resilience
Computer Vision applications often require a textual grounding module with
precision, interpretability, and resilience to counterfactual inputs/queries.
To achieve high grounding precision, current textual grounding methods heavily
rely on large-scale training data with manual annotations at the pixel level.
Such annotations are expensive to obtain and thus severely narrow the model's
scope of real-world applications. Moreover, most of these methods sacrifice
interpretability, generalizability, and they neglect the importance of being
resilient to counterfactual inputs. To address these issues, we propose a
visual grounding system which is 1) end-to-end trainable in a weakly supervised
fashion with only image-level annotations, and 2) counterfactually resilient
owing to the modular design. Specifically, we decompose textual descriptions
into three levels: entity, semantic attribute, and color information, and perform
compositional grounding progressively. We validate our model through a series
of experiments and demonstrate its improvement over the state-of-the-art
methods. In particular, our model not only surpasses other
weakly/un-supervised methods and even approaches the strongly supervised ones,
but is also interpretable in its decision making and performs much better than
all the others in the face of counterfactual classes.
Comment: 13 pages, 12 figures, IEEE Conference on Computer Vision and Pattern
Recognition, 201
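The decomposition into entity, attribute, and color levels with progressive compositional grounding can be illustrated with a toy pipeline; the word lists, parsing, and scoring functions below are hypothetical stand-ins for the learned modules:

```python
# Toy decomposition of a referring phrase into three levels, grounded progressively.
COLORS = {"red", "blue", "green", "black", "white", "yellow"}
ATTRIBUTES = {"tall", "small", "large", "striped", "wooden"}

def decompose(phrase):
    words = phrase.lower().split()
    colors = [w for w in words if w in COLORS]
    attrs = [w for w in words if w in ATTRIBUTES]
    entity = [w for w in words if w not in COLORS and w not in ATTRIBUTES]
    return entity, attrs, colors

def ground(phrase, regions, entity_score, attr_score, color_score):
    # Score each candidate region with the modules that apply, then combine.
    entity, attrs, colors = decompose(phrase)
    scores = []
    for region in regions:
        s = entity_score(region, entity)
        if attrs:
            s *= attr_score(region, attrs)
        if colors:
            s *= color_score(region, colors)
        scores.append(s)
    return scores.index(max(scores))

# Usage with dummy per-module scorers (stand-ins for learned networks).
regions = ["r0", "r1", "r2"]
dummy = lambda region, words: 1.0 if region == "r1" else 0.5
print(ground("small red ball", regions, dummy, dummy, dummy))   # -> 1
```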
Progressive Localization Networks for Language-based Moment Localization
This paper targets the task of language-based moment localization. The
language-based setting of this task allows for an open set of target
activities, resulting in a large variation of the temporal lengths of video
moments. Most existing methods prefer to first sample sufficient candidate
moments with various temporal lengths, and then match them with the given query
to determine the target moment. However, candidate moments generated with a
fixed temporal granularity may be suboptimal for handling the large variation in
moment lengths. To this end, we propose a novel multi-stage Progressive
Localization Network (PLN) which progressively localizes the target moment in a
coarse-to-fine manner. Specifically, each stage of PLN has a localization
branch, and focuses on candidate moments that are generated with a specific
temporal granularity. The temporal granularities of candidate moments are
different across the stages. Moreover, we devise a conditional feature
manipulation module and an upsampling connection to bridge the multiple
localization branches. In this fashion, the later stages are able to absorb the
previously learned information, thus facilitating more fine-grained
localization. Extensive experiments on three public datasets demonstrate the
effectiveness of our proposed PLN for language-based moment localization and
its potential for localizing short moments in long videos.
Comment: 12 pages
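The coarse-to-fine idea of scoring candidates at progressively finer temporal granularities can be sketched as a simple multi-stage search; the granularity schedule, the way later stages reuse earlier results, and the toy scoring function are assumptions, not the PLN architecture:

```python
# Coarse-to-fine moment search: each stage uses a finer temporal granularity
# and refines around the best candidate from the previous stage.

def candidates(lo, hi, granularity):
    # All [start, end) windows aligned to the given granularity.
    return [(s, min(s + granularity, hi)) for s in range(lo, hi, granularity)]

def coarse_to_fine(video_len, score, granularities=(64, 16, 4)):
    lo, hi = 0, video_len
    best = (lo, hi)
    for g in granularities:
        best = max(candidates(lo, hi, g), key=score)
        # Later stages absorb the earlier result by searching around it.
        lo, hi = max(best[0] - g, 0), min(best[1] + g, video_len)
    return best

# Toy query-relevance score peaking around frame 215.
score = lambda seg: -abs((seg[0] + seg[1]) / 2 - 215)
print(coarse_to_fine(1024, score))
```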
Text-based Localization of Moments in a Video Corpus
Prior works on text-based video moment localization focus on temporally
grounding the textual query in an untrimmed video. These works assume that the
relevant video is already known and attempt to localize the moment on that
relevant video only. Different from such works, we relax this assumption and
address the task of localizing moments in a corpus of videos for a given
sentence query. This task poses a unique challenge as the system is required to
perform: (i) retrieval of the relevant video where only a segment of the video
corresponds to the queried sentence, and (ii) temporal localization of the moment
in the relevant video based on the sentence query. Towards overcoming this
challenge, we propose Hierarchical Moment Alignment Network (HMAN) which learns
an effective joint embedding space for moments and sentences. In addition to
learning subtle differences between intra-video moments, HMAN focuses on
distinguishing inter-video global semantic concepts based on sentence queries.
Qualitative and quantitative results on three benchmark text-based video moment
retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions -
demonstrate that our method achieves promising performance on the proposed task
of temporal localization of moments in a corpus of videos.
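The two ranking signals the abstract describes, separating moments within a video and separating videos by global semantics, can be written as a pair of triplet terms over a shared embedding space. The margins and the use of cosine similarity below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def triplet(anchor, pos, neg, margin):
    # Standard margin-based ranking term in a joint embedding space.
    return F.relu(margin - F.cosine_similarity(anchor, pos, dim=0)
                  + F.cosine_similarity(anchor, neg, dim=0))

def hman_style_loss(query_emb, pos_moment, neg_moment_same_video, neg_moment_other_video):
    # Intra-video term: pick the right moment among moments of the correct video.
    intra = triplet(query_emb, pos_moment, neg_moment_same_video, margin=0.1)
    # Inter-video term: separate the correct video from moments of other videos.
    inter = triplet(query_emb, pos_moment, neg_moment_other_video, margin=0.2)
    return intra + inter

q, p, n_same, n_other = (torch.randn(256) for _ in range(4))
print(hman_style_loss(q, p, n_same, n_other))
```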
Temporal Localization of Moments in Video Collections with Natural Language
In this paper, we introduce the task of retrieving relevant video moments
from a large corpus of untrimmed, unsegmented videos given a natural language
query. Our task poses unique challenges as a system must efficiently identify
both the relevant videos and localize the relevant moments in the videos. This
task is in contrast to prior work that localizes relevant moments in a single
video or searches a large collection of already-segmented videos. For our task,
we introduce Clip Alignment with Language (CAL), a model that aligns features
for a natural language query to a sequence of short video clips that compose a
candidate moment in a video. Our approach goes beyond prior work that
aggregates video features over a candidate moment by allowing for finer clip
alignment. Moreover, our approach is amenable to efficient indexing of the
resulting clip-level representations, which makes it suitable for moment
localization in large video collections. We evaluate our approach on three
recently proposed datasets for temporal localization of moments in video with
natural language extended to our video corpus moment retrieval setting: DiDeMo,
Charades-STA, and ActivityNet-captions. We show that our CAL model outperforms
the recently proposed Moment Context Network (MCN) on all criteria across all
datasets on our proposed task, obtaining an 8%-85% and 11%-47% boost in
average recall and median rank, respectively, and achieving 5x faster retrieval
and an 8x smaller index size on a 500K video corpus.
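The clip-alignment idea, scoring a candidate moment by averaging per-clip similarities to the query instead of pooling clip features first, lends itself to precomputed, indexable clip embeddings. A minimal sketch under those assumptions (encoders and indexing details omitted):

```python
import torch
import torch.nn.functional as F

def moment_score(clip_embs, query_emb):
    # clip_embs: (k, dim) embeddings of the short clips composing the moment.
    sims = F.cosine_similarity(clip_embs, query_emb.unsqueeze(0), dim=-1)
    return sims.mean()   # finer clip-level alignment, not feature aggregation

def retrieve(corpus, query_emb, top=5):
    # corpus: list of (video_id, start, end, clip_embs) candidate moments whose
    # clip embeddings could be precomputed and indexed offline.
    scored = [(moment_score(c, query_emb).item(), vid, s, e)
              for vid, s, e, c in corpus]
    return sorted(scored, key=lambda x: x[0], reverse=True)[:top]

corpus = [(f"vid{i}", 0, 10, torch.randn(4, 256)) for i in range(100)]
print(retrieve(corpus, torch.randn(256), top=3))
```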