Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization
State-of-the-art temporal action detectors inefficiently search the entire
video for specific actions. Despite the encouraging progress these methods
achieve, it is crucial to design automated approaches that only explore parts
of the video which are the most relevant to the actions being searched for. To
address this need, we propose the new problem of action spotting in video,
which we define as finding a specific action in a video while observing a small
portion of that video. Inspired by the observation that humans are extremely
efficient and accurate in spotting and finding action instances in video, we
propose Action Search, a novel Recurrent Neural Network approach that mimics
the way humans spot actions. Moreover, to address the absence of data recording
the behavior of human annotators, we put forward the Human Searches dataset,
which compiles the search sequences employed by human annotators spotting
actions in the AVA and THUMOS14 datasets. We consider temporal action
localization as an application of the action spotting problem. Experiments on
the THUMOS14 dataset reveal that our model is not only able to explore the
video efficiently (observing on average 17.3% of the video) but it also
accurately finds human activities with 30.8% mAP.
Comment: Accepted to ECCV 201
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
The task of video grounding, which temporally localizes a natural language
description in a video, plays an important role in understanding videos.
Existing studies either slide a window over the entire video or exhaustively
rank all possible clip-sentence pairs in a pre-segmented video, and both
strategies inevitably pay the cost of enumerating every candidate.
To alleviate this problem, we formulate this task as a problem of
sequential decision making by learning an agent which regulates the temporal
grounding boundaries progressively based on its policy. Specifically, we
propose a reinforcement-learning-based framework improved by multi-task
learning, which shows steady performance gains by considering additional
supervised boundary information during training. Our proposed framework
achieves state-of-the-art performance on the ActivityNet'18 DenseCaption
and Charades-STA datasets while observing only 10 or fewer clips per video.
Comment: AAAI 201
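The sequential formulation in this abstract can be illustrated with a minimal sketch in which a temporal window is adjusted step by step and scored against a reference segment by temporal IoU. The abstract does not list the agent's actions or its reward, so the action names in `ACTIONS`, the helper `apply_action`, and the fixed `step` size are all hypothetical; no learned policy is included here.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical action set: (direction for start, direction for end).
ACTIONS = {
    "move_start_left": (-1.0, 0.0),
    "move_start_right": (1.0, 0.0),
    "move_end_left": (0.0, -1.0),
    "move_end_right": (0.0, 1.0),
}


def apply_action(window, action, step, duration):
    """Move one boundary by `step` seconds, clamped to [0, duration]."""
    d_start, d_end = ACTIONS[action]
    start = min(max(window[0] + d_start * step, 0.0), duration)
    end = min(max(window[1] + d_end * step, 0.0), duration)
    if end <= start:
        return window  # reject moves that would collapse the window
    return (start, end)


truth = (4.0, 8.0)
window = (0.0, 10.0)
scores = [temporal_iou(window, truth)]
for action in ["move_start_right", "move_end_left"]:
    window = apply_action(window, action, step=2.0, duration=20.0)
    scores.append(temporal_iou(window, truth))
```

In a reinforcement learning setting, the change in IoU after each action would be one plausible reward signal, though that choice is an assumption rather than something the abstract states.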
Recurrent Models of Visual Attention
Applying convolutional neural networks to large images is computationally
expensive because the amount of computation scales linearly with the number of
image pixels. We present a novel recurrent neural network model that is capable
of extracting information from an image or video by adaptively selecting a
sequence of regions or locations and only processing the selected regions at
high resolution. Like convolutional neural networks, the proposed model has a
degree of translation invariance built-in, but the amount of computation it
performs can be controlled independently of the input image size. While the
model is non-differentiable, it can be trained using reinforcement learning
methods to learn task-specific policies. We evaluate our model on several image
classification tasks, where it significantly outperforms a convolutional neural
network baseline on cluttered images, and on a dynamic visual control problem,
where it learns to track a simple object without an explicit training signal
for doing so.
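The reason the computation can be decoupled from the image size is that each step processes only a fixed-size region. A minimal sketch of that idea follows, assuming a single-resolution square glimpse with zero padding at the borders; the function `extract_glimpse` is hypothetical, and the learned location policy described in the abstract is not modelled.

```python
def extract_glimpse(image, center, size):
    """Return a size x size patch centred at `center` (row, col).

    Pixels that fall outside the image are filled with zeros, so the
    patch always has the same shape regardless of the image size.
    """
    rows, cols = len(image), len(image[0])
    top = center[0] - size // 2
    left = center[1] - size // 2
    patch = []
    for r in range(top, top + size):
        row = []
        for c in range(left, left + size):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(image[r][c] if inside else 0)
        patch.append(row)
    return patch


# Toy 6x6 "image" whose pixel value encodes its position (row * 10 + col).
image = [[r * 10 + c for c in range(6)] for r in range(6)]
corner = extract_glimpse(image, center=(0, 0), size=3)
middle = extract_glimpse(image, center=(3, 3), size=3)
```

Each call touches `size * size` pixels whether the image is 6x6 or far larger, which is the property the abstract refers to.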
Temporal activity detection in untrimmed videos with recurrent neural networks
This work proposes a simple pipeline to classify and temporally localize activities in untrimmed videos. Our system uses features from a 3D Convolutional Neural Network (C3D) as input to train a recurrent neural network (RNN) that learns to classify video clips of 16 frames. After clip prediction, we post-process the output of the RNN to assign a single activity label to each video, and determine the temporal boundaries of the activity within the video. We show how our system can achieve competitive results in both tasks with a simple architecture. We evaluate our method in the ActivityNet Challenge 2016, achieving a 0.5874 mAP and a 0.2237 mAP in the classification and detection tasks, respectively. Our code and models are publicly available at: https://imatge-upc.github.io/activitynet-2016-cvprw/
Peer Reviewed. Postprint (published version)
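The post-processing step described in this abstract can be sketched as follows. The abstract does not say how the video label or the boundaries are derived, so the majority vote in `video_label` and the run-merging in `temporal_segments` are assumptions made for illustration, as are the function names and the `"background"` class; only the 16-frame clip length comes from the text.

```python
from collections import Counter
from itertools import groupby

CLIP_LEN = 16  # frames per clip, as stated in the abstract


def video_label(clip_labels, background="background"):
    """Pick the most frequent non-background clip label (assumed rule)."""
    counts = Counter(label for label in clip_labels if label != background)
    return counts.most_common(1)[0][0] if counts else background


def temporal_segments(clip_labels, target, fps):
    """Merge consecutive clips labelled `target` into (start_s, end_s) spans."""
    segments = []
    index = 0  # index of the first clip in the current run
    for label, run in groupby(clip_labels):
        length = len(list(run))
        if label == target:
            start = index * CLIP_LEN / fps
            end = (index + length) * CLIP_LEN / fps
            segments.append((start, end))
        index += length
    return segments


# Toy per-clip predictions standing in for the RNN output.
clips = ["background", "surfing", "surfing", "background", "surfing"]
label = video_label(clips)
segments = temporal_segments(clips, label, fps=16.0)
```

At 16 frames per second each clip lasts one second, so the two runs of `"surfing"` clips become the spans 1 to 3 seconds and 4 to 5 seconds.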