CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos
Temporal action localization is an important yet challenging problem. Given a
long, untrimmed video consisting of multiple action instances and complex
background contents, we need not only to recognize their action categories, but
also to localize the start time and end time of each instance. Many
state-of-the-art systems use segment-level classifiers to select and rank
proposal segments of pre-determined boundaries. However, a desirable model
should move beyond segment-level and make dense predictions at a fine
granularity in time to determine precise temporal boundaries. To this end, we
design a novel Convolutional-De-Convolutional (CDC) network that places CDC
filters on top of 3D ConvNets, which have been shown to be effective for
abstracting action semantics but reduce the temporal length of the input data.
The proposed CDC filter performs the required temporal upsampling and spatial
downsampling operations simultaneously to predict actions at the frame-level
granularity. It is unique in jointly modeling action semantics in space-time
and fine-grained temporal dynamics. We train the CDC network in an end-to-end
manner efficiently. Our model not only achieves superior performance in
detecting actions in every frame, but also significantly boosts the precision
of localizing temporal boundaries. Finally, the CDC network demonstrates a very
high efficiency with the ability to process 500 frames per second on a single
GPU server. We will update the camera-ready version and publish the source
code online soon.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
201
Joint learning of object and action detectors
While most existing approaches for detection in videos focus on objects or human actions separately, we aim at jointly detecting objects performing actions, such as cat eating or dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both tasks of object and action detection benefit from this joint learning. Moreover, the proposed architecture can be used for zero-shot learning of actions: our multitask objective leverages the commonalities of an action performed by different objects, e.g. dog and cat jumping, making it possible to detect actions of an object without training on these object-action pairs. In experiments on the A2D dataset [50], we obtain state-of-the-art results on segmentation of object-action pairs. We finally apply our multitask architecture to detect visual relationships between objects in images of the VRD dataset [24].
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
Action Recognition by Hierarchical Mid-level Action Elements
Realistic videos of human actions exhibit rich spatiotemporal structures at
multiple levels of granularity: an action can always be decomposed into
multiple finer-grained elements in both space and time. To capture this
intuition, we propose to represent videos by a hierarchy of mid-level action
elements (MAEs), where each MAE corresponds to an action-related spatiotemporal
segment in the video. We introduce an unsupervised method to generate this
representation from videos. Our method is capable of distinguishing
action-related segments from background segments and representing actions at
multiple spatiotemporal resolutions. Given a set of spatiotemporal segments
generated from the training data, we introduce a discriminative clustering
algorithm that automatically discovers MAEs at multiple levels of granularity.
We develop structured models that capture a rich set of spatial, temporal and
hierarchical relations among the segments, where the action label and multiple
levels of MAE labels are jointly inferred. The proposed model achieves
state-of-the-art performance in multiple action recognition benchmarks.
Moreover, we demonstrate the effectiveness of our model in real-world
applications such as action recognition in large-scale untrimmed videos and
action parsing.