7,234 research outputs found
Action Tubelet Detector for Spatio-Temporal Action Localization
Current state-of-the-art approaches for spatio-temporal action localization
rely on detections at the frame level that are then linked or tracked across
time. In this paper, we leverage the temporal continuity of videos instead of
operating at the frame level. We propose the ACtion Tubelet detector
(ACT-detector) that takes as input a sequence of frames and outputs tubelets,
i.e., sequences of bounding boxes with associated scores. The same way
state-of-the-art object detectors rely on anchor boxes, our ACT-detector is
based on anchor cuboids. We build upon the SSD framework. Convolutional
features are extracted for each frame, while scores and regressions are based
on the temporal stacking of these features, thus exploiting information from a
sequence. Our experimental results show that leveraging sequences of frames
significantly improves detection performance over using individual frames. The
gain of our tubelet detector can be explained by both more accurate scores and
more precise localization. Our ACT-detector outperforms the state-of-the-art
methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in
particular at high overlap thresholds.Comment: 9 page
Skeleton-based Action Recognition of People Handling Objects
In visual surveillance systems, it is necessary to recognize the behavior of
people handling objects such as a phone, a cup, or a plastic bag. In this
paper, to address this problem, we propose a new framework for recognizing
object-related human actions by graph convolutional networks using human and
object poses. In this framework, we construct skeletal graphs of reliable human
poses by selectively sampling the informative frames in a video, which include
human joints with high confidence scores obtained in pose estimation. The
skeletal graphs generated from the sampled frames represent human poses related
to the object position in both the spatial and temporal domains, and these
graphs are used as inputs to the graph convolutional networks. Through
experiments over an open benchmark and our own data sets, we verify the
validity of our framework in that our method outperforms the state-of-the-art
method for skeleton-based action recognition.Comment: Accepted in WACV 201
A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
Recently, Minimum Cost Multicut Formulations have been proposed and proven to
be successful in both motion trajectory segmentation and multi-target tracking
scenarios. Both tasks benefit from decomposing a graphical model into an
optimal number of connected components based on attractive and repulsive
pairwise terms. The two tasks are formulated on different levels of granularity
and, accordingly, leverage mostly local information for motion segmentation and
mostly high-level information for multi-target tracking. In this paper we argue
that point trajectories and their local relationships can contribute to the
high-level task of multi-target tracking and also argue that high-level cues
from object detection and tracking are helpful to solve motion segmentation. We
propose a joint graphical model for point trajectories and object detections
whose Multicuts are solutions to motion segmentation {\it and} multi-target
tracking problems at once. Results on the FBMS59 motion segmentation benchmark
as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark
demonstrate the promise of this joint approach
Multitemporal Very High Resolution from Space: Outcome of the 2016 IEEE GRSS Data Fusion Contest
In this paper, the scientific outcomes of the 2016 Data Fusion Contest organized by the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society are discussed. The 2016 Contest was an open topic competition based on a multitemporal and multimodal dataset, which included a temporal pair of very high resolution panchromatic and multispectral Deimos-2 images and a video captured by the Iris camera on-board the International Space Station. The problems addressed and the techniques proposed by the participants to the Contest spanned across a rather broad range of topics, and mixed ideas and methodologies from the remote sensing, video processing, and computer vision. In particular, the winning team developed a deep learning method to jointly address spatial scene labeling and temporal activity modeling using the available image and video data. The second place team proposed a random field model to simultaneously perform coregistration of multitemporal data, semantic segmentation, and change detection. The methodological key ideas of both these approaches and the main results of the corresponding experimental validation are discussed in this paper
Going Deeper into Action Recognition: A Survey
Understanding human actions in visual data is tied to advances in
complementary research areas including object recognition, human dynamics,
domain adaptation and semantic segmentation. Over the last decade, human action
analysis evolved from earlier schemes that are often limited to controlled
environments to nowadays advanced solutions that can learn from millions of
videos and apply to almost all daily activities. Given the broad range of
applications from video surveillance to human-computer interaction, scientific
milestones in action recognition are achieved more rapidly, eventually leading
to the demise of what used to be good in a short time. This motivated us to
provide a comprehensive review of the notable steps taken towards recognizing
human actions. To this end, we start our discussion with the pioneering methods
that use handcrafted representations, and then, navigate into the realm of deep
learning based approaches. We aim to remain objective throughout this survey,
touching upon encouraging improvements as well as inevitable fallbacks, in the
hope of raising fresh questions and motivating new research directions for the
reader
- …