Action recognition in video using a spatial-temporal graph-based feature representation
We propose a video-graph-based human action recognition framework. Given an
input video sequence, we extract spatio-temporal local features and construct a
video graph that incorporates appearance and motion constraints, reflecting the
spatio-temporal dependencies among the features. In particular, we extend the
popular DBSCAN density-based clustering algorithm to form an intuitive video
graph. During training, we estimate a linear SVM classifier using the standard
bag-of-words method. During classification, we apply graph-cut optimization to
find the most frequent action label in the constructed graph and assign this
label to the test video sequence. The proposed approach achieves
state-of-the-art performance on standard human action recognition benchmarks,
namely the KTH and UCF-Sports datasets, and competitive results on the
Hollywood (HOHA) dataset.
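The abstract above names a pipeline of DBSCAN clustering over spatio-temporal features, bag-of-words encoding, and a linear SVM. Below is a minimal sketch of those stages using scikit-learn and synthetic data; the shapes, parameters, and the bow_histogram helper are illustrative assumptions, and the graph construction and graph-cut labeling steps are omitted.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.svm import LinearSVC

# Synthetic stand-ins for spatio-temporal local features: one (x, y, t)
# location per detected interest point (descriptors omitted for brevity).
rng = np.random.default_rng(0)
locations = rng.normal(size=(500, 3))

# Cluster feature locations with DBSCAN; clusters act as video-graph nodes,
# and edges would link clusters that are close in space-time (omitted here).
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(locations)
n_words = max(int(labels.max()) + 1, 1)   # DBSCAN labels noise points as -1

def bow_histogram(assignments, n_words):
    """Normalized bag-of-words histogram over cluster assignments."""
    hist = np.bincount(assignments[assignments >= 0], minlength=n_words)
    return hist / max(hist.sum(), 1)

# Pretend the 500 features came from 10 videos; in a real pipeline each
# training video is encoded and labeled separately.
videos = np.array_split(labels, 10)
X = np.stack([bow_histogram(v, n_words) for v in videos])
y = np.arange(len(X)) % 2                 # placeholder binary action labels
clf = LinearSVC().fit(X, y)
print(clf.predict(X[:3]))
```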
Skeleton-based Action Recognition of People Handling Objects
In visual surveillance systems, it is necessary to recognize the behavior of
people handling objects such as a phone, a cup, or a plastic bag. In this
paper, to address this problem, we propose a new framework for recognizing
object-related human actions by graph convolutional networks using human and
object poses. In this framework, we construct skeletal graphs of reliable human
poses by selectively sampling the informative frames in a video, which include
human joints with high confidence scores obtained in pose estimation. The
skeletal graphs generated from the sampled frames represent human poses related
to the object position in both the spatial and temporal domains, and these
graphs are used as inputs to the graph convolutional networks. Through
experiments on an open benchmark and our own datasets, we verify the
validity of our framework, showing that our method outperforms the
state-of-the-art method for skeleton-based action recognition.
Comment: Accepted in WACV 2019
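The framework above builds per-frame skeletal graphs from confidently estimated poses plus a detected object, then feeds them to graph convolutional networks. A minimal NumPy sketch of the graph-construction side follows; the toy skeleton, the confidence threshold, and the hand-object edge are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

# Hypothetical inputs (names and shapes are illustrative, not the paper's
# API): per-frame poses with (x, y, confidence) per joint, and a detected
# object position per frame (e.g. a phone center from an object detector).
rng = np.random.default_rng(1)
T, J = 30, 5                        # frames, joints in a toy skeleton
poses = rng.random((T, J, 3))       # (x, y, conf) per joint
obj_pos = rng.random((T, 2))        # object (x, y) per frame

# 1) Selective frame sampling: keep frames whose mean joint confidence
#    clears a threshold, so only reliable poses enter the graphs.
keep = poses[:, :, 2].mean(axis=1) > 0.5
frames, obj_kept = poses[keep], obj_pos[keep]

# 2) Per-frame graph: J joint nodes plus one object node (index J).
#    Edges follow the toy skeleton; a "hand" joint links to the object.
bones = [(0, 1), (1, 2), (1, 3), (3, 4)]   # toy bone list (assumption)
HAND, N = 4, J + 1
A = np.zeros((N, N))
for i, j in bones:
    A[i, j] = A[j, i] = 1.0
A[HAND, J] = A[J, HAND] = 1.0              # hand-object edge
A += np.eye(N)                             # self-loops, as usual in GCNs

# 3) One symmetric-normalized GCN propagation step over node coordinates;
#    a real model stacks learned layers and adds temporal edges.
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = d_inv_sqrt @ A @ d_inv_sqrt
for f, o in zip(frames, obj_kept):
    X = np.vstack([f[:, :2], o])           # (N, 2) node features
    H = A_hat @ X                          # aggregated neighbor features
print(f"kept {keep.sum()} of {T} frames")
```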
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
Many human activities take minutes to unfold. To represent them, related
works opt for statistical pooling, which neglects the temporal structure.
Others opt for convolutional methods, such as CNN and Non-Local. While
successful in learning temporal concepts, they fall short of modeling
minutes-long temporal dependencies. We propose VideoGraph, a method that
achieves the best of both worlds: it represents minutes-long human activities
and learns their underlying
temporal structure. VideoGraph learns a graph-based representation for human
activities. The graph, including its nodes and edges, is learned entirely from
video datasets, making VideoGraph applicable to problems without node-level
annotation. The result is an improvement over related works on two benchmarks:
Epic-Kitchen and Breakfast. In addition, we demonstrate that VideoGraph is able to
learn the temporal structure of human activities in minutes-long videos.
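Since VideoGraph learns its graph without node-level annotation, one rough way to picture the mechanism is soft attention between per-segment video features and a set of latent node embeddings. The NumPy sketch below is built on that assumption; the dimensions, names, and the co-activation edge construction are illustrative, not the paper's exact formulation.

```python
import numpy as np

# Latent "node" embeddings stand in for the learned graph nodes; in the
# real model they are trained end-to-end, here they are random placeholders.
rng = np.random.default_rng(2)
T, D, N = 64, 128, 16                  # segments, feature dim, latent nodes
segments = rng.normal(size=(T, D))     # per-segment CNN features (stand-in)
nodes = rng.normal(size=(N, D))        # latent node embeddings (stand-in)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Soft attention of each video segment over the latent nodes; no node-level
# annotation is needed because the assignment comes from similarity alone.
attn = softmax(segments @ nodes.T / np.sqrt(D), axis=1)   # (T, N)

# Each node's activation timeline; node co-activation over time gives a
# crude (N, N) edge structure capturing long-range temporal dependencies.
timelines = attn.T                     # (N, T)
edges = timelines @ timelines.T        # (N, N)
print(attn.shape, edges.shape)
```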